Now it's search time but...yes, there's a but: sometimes the top results (10, 20, 30 or more) all belong to the same shop (or the same brand), even though other shops (or brands) sell that kind of product too.
For instance, a search for "shirt":
- returns 5438 results in 109 pages (60 results / page)
- the first 118 results (practically the first two pages) belong to the "C0C0BABE" brand
- other brands appear only from the 119th result onwards
What can we do? Results need to be sorted by score; any other sort criterion would necessarily compromise the computed relevancy.
Well, in this scenario, I discovered the Query Re-Ranking [1] capability of Solr. I know, it is not a new feature, it was introduced in Solr a long time ago...I had simply never met a scenario like this before ("Mater artium necessitas").
From the official Solr Reference Guide:
"Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance then just using the complex query B by itself"
The component interface is very simple. You need to provide three parameters:
- reRankQuery: the query that will be used for re-ranking;
- reRankDocs: the (minimum) number of top N results to re-rank; Solr may increase that number during re-ranking;
- reRankWeight: a multiplicative factor applied to the reRankQuery score of each document that matches the reRankQuery and, at the same time, belongs to the top reRankDocs set. For each of them, that weighted score is added to the original score of the document (i.e. the score resulting from the main query).
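To make the arithmetic concrete, here is a minimal Python sketch of how the final score is put together according to the description above. It is a toy model, not Solr's implementation: the rerank function, the document ids and the scores are all made up for illustration.

```python
def rerank(main_hits, rerank_scores, rerank_docs, rerank_weight):
    """Toy model of the re-ranking step (not Solr's actual code).

    main_hits: list of (doc_id, score), sorted by descending main-query score.
    rerank_scores: dict doc_id -> score obtained from the reRankQuery;
                   a missing id means the document did not match it.
    """
    top, rest = main_hits[:rerank_docs], main_hits[rerank_docs:]
    # final score = original score + reRankWeight * reRankQuery score
    boosted = [
        (doc_id, score + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, score in top
    ]
    # Only the top window is re-sorted; everything after it keeps its order.
    boosted.sort(key=lambda hit: hit[1], reverse=True)
    return boosted + rest


# Hypothetical scores: "a1" and "a2" share a brand, "b1" and "c1" do not.
hits = [("a1", 3.0), ("a2", 2.9), ("b1", 2.5), ("c1", 1.0)]
bonus = {"b1": 0.5, "c1": 5.0}
print([doc_id for doc_id, _ in rerank(hits, bonus, rerank_docs=3, rerank_weight=1.2)])
```

In this model "b1" climbs above "a1" and "a2" thanks to its bonus, while "c1" stays where it was even though it matches the reRankQuery, because it falls outside the top reRankDocs window.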
Then I copied the source field (prd_descr) into another searchable text field, "shuffler", with minimal text analysis (standard tokenization, lowercasing, word delimiter):
<field name="shuffler" type="unstemmed-text" indexed="true" .../>
<copyField src="prd_descr" dest="shuffler"/>
As a last step, I configured the request handler with the re-rank parameters as follows:
<str name="rqq">
{!lucene q.op=OR df=shuffler v=$rndq}
</str>
<str name="rq">
{!rerank reRankQuery=$rqq reRankDocs=220 reRankWeight=1.2}
</str>
As you can see, I'm using the plain Solr query parser to run a search on the "shuffler" field mentioned above. What about the $rndq parameter? That is the query itself, which should contain a (probably long) list of terms. I defined a default value like this:
<str name="rndq">
(just top bottom button style fashion up down chic elegance ... )
</str>
What is the goal here? The default operator of the query parser has been set to OR, so the reRankQuery gives each of the first reRankDocs documents a chance to collect an additional "bonus" score if its shuffler field contains one or (better) more of the terms provided in the $rndq parameter.
The default value, of course, will always be the same, but a client could provide its own $rndq parameter, with a different list of terms for each request.
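A client doing that could look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the vocabulary, the endpoint URL and the build_search_url name are hypothetical, and a real vocabulary would come from your own catalogue.

```python
import random
from urllib.parse import urlencode

# Hypothetical vocabulary of terms likely to occur in the "shuffler" field.
VOCABULARY = ["just", "top", "bottom", "button", "style",
              "fashion", "up", "down", "chic", "elegance"]


def build_search_url(base_url, user_query, how_many=5, rng=random):
    """Build a search URL carrying a freshly sampled rndq parameter."""
    # Pick a different subset of terms on each call.
    terms = rng.sample(VOCABULARY, how_many)
    params = {"q": user_query, "rndq": "(" + " ".join(terms) + ")"}
    # urlencode takes care of escaping spaces and parentheses.
    return base_url + "?" + urlencode(params)


# The endpoint below is a placeholder, not a real deployment.
print(build_search_url("http://localhost:8983/solr/products/select", "shirt"))
```

Since the request only overrides rndq, the rq and rqq values configured in the request handler stay untouched; only the list of "bonus" terms changes between requests.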
As for the other parameters (reRankWeight and reRankDocs), those are the values that work for me...you should run some tests with your own dataset and adjust them.
The whole thing is not precise, and not really deterministic...but it works ;)
-----------------------------------------------------
[1] https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking