Sunday, November 08, 2015

Shuffling top results in Solr with query re-ranking

You built a cool e-commerce portal on top of Apache Solr; brands and shops send you their data in CSV and you index everything with little effort: it is just a matter of a few commands (more than one, because the content of each CSV changes slightly between sources).

Now it's search time but... yes, there's a but: sometimes the top results (the first 10, 20, 30 or more) all belong to the same shop (or the same brand), even if other shops (or brands) sell that kind of product.

For instance, a search for "shirt"
  • returns 5438 results in 109 pages (60 results / page)
  • the first 118 results (the first two pages) belong to the "C0C0BABE" brand
  • starting from the 119th result, other brands appear
This could be a problem, because sooner or later the other brands will complain about this "hiding" issue: the impression rate of the third page is definitely lower than that of the first page. As a consequence, it looks as if your website sells only items from "C0C0BABE".

What can we do? Results need to be sorted by score; any other sort criterion would necessarily compromise the computed relevancy.

Well, in this scenario, I discovered the Query Re-Ranking [1] capability of Solr. I know, it is not a new feature: it was introduced in Solr a long time ago, but I had never met a scenario like this before ("Mater artium necessitas").

From the official Solr Reference Guide:

"Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance then just using the complex query B by itself"

The component interface is very simple. You need to provide three parameters:
  • reRankQuery: the query that will be used for re-ranking;
  • reRankDocs: the (minimum) number of top N results to re-rank; Solr may increase that number during the re-ranking;
  • reRankWeight: a multiplicative factor applied to the reRankQuery score of each document that matches the reRankQuery and, at the same time, belongs to the top reRankDocs set. For each of those documents, the weighted score is added to the original score of the document (i.e. the score resulting from the main query).
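To make the arithmetic concrete, here is a minimal sketch of the scoring described above. It is a simplified model written for illustration, not Solr's actual code: the function name, the document ids and the scores are all made up.

```python
def rerank(results, rerank_scores, rerank_docs, rerank_weight):
    """Simplified model of re-ranking (an illustration, not Solr's implementation).

    results: list of (doc_id, original_score), sorted by score descending.
    rerank_scores: dict doc_id -> score from the reRankQuery (absent = no match).
    """
    top, rest = results[:rerank_docs], results[rerank_docs:]
    # Only the top N documents get the weighted bonus added to their score.
    rescored = [
        (doc_id, score + rerank_weight * rerank_scores.get(doc_id, 0.0))
        for doc_id, score in top
    ]
    # The sort is stable, so documents with equal scores keep their original order.
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    # Documents outside the top N keep their original position and score.
    return rescored + rest


# Hypothetical documents and scores.
results = [("a", 3.0), ("b", 2.9), ("c", 2.8), ("d", 1.0)]
rerank_scores = {"c": 0.5, "d": 9.0}
reranked = rerank(results, rerank_scores, rerank_docs=3, rerank_weight=1.2)
print([doc_id for doc_id, _ in reranked])
```

In this toy run "c" jumps to the first position, while "d" stays last even though it matches the reRankQuery strongly, because it falls outside the top 3.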
Cool! But the actual question was: what should the reRankQuery be? I had to emulate a random behaviour, something like randomly querying a field with unstructured content. In the end that is exactly what I did: I saw in the schema an unstructured field, the product description, which contains free text.

Then I created a copy of that field in another searchable (text) field called "shuffler", with minimal text analysis (standard tokenization, lowercasing, word delimiter):
<field name="shuffler" type="unstemmed-text" indexed="true" .../>
<copyField src="prd_descr" dest="shuffler"/>
As the last step, I configured the request handler with the re-rank parameters as follows:
<str name="rqq">{!lucene q.op=OR df=shuffler v=$rndq}</str>
<str name="rq">{!rerank reRankQuery=$rqq reRankDocs=220 reRankWeight=1.2}</str>
As you can see, I'm using the standard Solr (lucene) query parser to execute a search on the "shuffler" field mentioned above. What about the $rndq parameter? That is the query itself, which should contain a (probably long) list of terms. I defined a default value like this:
<str name="rndq">(just top bottom button style fashion up down chic elegance ... )</str>
What is the goal here? The default operator of the query parser has been set to OR, so the reRankQuery gives each of the first reRankDocs documents a chance to collect an additional "bonus" score if its shuffler field contains one or (better) more of the terms provided in the $rndq parameter.

The default value, of course, will always be the same, but a client can provide its own $rndq parameter, with a different list of terms for each request.
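As a sketch of what such a client could look like, the snippet below picks a few random terms for every request and sends them as the rndq parameter. The host, the "products" core name, the /select handler path and the tiny vocabulary are assumptions made for the example; a real vocabulary would be much longer.

```python
import random
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical vocabulary, taken from the default value shown above.
VOCABULARY = ["just", "top", "bottom", "button", "style",
              "fashion", "up", "down", "chic", "elegance"]


def build_search_url(base_url, user_query, terms=5, rng=random):
    """Build a search URL whose rndq parameter holds a random sample of terms."""
    picked = rng.sample(VOCABULARY, terms)  # distinct terms, random order
    params = {"q": user_query, "rndq": "(" + " ".join(picked) + ")"}
    return base_url + "?" + urlencode(params)


# Assumed endpoint: adjust host, core and handler to your installation.
url = build_search_url("http://localhost:8983/solr/products/select", "shirt")
print(parse_qs(urlparse(url).query)["rndq"][0])
```

Each call produces a different list of terms, so the "bonus" is spread over different documents from one request to the next.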

As for the other parameters (reRankWeight and reRankDocs), those are the values that work for me; you should run some tests with your dataset and try / adjust them.
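One possible way to compare different values during those tests is to measure how long the longest run of consecutive results from the same brand is, before and after re-ranking. This is only a sketch: the metric is my own suggestion, and the brand lists below are toy data ("OtherBrand" and "ThirdBrand" are invented names).

```python
from itertools import groupby


def longest_brand_run(brands):
    """Length of the longest run of consecutive results sharing the same brand."""
    return max((len(list(group)) for _, group in groupby(brands)), default=0)


# Toy data: the brand of each result, in ranking order.
before = ["C0C0BABE"] * 6 + ["OtherBrand", "ThirdBrand"]
after = ["C0C0BABE", "OtherBrand", "C0C0BABE", "C0C0BABE",
         "ThirdBrand", "C0C0BABE", "C0C0BABE", "C0C0BABE"]
print(longest_brand_run(before), longest_brand_run(after))
```

A shorter run on the first pages suggests the chosen reRankDocs / reRankWeight pair is mixing the brands more effectively.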

The overall approach is not precise and not really deterministic... but it works ;)


