Tuesday, March 29, 2016

Randomizing top-n results in Solr

So, after shuffling the top-n search results returned by Solr a bit [1], you may want to effectively randomize them in a *non-repeatable* way. Why? I don't know...I'm just enjoying some coding experiments while I'm travelling :)

What I want to do is: run a query and (pseudo) randomly reorder the first top results. I will be using the query re-ranking feature again, but this time I need a re-ranking query that produces different results each time it is invoked.

I created a simple function [2] (i.e. a ValueSourceParser plus a ValueSource subclass) based on a (thread-local) java.util.Random instance, which simply returns a (pseudo) random number each time it is invoked.
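The gist [2] contains the actual code; just to give an idea, here is a minimal sketch of those two classes, assuming the Lucene/Solr 5.x function query APIs (everything except the parser package and class name configured below is illustrative):

package com.faearch.search.function;

import java.io.IOException;
import java.util.Map;
import java.util.Random;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.ValueSourceParser;

public class RandomValueSourceParser extends ValueSourceParser {
    @Override
    public ValueSource parse(final FunctionQParser fp) {
        return new RandomValueSource();
    }
}

class RandomValueSource extends ValueSource {
    // One Random per thread, so concurrent searches don't contend on the same instance.
    private static final ThreadLocal<Random> RANDOM = new ThreadLocal<Random>() {
        @Override
        protected Random initialValue() {
            return new Random();
        }
    };

    @Override
    public FunctionValues getValues(final Map context, final LeafReaderContext readerContext) throws IOException {
        return new FloatDocValues(this) {
            @Override
            public float floatVal(final int doc) {
                // A different value on every call: this is what makes the re-rank non-repeatable.
                return RANDOM.get().nextFloat();
            }
        };
    }

    @Override
    public boolean equals(final Object other) {
        return other instanceof RandomValueSource;
    }

    @Override
    public int hashCode() {
        return RandomValueSource.class.hashCode();
    }

    @Override
    public String description() {
        return "rnd()";
    }
}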

Once the two classes have been packaged in a jar, placed under the lib folder and configured in solrconfig.xml with the name rnd:

<valueSourceParser name="rnd" class="com.faearch.search.function.RandomValueSourceParser"/>

I only need to reference it in a re-rank query, using the boost query parser:

<requestHandler ...>
    <str name="rqq">{!boost b=rnd() v=$q}</str>
    <str name="rq">{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=1.2}</str>
...

You can now start Solr, index some documents, run the same query several times (by default ordered by score) and see what happens. Don't forget to include the score in the field list (fl) parameter; in this way you will see the concrete effect of the multiplicative random boost:

http://...?q=shoes&fl=score,*

1st time

<result name="response" numFound="2" start="0" maxScore="0.32487732">
  <doc>
    <str name="product_name">shoes B</str>
    <float name="score">0.32487732</float>
  </doc>
  <doc>
    <str name="product_name">shoes A</str>
    <float name="score">0.22645184</float>
  </doc>
</result>

2nd time (oops, that's the same order...don't worry, it's just randomness with only 2 indexed docs; look at the score values, which differ from the previous example)

<result name="response" numFound="2" start="0" maxScore="0.61873287">
  <doc>
    <str name="product_name">shoes B</str>
    <float name="score">0.61873287</float>
  </doc>
  <doc>
    <str name="product_name">shoes A</str>
    <float name="score">0.3067757</float>
  </doc>
</result>
  
3rd time

<result name="response" numFound="2" start="0" maxScore="0.24988756">
  <doc>
    <str name="product_name">shoes A</str>
    <float name="score">0.24988756</float>
  </doc>
  <doc>
    <str name="product_name">shoes B</str>
    <float name="score">0.22548665</float>
  </doc>
</result>

See you next time ;)

[1] http://andreagazzarini.blogspot.it/2015/11/shuffling-top-results-with-query-re.html
[2] https://gist.github.com/agazzarini/a802eff3b50c03fae2364458719be94e

Sunday, November 08, 2015

Shuffling top results in Solr with query re-ranking

You built a cool e-commerce portal on top of Apache Solr; brands and shops are sending you their data in CSV, and you index everything with little effort, just a matter of a few commands (more than one, as the content of each CSV changes slightly between sources).

Now it's search time but...yes, there's a but: sometimes the top results (10, 20, 30 or more) belong to the same shop (or the same brand), even if other shops (or brands) offer the same kind of product.

For instance, a search for "shirt"
  • returns 5438 results in 109 pages (60 results / page)
  • the first 118 results (the first two pages) belong to the "C0C0BABE" brand
  • starting from the 119th result, other brands appear
This could be a problem, because sooner or later other brands will complain about that "hiding" issue: the impression rate of the third page is definitely lower than that of the first page. As a consequence, it looks like your website is selling only items from "C0C0BABE".

What can we do? Results need to be sorted by score; any other sort criterion would necessarily compromise the computed relevancy.

Well, in this scenario I discovered the Query Re-Ranking [1] capability of Solr; I know, it is not a new feature, it was introduced in Solr a very long time ago...I had just never met a scenario like this before ("Mater artium necessitas").

From the official Solr Reference Guide:

"Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance then just using the complex query B by itself"

The component interface is very simple. You need to provide three parameters:
  • reRankQuery: this is the query that will be used for re-ranking;
  • reRankDocs: the (minimum) number of top N results to re-rank; Solr could increase that number during the re-ranking
  • reRankWeight: a multiplicative factor applied to the score that each document in the top reRankDocs set gets from the reRankQuery; for each such document, the resulting value is added to its original score (i.e. the score coming from the main query)
Cool! But the actual question remains: what about the reRankQuery? I needed to emulate a random behaviour, something like randomly querying a field with non-structured content. In the end, that's exactly what I did: I found in the schema a non-structured field, the product description, which contains free text.

Then I copied that field into another searchable "shuffler" (text) field, with minimal text analysis (standard tokenization, lowercasing, word delimiter):
<field name="shuffler" type="unstemmed-text" indexed="true" .../>
<copyField src="prd_descr" dest="shuffler"/>
Finally, I configured the request handler with the re-rank parameters as follows:
<str name="rqq">
    {!lucene q.op=OR df=shuffler v=$rndq}
</str>
<str name="rq">
     {!rerank reRankQuery=$rqq reRankDocs=220 reRankWeight=1.2}
</str> 
As you can see, I'm using the plain Lucene query parser for executing a search on the "shuffler" field mentioned above. What about the $rndq parameter? That is the query text, which should contain a (probably long) list of terms. I defined a default value like this:
<str name="rndq">
    (just top bottom button style fashion up down chic elegance ... )
</str> 
What is the goal here? The default operator of the query parser has been set to OR, so the reRankQuery gives the first reRankDocs documents a chance to collect an additional "bonus" score if their shuffler field contains one or (better) more of the terms provided in the $rndq parameter.

The default value, of course, will always be the same, but a client could provide its own $rndq parameter with a different list of terms for each request.
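For instance, a client-side helper (a purely illustrative sketch; the class name and the vocabulary are mine) could shuffle a small vocabulary and build a different rndq value on each search:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RndqBuilder {

    // A purely illustrative vocabulary; in a real client it would probably be much longer.
    private static final List<String> VOCABULARY = Arrays.asList(
            "just", "top", "bottom", "button", "style", "fashion",
            "up", "down", "chic", "elegance");

    // Returns something like "(chic up fashion)", different (almost) every time.
    public static String nextRndq(final int howManyTerms) {
        final List<String> terms = new ArrayList<>(VOCABULARY);
        Collections.shuffle(terms);

        final StringBuilder builder = new StringBuilder("(");
        for (final String term : terms.subList(0, Math.min(howManyTerms, terms.size()))) {
            builder.append(term).append(' ');
        }
        return builder.toString().trim() + ")";
    }
}

The returned string would then be sent (URL-encoded) as the rndq request parameter, overriding the default above.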

As for the other parameters (reRankWeight and reRankDocs), those are the values that work for me...you should run some tests with your dataset and adjust them accordingly.

The whole thing is not precise, not really deterministic...but it works ;)

-----------------------------------------------------

[1] https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking 
   

Saturday, October 17, 2015

How to do Integration tests with Solr 5.x

Please keep in mind that what is described below is valid only if you have a Solr instance with a single core. Thanks to +Alessandro Benedetti for alerting me about this sneaky detail ;)
I recently migrated a project [1] from Solr 4.x to Solr 5.x (actually Solr 5.3.1), and the only annoying part was a (small) refactoring of my integration test suite.

Previously, I always used the cool Maven Cargo plugin for starting and stopping Solr (ehm, Jetty with a solr.war deployed in it) before and after my suite. For those who are still using Solr 4.x, the configuration is here [2]. It is just a matter of a single command:

> mvn install

Unfortunately, Solr 5.x is no longer (formally) a web application, so I had to find another way to run the integration suite. After googling a bit without finding a solution, I asked myself: "How do Solr folks run their integration tests?" and I found this artifact [3] in the Maven repository: solr-test-framework..."well, the name sounds good", I said.

Indeed, I found a lot of ready-made things that do a lot of the work for you. In my case, I only had to change my integration suite superclass a bit; actually a simple change, because all I had to do was extend org.apache.solr.SolrJettyTestBase.

This class provides methods for starting and stopping Jetty (yes, still Jetty: even though Solr is formally no longer a JEE web application, under the hood it still is, and it comes bundled with a Jetty which provides the HTTP connectivity). Starting the servlet container in your tests is up to you, by means of the several createJetty(...) static methods; the class, instead, provides an @AfterClass annotated method which stops Jetty at the end of the execution, of course only in case it has been previously started.
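Just to give an idea, a minimal sketch of such a test class could look like this (the Solr home path and the assertion are hypothetical, and the exact createJetty(...) / getSolrClient() signatures may differ slightly between 5.x versions):

import org.apache.solr.SolrJettyTestBase;
import org.apache.solr.client.solrj.SolrQuery;
import org.junit.BeforeClass;
import org.junit.Test;

public class SearchIntegrationTest extends SolrJettyTestBase {

    @BeforeClass
    public static void startContainer() throws Exception {
        // Starting Jetty is up to us; the superclass stops it at the end of the suite.
        createJetty("src/it/resources/solr-home");
    }

    @Test
    public void matchAllQuery() throws Exception {
        // getSolrClient() points to the core deployed in the embedded Jetty.
        assertEquals(0, getSolrClient().query(new SolrQuery("*:*")).getStatus());
    }
}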

You can find my code here [4], any feedback is warmly welcome ;) 

--------------------------
[1] https://github.com/agazzarini/SolRDF
[2] pom.xml using the Maven Cargo plugin and Solr 4.10.4
[3] http://mvnrepository.com/artifact/org.apache.solr/solr-test-framework/5.3.1
[4] SolRDF test superclass using the solr-test-framework

Sunday, June 07, 2015

Towards a scalable Solr-based RDF Store

SolRDF (i.e. Solr + RDF) is a set of Solr extensions for managing (indexing and searching) RDF data.



In a preceding post I described how to quickly set up a standalone SolRDF instance in two minutes; here, after some work more or less described in this issue, I'll describe in a few steps how to run SolRDF in a simple cluster (using SolrCloud). The required steps are very similar to what you (hopefully) already did for the standalone instance.

All you need

  • A shell  (in case you are on the dark side of the moon, all steps can be easily done in Eclipse or whatever IDE) 
  • Java 7
  • Apache Maven (3.x)
  • Apache Zookeeper  (I'm using the 3.4.6 version)
  • git (optional, you can also download the repository from GitHub as a zipped file)


Start Zookeeper 


Open a shell and type the following

# cd $ZOOKEEPER_HOME/bin
# ./zkServer.sh start

That will start Zookeeper in the background (use start-foreground instead for foreground mode). By default it will listen on localhost:2181.

Checkout SolRDF


If this is the first time you hear about SolRDF, you need to clone the repository. Open another shell and type the following:

# cd /tmp
# git clone https://github.com/agazzarini/SolRDF.git solrdf-download

Alternatively, if you've already cloned the repository, just pull the latest version; or, if you don't have git, you can download the whole repository from here.

Build and Run SolRDF nodes


For this example we will set-up a simple cluster consisting of a collection with two shards.

# cd solrdf-download/solrdf
# mvn -DskipTests \
    -Dlisten.port=$PORT \
    -Dindex.data.dir=$DATA_DIR \
    -Dulog.dir=$ULOG_DIR \
    -Dzk=$ZOOKEEPER_HOST_PORT \
    -Pcloud \
    clean package cargo:run

Where
  • $PORT is the hosting servlet engine listen port;
  • $DATA_DIR is the directory where Solr will store its datafiles (i.e. the index)
  • $ULOG_DIR is the directory where Solr will store its transaction logs.
  • $ZOOKEEPER_HOST_PORT is the Zookeeper listen address (e.g. localhost:2181)
The very first time you run this command a lot of things will be downloaded, Solr included. At the end you should see something like this:

[INFO] Jetty 7.6.15.v20140411 Embedded started on port [8080]
[INFO] Press Ctrl-C to stop the container...

The first node of SolRDF is up and running!

(The command above assumes the node is running on localhost:8080)

The second node can be started by opening another shell and re-executing the command above:

# cd solrdf-download/solrdf
# mvn -DskipTests \
    -Dlisten.port=$PORT \
    -Dindex.data.dir=$DATA_DIR \
    -Dulog.dir=$ULOG_DIR \
    -Dzk=$ZOOKEEPER_HOST_PORT \
    -Pcloud \
    cargo:run

Note:
  • "clean package" options have been omitted: you've already did that in the previous step
  • you need to declare different parameters values (port, data dir, ulog dir) if you are on the same machine
  • you can use the same parameters values if you are on a different machine
If you open the administration console you should see something like this:



(Distributed) Indexing


Open another shell and type the following (assuming a node is running on localhost:8080):

# curl -v http://localhost:8080/solr/store/update/bulk \
    -H "Content-Type: application/n-triples" \
    --data-binary @/tmp/solrdf-download/solrdf/src/test/resources/sample_data/bsbm-generated-dataset.nt 


Wait a moment...ok! You just added 5007 triples! They've been distributed across the cluster: you can see that by opening the administration consoles of the participating nodes. Selecting the "store" core of each node, you can see how many triples have been assigned to that specific node.



Querying


Open another shell and type the following:

# curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...  

# curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+xml"
...
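The same call can of course be made programmatically; here is a plain Java sketch (no SolrJ, just java.net, with error handling omitted) that sends the SELECT query above to the node on localhost:8080 and prints the response:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlQueryExample {
    public static void main(String[] args) throws Exception {
        // Same SPARQL query used in the curl examples above, URL-encoded as the "q" parameter.
        final String query = URLEncoder.encode("SELECT * WHERE { ?s ?p ?o } LIMIT 10", "UTF-8");
        final URL url = new URL("http://127.0.0.1:8080/solr/store/sparql?q=" + query);

        final HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        // Ask for SPARQL results in JSON (XML works too, as shown above).
        connection.setRequestProperty("Accept", "application/sparql-results+json");

        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}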

In the examples above I'm using only the node running on localhost:8080 (for both indexing and querying), but you can send the query to any node in the cluster. For instance, you can re-execute the query above against the other node (assuming it is running on localhost:8081):

# curl "http://127.0.0.1:8081/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...  


You will get the same results.

Is that ready for a production scenario? No, absolutely not. I think a lot still needs to be done on the indexing and querying optimization side. At the moment only the functional side has been covered: the integration test suite includes about 150 SPARQL queries (ASK, CONSTRUCT, SELECT and DESCRIBE) and updates (e.g. INSERT, DELETE) taken from the Learning SPARQL book [1], which work regardless of whether the target service is running as a standalone or a clustered instance.

I will run the first benchmarks as soon as possible, but honestly, at the moment I don't expect to see high throughput.

Best,
Andrea

[1] http://www.learningsparql.com

Sunday, April 19, 2015

RDF Faceting with Apache Solr: SolRDF

"Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters."
(Source: Wikipedia)

Apache Solr built-in faceting capabilities are nicely described in the official Solr Reference Guide [1] or in the Solr Wiki [2].
In SolRDF, due to the nature of the underlying data, faceted search assumes a shape which is a bit different from traditional faceting over structured data. For instance, while in a traditional Solr schema we could have something like this:

<field name="title" .../>
<field name="author" .../>

<field name="publisher" .../>
<field name="publication_year" .../>
<field name="isbn" .../>
<field name="subject" .../>
...


In SolRDF, data is always represented as a sequence of triples, that is, a set of assertions (aka statements) representing the state of a given entity by means of three or four members: a subject, a predicate, an object and an optional context. The underlying schema, described in more detail in a dedicated section of the Wiki, is, simplifying, something like this:

<!-- Subject -->
<field name="s" .../>

<!-- Predicate -->
<field name="p" .../>
 
<!-- Object -->
<field name="o" .../>

A "book" entity would be represented, in RDF, in the following way:

<#xyz>
    dc:title "La Divina Commedia" ;  
    dc:creator "Dante Alighieri" ;
    dc:publisher "ABCD Publishing";
    ...


A faceted search makes sense only when the target aggregation field or criterion leads to a literal value, a number, something that can be aggregated. That's the reason why, in a traditional Solr index of books, you will see a request like this:

facet=true 
&facet.field=year 
&facet.field=subject  
&facet.field=author
 
In the example above, we are requesting facets for three fields: year, subject and author.

In SolRDF we don't have "dedicated" fields like year or author; we always have s, p, o and an optional c. Faceting on those fields, although perfectly possible using plain Solr facet fields (e.g. facet.field=s&facet.field=p), doesn't make much sense because their values are always URIs or blank nodes.

Instead, the field where faceting reveals its power is the object. But again, asking for plain faceting on the o field (i.e. facet.field=o) would result in a facet that aggregates apples and bananas: each object represents a different meaning, and it could have a different domain and data-type. We need a way to identify a given range of objects.

In RDF, what determines the range of the object of a given triple is the second member, the predicate. So instead of indicating the target field of a given facet, we will indicate a query that selects a given range of object values. An example will surely make this clearer.
Solr (field) faceting:

facet=true&facet.field=author

SolRDF (object query) faceting:

facet=true&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>

The query selects all triples whose predicate is dc:creator; faceting is then computed on their object values (i.e. the author names). The same concept can be applied to range faceting.

Facet Fields


Traditional field faceting is supported in SolRDF: you can have a field (remember: s, p, o or c) treated as a facet by means of the facet.field parameter. All the other parameters described in the Solr Reference Guide [1] are supported. Some examples:

Ex #1: field faceting on predicates with a minimum count of 1

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.mincount=1   

Ex #2: field faceting on subjects and predicates with a different minimum count


q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.field=p
&facet.field=s
&f.s.facet.mincount=1
&f.p.facet.mincount=10

Ex #3: field faceting on predicates with a prefix (Dublin Core namespace) and minimum count constraints

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.prefix=<http://purl.org/dc
   

Object Queries Faceting


Facet object queries have basically the same meaning as facet fields: the only difference is that, instead of indicating a target field, faceting is always done on the o(bject) field, and you indicate, with a query, which objects will be faceted. Some examples:

Ex #1: faceting on publishers


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/publisher>


Ex #2: faceting on names (creators or collaborators)


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator> p:<http://purl.org/dc/elements/1.1/collaborator>


Ex #3: faceting on relationships of a given resource


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=s:<http://example.org#xyz> p:<http://purl.org/dc/elements/1.1/relation>


The facet.object.q parameter can be repeated, optionally using a progressive number as a suffix in the parameter name:

Ex #4: faceting on creators and languages


q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q=p:<http://purl.org/dc/elements/1.1/language>


or

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
 
In this case you will get a facet for each query, keyed using the query itself:

<lst name="facet_counts">
 
<lst name="facet_fields">
  
<lst name="p:">
    
<int name="Ross, Karlint">12
     <int name="Earl, James">9
     <int name="Foo, John">9
     ...
   </lst>
  
<lst name="p:">
    
<int name="en">3445

     <int name="de">2958
     <int name="it">2865
     ...
   </lst>
 
</lst> 

</lst

The suffix in the parameter name is not required, but it is useful when you want to indicate an alias for each query:

q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
&facet.object.q.alias.1=author
&facet.object.q.alias.2=language

The response in this case will be (note that each facet is now associated with the alias):

<lst name="facet_counts">
 
<lst name="facet_fields">
  
<lst name="author">
    
<int name="Ross, Karlint">12</int>
     <int name="Earl, James">9</int>
     <int name="Foo, John">9</int>
     ...
   </lst>
  
<lst name="language">
    
<int name="en">3445
</int>
     <int name="de">2958</int>
     <int name="it">2865</int>
     ...
   </lst>
 
</lst><</lst



Object Range Queries Faceting


Range faceting is described in the Solr Reference Guide [1] or in the Solr Wiki [2]. You can get this kind of facet on all fields that support range queries (e.g. dates and numerics).
A request like this:

facet.range=year
&facet.range.start=2000
&facet.range.end=2010
&facet.range.gap=1

will produce a response like this:


<lst name="facet_ranges">
   
<lst name="year">
     
<lst name="counts">
         
<int name="2000">3445</int>
         
<int name="2001">2862</int>
         
<int name="2002">2776</int>
         
<int name="2003">2865</int>
          ... 
      
</lst>     
      
<int name="gap">1</int>
      
<int name="start">2000</int>
      
<int name="end">2010</int>
   
</
lst>
    ...

 

As briefly explained before, with semi-structured data like RDF we don't have a "year" or "price" or any other strictly dedicated field for representing a given concept; we always have 3 or 4 fields:
  • a s(ubject)
  • a p(redicate)
  • an o(bject)
  • and optionally a c(ontext)
Requesting something like this:

facet.range=o

wouldn't work: we would again mix apples and bananas. In addition, without knowing in advance the domain of the target values (e.g. integer, double, date), how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Range faceting for s or p or c attributes doesn't make any sense at all because the corresponding URI datatype (i.e. string) doesn't support range queries.

In order to enable range faceting on SolRDF, the default FacetComponent has been replaced with a custom subclass that does something I called Objects Range Query Faceting, which is actually a mix between facet ranges and facet queries.
  • Facet because, of course, the final results are a set of facets
  • Object because faceting uses the o(bject) field
  • Range because what we are going to compute are facet ranges
  • Queries because instead of indicating the target attribute in request (by means of facet.range parameter), this kind of faceting requires a facet.range.q which is a query (by default parsed by the Solr Query Parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on them.
In this way, we can issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
&facet.range.start=2000
&facet.range.end=2010
&facet.range.gap=1

or like this

facet.range.q=p:<http://c.d.e#release_date>  
&facet.range.start=2000-01-10T17:00:00Z 
&facet.range.end=2010-01-10T17:00:00Z
&facet.range.gap=+1YEAR

You can also have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
   
<lst name="p:">
     
<lst name="counts">
         
<int name="2000">3445</int>
         
<int name="2001">2862</int>
         
<int name="2002">2776</int>
         
<int name="2003">2865</int>
          ...
      
</lst>
      
<int name="gap">1</int>
      
<int name="start">2000</int>
      
<int name="end">2010</int>
   
</lst>
   
<lst name="p:">
     
<lst name="counts">
         
<int name="2000-03-29T17:06:02Z">2516</int>
         
<int name="2001-04-03T21:30:00Z">1272</int>
          ...
      
</lst>       

      <int name="gap">+1YEAR</int>
      
<int name="start">2000-01-10T17:00:00Z</int>
      
<int name="end">2010-01-10T17:00:00Z
</int>
    </lst>
    ...


Aliasing is supported in the same way described above for facet object queries. The same request, with aliases, would be:

facet.range.q.1=p:<http://a.b.c#start_year>
&facet.range.q.alias.1=start_year_alias
&facet.range.q.hint.1=num      <-- optional, as "num" (numeric) is the default value
&facet.range.start.1=2000
&facet.range.end.1=2010
&facet.range.gap.1=1
&facet.range.q.2=p:<http://c.d.e#release_date>
&facet.range.q.alias.2=release_date_alias
&facet.range.q.hint.2=date
&facet.range.start.2=2000-01-10T17:00:00Z
&facet.range.end.2=2010-01-10T17:00:00Z
&facet.range.gap.2=+1YEAR


Note in the response the aliases instead of the full queries:


<lst name="facet_ranges">
    <lst name="start_year_alias">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="release_date_alias">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>       

      <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...

Here you can find a sample response containing all facets described above.

You can find the same content of this post in the SolRDF Wiki [3]. As usual any feedback is warmly welcome!

-------------------------------------
[1] https://cwiki.apache.org/confluence/display/solr/Faceting
[2] https://wiki.apache.org/solr/SolrFacetingOverview
[3] https://github.com/agazzarini/SolRDF/wiki