Saturday, March 07, 2015

RDF meets Solr: S P O C Faceting

SolRDF is a Solr extension for indexing and searching RDF data.
In a preceding post I explained how to set up SolRDF in two minutes, using Maven to build and install everything automatically.

Once installed, you can index data by issuing a command like this:

> curl -v "http://localhost:8080/solr/store/update/bulk?commit=true" -H "Content-Type: application/n-triples" --data-binary @/path-to-your-data/data.nt

and then you can run a SPARQL query this way:

> curl "http://127.0.0.1:8080/solr/store/sparql" --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" -H "Accept: application/sparql-results+xml"

Now, since everything is running inside a full-text search engine, why not combine some of the cool features of Solr with SPARQL results?

The underlying idea is this: the serialization of SPARQL results is standardized in several W3C documents and therefore cannot be changed, so we need a way to embed those results in another response that carries additional information such as metadata, facets and so on.

The Solr query response sounds perfect for this goal: I only have to replace the content of the <result> section with a <sparql> document (note that I'm talking specifically about the XML response writer, the only one I have implemented so far; other formats are coming in the next episodes...). Running a query like this

/sparql?facet=true&facet.field=p&start=100&rows=10&q=SELECT * WHERE {?s ?p ?o}

I get the following response (note the mix of SPARQL and Solr results):

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">31</int>
        <str name="query">SELECT  * WHERE{ ?s ?p ?o}</str>
    </lst>
    <result name="response" numFound="3875" start="100" maxScore="1.0">
        <sparql>
            <head>
                <variable name="s"/>
                <variable name="p"/>
                <variable name="o"/>
            </head>
            <results>
                <result>
                    <binding name="s">
                        <uri>http://example/book2</uri>
                    </binding>
                    ...
                </result>
                ...
            </results>
        </sparql>
    </result>
    <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
            <lst name="p">
                <int name="&lt;http://example.org/ns#price&gt;">231</int>
                <int name="&lt;http://purl.org/dc/elements/1.1/creator&gt;">1432</int>
                <int name="&lt;http://purl.org/dc/elements/1.1/title&gt;">2212</int>
            </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
    </lst>
</response>
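
To give an idea of how a client could consume such a mixed response, here is a minimal Python sketch. It is not part of SolRDF: the function names hybrid_url and parse_hybrid are hypothetical, and the parsing assumes that the <sparql> element carries no XML namespace, exactly as in the sample above (a namespaced document would need qualified names in the paths).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def hybrid_url(base, sparql, facet_field, start=0, rows=10):
    """Build a hybrid request URL; urlencode escapes the SPARQL text."""
    params = [
        ("q", sparql),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("start", start),
        ("rows", rows),
    ]
    return base + "?" + urlencode(params)


def parse_hybrid(xml_text):
    """Return (numFound, rows, facets) from a hybrid XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result[@name='response']")
    num_found = int(result.get("numFound"))

    # SPARQL part: one dict per <result>, keyed by variable name.
    rows = []
    for res in result.findall("sparql/results/result"):
        row = {}
        for binding in res.findall("binding"):
            value = binding[0]  # first child: <uri>, <literal> or <bnode>
            row[binding.get("name")] = (value.text or "").strip()
        rows.append(row)

    # Solr part: facet field name -> {facet value: count}.
    facets = {}
    path = "lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    for field in root.findall(path):
        facets[field.get("name")] = {
            item.get("name"): int(item.text) for item in field.findall("int")
        }
    return num_found, rows, facets


# A trimmed-down version of the response shown above (assumed shape).
SAMPLE = """<response>
  <result name="response" numFound="3875" start="100" maxScore="1.0">
    <sparql>
      <head><variable name="s"/></head>
      <results>
        <result>
          <binding name="s"><uri>http://example/book2</uri></binding>
        </result>
      </results>
    </sparql>
  </result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="p">
        <int name="&lt;http://example.org/ns#price&gt;">231</int>
      </lst>
    </lst>
  </lst>
</response>"""

if __name__ == "__main__":
    print(hybrid_url("http://127.0.0.1:8080/solr/store/sparql",
                     "SELECT * WHERE {?s ?p ?o}", "p", start=100, rows=10))
    print(parse_hybrid(SAMPLE))
```

Building the URL with urlencode avoids hand-escaping the braces and question marks of the SPARQL query, which the raw query string above leaves unencoded for readability.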

The first question is: what triggers that hybrid search? I would like to preserve the standard SPARQL endpoint functionality, so a good compromise could be the following:
  • if the query string contains only a q parameter, the plain SPARQL endpoint executes the query and returns a standard SPARQL results response;
  • if the query string also contains other parameters (at the moment I have considered only the facet, facet.field, rows and start parameters), a hybrid search is executed, returning results in the mixed mode shown above.
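
The rule in the two bullets above can be sketched in a few lines of Python. This only illustrates the decision, it is not the SolRDF code: the name search_mode is hypothetical, and I am assuming that parameters outside the four listed ones are simply ignored, which the rule itself leaves open.

```python
from urllib.parse import parse_qs

# The only parameters considered so far for triggering a hybrid search.
HYBRID_PARAMS = {"facet", "facet.field", "rows", "start"}


def search_mode(query_string):
    """Return "hybrid" or "sparql" for a raw (URL-encoded) query string."""
    names = set(parse_qs(query_string, keep_blank_values=True))
    # Assumption: unknown parameters do not switch the endpoint to hybrid mode.
    return "hybrid" if names & HYBRID_PARAMS else "sparql"


if __name__ == "__main__":
    print(search_mode("q=SELECT+*+WHERE+%7B%3Fs+%3Fp+%3Fo%7D"))
    print(search_mode("facet=true&facet.field=p&start=100&rows=10&q=SELECT+*"))
```
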
As usual, any feedback is warmly welcome!