Captain Gazza's Log...: RDF meets Solr: S P O C Faceting

SolRDF is a Solr extension for indexing and searching RDF data.
In a previous post I explained how to set up SolRDF in two minutes, using Maven to build and install everything automatically.

Once installed, you can index data by issuing a command like this:

> curl -v "http://localhost:8080/solr/store/update/bulk?commit=true" -H "Content-Type: application/n-triples" --data-binary @/path-to-your-data/data.nt

Then you can run a SPARQL query like this:

> curl "http://127.0.0.1:8080/solr/store/sparql"   --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10"   -H "Accept: application/sparql-results+xml"

Now, since all of this runs inside a full-text search engine, why not combine some of Solr's cool features with SPARQL results?

The underlying idea is this: the serialization of SPARQL results is standardized in several W3C documents and therefore cannot be changed, so we need a way to embed those results in another response that also carries additional information such as metadata, facets and so on.

The Solr query response sounds perfect for this goal: I only have to replace the content of the <result> section with a <sparql> document. (Note that I'm specifically talking about the XML response writer, which is the only one I have implemented so far; other formats are coming in the next episodes...) Running a query like this

/sparql?facet=true&facet.field=p&start=100&rows=10&q=SELECT * WHERE {?s ?p ?o}

I get the following response (note the mix of SPARQL and Solr results):

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">31</int>
        <str name="query">SELECT * WHERE{ ?s ?p ?o}</str>
    </lst>
    <result name="response" numFound="3875" start="100" maxScore="1.0">
        <sparql>
            <head>
                <variable name="s"/>
                <variable name="p"/>
                <variable name="o"/>
            </head>
            <results>
                <result>
                    <binding name="s">
                        <uri>http://example/book2</uri>
                    </binding>
                    ...
                </result>
                ...
            </results>
        </sparql>
    </result>
    <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
            <lst name="p">
                <int name="&lt;http://example.org/ns#price&gt;">231</int>
                <int name="&lt;http://purl.org/dc/elements/1.1/creator&gt;">1432</int>
                <int name="&lt;http://purl.org/dc/elements/1.1/title&gt;">2212</int>
            </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
    </lst>
</response>
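
To give an idea of how a client might consume such a mixed response, here is a minimal Python sketch that pulls the SPARQL bindings and the facet counts out of the same document. It is not part of SolRDF: the function name parse_hybrid and the trimmed SAMPLE document are made up for illustration. It also assumes the elements carry no XML namespace, as in the response shown above (a standalone SPARQL results document normally declares one), and the sample escapes the angle brackets in the facet names as &lt; and &gt; so that a strict XML parser accepts them.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative document shaped like the hybrid response above.
SAMPLE = """<response>
  <result name="response" numFound="3875" start="100" maxScore="1.0">
    <sparql>
      <head>
        <variable name="s"/>
      </head>
      <results>
        <result>
          <binding name="s">
            <uri>http://example/book2</uri>
          </binding>
        </result>
      </results>
    </sparql>
  </result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="p">
        <int name="&lt;http://example.org/ns#price&gt;">231</int>
        <int name="&lt;http://purl.org/dc/elements/1.1/title&gt;">2212</int>
      </lst>
    </lst>
  </lst>
</response>"""


def parse_hybrid(xml_text):
    """Hypothetical helper: split a hybrid response into its two halves."""
    root = ET.fromstring(xml_text)
    result = root.find("result[@name='response']")

    # SPARQL half: one dict per solution, mapping variable name to value.
    bindings = []
    for row in result.findall("sparql/results/result"):
        bindings.append({
            b.get("name"): "".join(b.itertext()).strip()
            for b in row.findall("binding")
        })

    # Solr half: facet field name -> {facet value: count}.
    facets = {}
    path = "lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    for field in root.findall(path):
        facets[field.get("name")] = {
            item.get("name"): int(item.text) for item in field.findall("int")
        }

    return int(result.get("numFound")), bindings, facets


num_found, bindings, facets = parse_hybrid(SAMPLE)
print(num_found, bindings, facets)
```

The point of the sketch is that the two halves stay independent: the SPARQL part can be handed to any code that understands the standard results structure, while the facets are read like in any other Solr response.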

The first question is: what triggers that hybrid search? I would like to preserve the standard SPARQL endpoint functionality, so a good compromise could be the following:

if the query string contains only a q parameter, the plain SPARQL endpoint executes the query and returns a standard SPARQL results response;
if the query string also contains other parameters (at the moment I consider only facet, facet.field, rows and start), a hybrid search is executed, returning results in the mixed format shown above.
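
The rule above can be sketched in a few lines of Python. This is only an illustration of the decision, not the actual SolRDF code: the names HYBRID_PARAMS and search_mode are hypothetical, and I am assuming that parameters outside the four listed ones do not trigger the hybrid mode.

```python
from urllib.parse import parse_qs

# The parameters listed above as triggering a hybrid search.
HYBRID_PARAMS = {"facet", "facet.field", "rows", "start"}


def search_mode(query_string):
    """Hypothetical helper: return "sparql" or "hybrid" for a query string."""
    params = parse_qs(query_string, keep_blank_values=True)
    if "q" not in params:
        raise ValueError("missing q parameter")
    # Assumption: only the four known parameters switch to the hybrid mode.
    if HYBRID_PARAMS.intersection(params):
        return "hybrid"
    return "sparql"


print(search_mode("q=SELECT+*+WHERE+%7B%3Fs+%3Fp+%3Fo%7D"))
print(search_mode("facet=true&facet.field=p&start=100&rows=10&q=SELECT+*"))
```

With this shape, existing SPARQL clients that send only q keep getting a standard response, and the mixed format is strictly opt-in.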

As usual, any feedback is warmly welcome!

Saturday, March 07, 2015
