Sunday, April 19, 2015

RDF Faceting with Apache Solr: SolRDF

"Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters."
(Source: Wikipedia)

Apache Solr's built-in faceting capabilities are nicely described in the official Solr Reference Guide [1] or in the Solr Wiki [2].
In SolRDF, due to the nature of the underlying data, faceted search takes a shape which is a bit different from traditional faceting over structured data. For instance, while in a traditional Solr schema we could have something like this:

<field name="title" .../>
<field name="author" .../>

<field name="publisher" .../>
<field name="publication_year" .../>
<field name="isbn" .../>
<field name="subject" .../>
...


In SolRDF, data is always represented as a sequence of triples, that is, a set of assertions (aka statements), each representing the state of a given entity by means of three or four members: a subject, a predicate, an object and an optional context. The underlying schema, which is described in more detail in a dedicated section of the Wiki, is (simplifying) something like this:

<!-- Subject -->
<field name="s" .../>

<!-- Predicate -->
<field name="p" .../>
 
<!-- Object -->
<field name="o" .../>

A "book" entity would be represented, in RDF, in the following way:

<#xyz>
    dc:title "La Divina Commedia" ;  
    dc:creator "Dante Alighieri" ;
    dc:publisher "ABCD Publishing";
    ...


A faceted search makes sense only when the target aggregation field or criterion leads to a literal value, a number, something that can be aggregated. That's why, in a traditional Solr index of books, you will see a request like this:

facet=true 
&facet.field=year 
&facet.field=subject  
&facet.field=author
 
In the example above, we are requesting facets for three fields: year, subject and author.

In SolRDF we don't have such "dedicated" fields like year or author: we always have s, p, o and an optional c. Faceting on those fields, although perfectly possible using plain Solr field faceting (e.g. facet.field=s&facet.field=p), doesn't make much sense because subjects and predicates are always URIs or blank nodes.

Instead, the field where faceting reveals its power is the object. But again, asking for plain faceting on the o field (i.e. facet.field=o) would result in a facet that aggregates apples and bananas: each object carries a different meaning and could have a different domain and data-type. We need a way to identify a given range of objects.

In RDF, what determines the range of the object of a given triple is the second member, the predicate. So instead of indicating the target field of a given facet, we will indicate a query that selects a given range of object values. An example will surely make this clearer.
Solr (field) faceting:

facet=true&facet.field=author

SolRDF (object query) faceting:

facet=true&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>

The query selects all the objects of triples having dc:creator as predicate (i.e. the author names), and faceting is then computed on those values. The same concept can be applied to range faceting.
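For example, putting the two things together, a complete request could look more or less like this (a sketch that assumes the /sparql endpoint described in the posts below; the Dublin Core predicate is the same one used throughout this post):

> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o }" \
  --data-urlencode "facet=true" \
  --data-urlencode "facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>"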

Facet Fields


Traditional field faceting is supported in SolRDF: you can have a field (remember: s, p, o or c) treated as a facet by means of the facet.field parameter. All the other parameters described in the Solr Reference Guide [1] are supported as well. Some examples:

Ex #1: field faceting on predicates with a minimum count of 1

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.mincount=1   

Ex #2: field faceting on subjects and predicates with a different minimum count


q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.field=p
&facet.field=s
&f.s.facet.mincount=1
&f.p.facet.mincount=10

Ex #3: field faceting on predicates with a prefix (Dublin Core namespace) and minimum count constraints

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.prefix=<http://purl.org/dc
&facet.mincount=1

Object Queries Faceting


Facet object queries have basically the same meaning as facet fields: the only difference is that, instead of indicating a target field, faceting is always done on the o(bject) field, and you indicate, with a query, which objects will be faceted. Some examples:

Ex #1: faceting on publishers


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/publisher>


Ex #2: faceting on names (creators or collaborators)


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator> p:<http://purl.org/dc/elements/1.1/collaborator>


Ex #3: faceting on relationships of a given resource


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=s:<http://example.org#xyz> p:<http://purl.org/dc/elements/1.1/relation>


The facet.object.q parameter can be repeated, using an optional progressive number as a suffix in the parameter name:

Ex #4: faceting on creators and languages


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q=p:<http://purl.org/dc/elements/1.1/language>


or

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
 
In this case you will get a facet for each query, keyed using the query itself:

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="p:<http://purl.org/dc/elements/1.1/creator>">
      <int name="Ross, Karlint">12</int>
      <int name="Earl, James">9</int>
      <int name="Foo, John">9</int>
      ...
    </lst>
    <lst name="p:<http://purl.org/dc/elements/1.1/language>">
      <int name="en">3445</int>
      <int name="de">2958</int>
      <int name="it">2865</int>
      ...
    </lst>
  </lst>
</lst>

The suffix in the parameter name is not required, but it becomes useful when you want to associate an alias with each query:

q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
&facet.object.q.alias.1=author
&facet.object.q.alias.2=language

The response in this case will be (note that each facet is now associated with the alias):

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="author">
      <int name="Ross, Karlint">12</int>
      <int name="Earl, James">9</int>
      <int name="Foo, John">9</int>
      ...
    </lst>
    <lst name="language">
      <int name="en">3445</int>
      <int name="de">2958</int>
      <int name="it">2865</int>
      ...
    </lst>
  </lst>
</lst>



Object Range Queries Faceting


Range faceting is described in the Solr Reference Guide [1] or in the Solr Wiki [2]. You can get this kind of facet on all fields that support range queries (e.g. dates and numerics).
A request like this:

facet.range=year
&facet.range.start=2000
&facet.range.end=2010
&facet.range.gap=1

will produce a response like this:


<lst name="facet_ranges">
    <lst name="year">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    ...

 

As briefly explained before, with semi-structured data like RDF we don't have "year" or "price" or whatever strictly dedicated field for representing a given concept; we always have 3 or 4 fields:
  • a s(ubject)
  • a p(redicate)
  • an o(bject)
  • and optionally a c(ontext)
Requesting something like this:

facet.range=o

wouldn't work: we would again mix apples and bananas. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date), how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Range faceting on the s, p or c attributes doesn't make any sense at all because the corresponding values are URIs (i.e. strings), which don't support range queries.

In order to enable range faceting on SolRDF, the default FacetComponent has been replaced with a custom subclass that does something I called Objects Range Query Faceting, which is actually a mix between facet ranges and facet queries.
  • Facet because, of course, the final results are a set of facets
  • Object because faceting uses the o(bject) field
  • Range because what we are going to compute are facet ranges
  • Queries because, instead of indicating the target attribute in the request (by means of the facet.range parameter), this kind of faceting requires a facet.range.q, which is a query (by default parsed by the Solr query parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on them.
In this way, we can issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
&facet.range.start=2000
&facet.range.end=2010
&facet.range.gap=1

or like this:

facet.range.q=p:<http://c.d.e#release_date>
&facet.range.start=2000-01-10T17:00:00Z
&facet.range.end=2010-01-10T17:00:00Z
&facet.range.gap=+1YEAR

You can also have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
    <lst name="p:<http://a.b.c#start_year>">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="p:<http://c.d.e#release_date>">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...


Aliasing is supported in the same way as described for Object Queries Faceting. The same request above, with aliases, would be:

facet.range.q.1=p:<http://a.b.c#start_year>
&facet.range.q.alias.1=start_year_alias
&facet.range.q.hint.1=num <-- optional, as num(eric) is the default value
&facet.range.start.1=2000
&facet.range.end.1=2010
&facet.range.gap.1=1
&facet.range.q.2=p:<http://c.d.e#release_date>
&facet.range.q.alias.2=release_date_alias
&facet.range.q.hint.2=date
&facet.range.start.2=2000-01-10T17:00:00Z
&facet.range.end.2=2010-01-10T17:00:00Z
&facet.range.gap.2=+1YEAR


Note in the response the aliases instead of the full queries:


<lst name="facet_ranges">
    <lst name="start_year_alias">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="release_date_alias">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...

Here you can find a sample response containing all facets described above.

You can find the same content as this post in the SolRDF Wiki [3]. As usual, any feedback is warmly welcome!

-------------------------------------
[1] https://cwiki.apache.org/confluence/display/solr/Faceting
[2] https://wiki.apache.org/solr/SolrFacetingOverview
[3] https://github.com/agazzarini/SolRDF/wiki

Sunday, April 05, 2015

RDF Faceting: Query Facets + Range facets = Object Ranges Queries Facets

Faceting on semi-structured data like RDF is definitely (at least for me) an interesting topic.

Issues #28 and #47 track the progress of that feature in SolRDF: RDF Faceting.
I just committed a stable version of one of those kinds of faceting: Object Ranges Queries Facets (issue #28). You can find here a draft of the documentation about how faceting works in SolRDF.

In a preceding article I described how a plain and basic SPOC faceting works; here I introduce this new type of faceting: Object Ranges Queries Facets.

Range faceting is a built-in feature in Solr: you can get this kind of facet on all fields that support range queries (e.g. dates and numerics). For instance, asking for something like this:

facet.range=year
facet.range.start=2000
facet.range.end=2010
facet.range.gap=1

you will get the following response:

<lst name="facet_ranges">
    <lst name="year">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...  
       </lst>      
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    ...
Plain range faceting on RDF schema? mmm....

SolRDF indexes semi-structured data, so we don't have arbitrary fields like year, creation_date, price and so on...we always have these fields:
  • s(ubject)
  • p(redicate)
  • o(bject) 
  • and optionally a c(ontext)
So here comes the question: how can I get the right domain values for my range facets? I don't have an explicit "year" or "price" or whatever attribute.
See the following data, which is a simple RDF representation of two projects (#xyz and #kyj):

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix abc: <http://a.b.c#> .
@prefix cde: <http://c.d.e#> .

<#xyz>
    abc:start_year "2001"^^xsd:integer ;
    abc:end_year "2003"^^xsd:integer ;
    cde:first_prototype_date "2001-06-15"^^xsd:date ;
    cde:last_prototype_date "2002-06-30"^^xsd:date ;
    cde:release_date "2003-10-10"^^xsd:date .

<#kyj>
    abc:start_year "2002"^^xsd:integer ;
    abc:end_year "2007"^^xsd:integer ;
    cde:first_prototype_date "2003-09-27"^^xsd:date ;
    cde:last_prototype_date "2005-08-24"^^xsd:date ;
    cde:release_date "2007-03-10"^^xsd:date .

The following table illustrates how the same data is indexed within Solr:
S(ubject)   P(redicate)                          O(bject)
#xyz        http://a.b.c#start_year              "2001"^^xsd:integer
#xyz        http://a.b.c#end_year                "2003"^^xsd:integer
#xyz        http://c.d.e#first_prototype_date    "2001-06-15"^^xsd:date
...

As you can see, the "logical" name of the attribute that each triple represents is in the P column, while the value of that attribute is in the O column. This is the main reason why plain Solr range faceting wouldn't work here: a request like this:

facet.range=o

would mix apples and bananas. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date, datetime) how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Requesting the same thing for s or p attributes doesn't make any sense at all because the datatype (string) doesn't support this kind of faceting. 

Object Ranges Queries Facets

In order to enable a range faceting that makes sense on SolRDF, I replaced the default FacetComponent with a custom subclass that does something I called Object Ranges Queries Facets, which is actually a mix between facet ranges and facet queries.
  • Object because the target field is the o(bject)
  • Facet because, of course, the final results are facets
  • Range because what we are going to compute are facet ranges
  • Queries because, instead of indicating the target attribute in the request (by means of the facet.range parameter), this kind of faceting requires a facet.range.q, which is a query (by default parsed by the Solr query parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on top of them.
Returning to our example, we could issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
facet.range.start=2000
facet.range.end=2010
facet.range.gap=1

or like this:

facet.range.q=p:<http://c.d.e#release_date>
facet.range.start=2000-01-10T17:00:00Z
facet.range.end=2010-01-10T17:00:00Z
facet.range.gap=+1YEAR

You can have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
    <lst name="p:<http://a.b.c#start_year>">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="p:<http://c.d.e#release_date>">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...

You can do more with request parameters, query aliasing and shared parameters. Please have a look at SolRDF Wiki.

As usual, feedback is warmly welcome ;)

Saturday, March 07, 2015

RDF meets Solr: S P O C Faceting

SolRDF is a Solr extension for indexing and searching RDF data.
In a preceding post I explained how to set up SolRDF in two minutes, leveraging Maven to automatically build and install the whole thing.

Once installed, you can index data by issuing a command like this:

> curl -v http://localhost:8080/solr/store/update/bulk?commit=true -H "Content-Type: application/n-triples" --data-binary @/path-to-your-data/data.nt

and then, you can execute a SPARQL query in this way:

> curl "http://127.0.0.1:8080/solr/store/sparql"   --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10"   -H "Accept: application/sparql-results+xml"

Now, since the whole thing is running within a full-text search engine, why don't we try to combine some of the cool features of Solr with SPARQL results?

The underlying idea is: SPARQL results serialization is standardized in several W3C documents and therefore cannot be changed. We need a way to embed those results in another response that will contain additional information like metadata, facets and so on.

The Solr query response sounds perfect for accomplishing this goal: I only have to replace the <result> section with a <sparql> document (note that I'm specifically talking about the XML response writer, the only one I have implemented at the moment; other formats are coming in the next episodes...). Running a query like this

/sparql?facet=true&facet.field=p&start=100&rows=10&q=SELECT * WHERE {?s ?p ?o}

I can get the following response (note the mix between SPARQL and Solr results):

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">31</int>
        <str name="query">SELECT * WHERE { ?s ?p ?o }</str>
    </lst>
    <result name="response" numFound="3875" start="100" maxScore="1.0">
        <sparql>
            <head>
                <variable name="s"/>
                <variable name="p"/>
                <variable name="o"/>
            </head>
            <results>
                <result>
                    <binding name="s">
                        <uri>http://example/book2</uri>
                    </binding>
                    ...
                </result>
                ...
            </results>
        </sparql>
    </result>
    <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
            <lst name="p">
                <int name="<http://example.org/ns#price>">231</int>
                <int name="<http://purl.org/dc/elements/1.1/creator>">1432</int>
                <int name="<http://purl.org/dc/elements/1.1/title>">2212</int>
            </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
    </lst>
</response>

The first question is: what triggers that hybrid search? I would like to maintain the standard SPARQL endpoint functionality, so a good compromise could be the following:
  • if the query string contains only a q parameter, then the plain SPARQL endpoint will execute the query. It will return a standard SPARQL-Results response;
  • if the query string also contains other parameters (at the moment I considered only the facet, facet.field, rows and start parameters), then a hybrid search will be executed, providing results in the mixed mode listed above (see the two example requests below).
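Both requests below use exactly the endpoint and parameters already shown in this post; they are just the two modes side by side:

# Plain SPARQL endpoint: only the q parameter, standard SPARQL-Results response
> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+xml"

# Hybrid search: q plus plain Solr parameters, mixed Solr / SPARQL response as shown above
> curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o }" \
  --data-urlencode "facet=true" \
  --data-urlencode "facet.field=p" \
  --data-urlencode "start=100" \
  --data-urlencode "rows=10"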
As usual, any feedback is warmly welcome! 

Tuesday, February 10, 2015

SPARQL Integration tests with SolRDF

Last year, I got a chance to give some contribution to a wonderful project, CumulusRDF, an RDF store on a cloud-based architecture. The Integration Test Suite was one of the most interesting tasks I worked on.
 
There, I used JUnit to run some examples coming from Learning SPARQL by Bob DuCharme (O'Reilly, 2013). Both O'Reilly and the author (BTW, thanks a lot) gave me permission to do that in the project.

So, when I set up the first prototype of SolRDF, I wondered how I could create a complete (integration) test suite for doing more or less the same thing...and I came to the obvious conclusion that something of that work could be reused.

Something had to be changed, mainly because CumulusRDF uses Sesame as the underlying RDF framework, while SolRDF uses Jena...but in the end it was a minor change...they are both valid, easy and powerful.

So, for my LearningSPARQL_ITCase I needed:
  • A setup method for loading the example data;
  • A teardown method for cleaning up the store; 
The example data is provided, on the LearningSPARQL website, in several files. Each file can contain a small dataset, a query or an expected result (in tabular format). So, returning to my tests, the flow should load the small dataset X, run the query Y and verify the results Z.

Although this post illustrates how to load a sample dataset in SolRDF, that is something you would do from the command line, not in a JUnit test. Instead, using Jena, in my unit tests I load the data into SolRDF with these few lines:

// DatasetAccessor provides access to remote datasets
// using the SPARQL 1.1 Graph Store HTTP Protocol.
// createHTTP(...) requires the URI of the remote Graph Store endpoint
// (graphStoreEndpointUri is just a placeholder for the SolRDF endpoint address)
DatasetAccessor dataset = DatasetAccessorFactory.createHTTP(graphStoreEndpointUri);

// Load a local memory model
Dataset memoryDataset = DatasetFactory.createMem();
Model memoryModel = memoryDataset.getDefaultModel();
memoryModel.read(dataURL, ...);

// Load the memory model in the remote dataset
dataset.add(memoryModel);

Ok, data has been loaded! In another post I will explain what I did, in SolRDF, for supporting the SPARQL 1.1 Graph Store HTTP Protocol. Keep in mind that the protocol is not fully covered at the moment.

Now, it's time to run a query and check the results. As you can see, I execute the same query twice: first against a memory model, then against SolRDF. In this way, assuming the Jena memory model is perfectly working, I can check and compare the results coming from the remote dataset (i.e. coming from SolRDF):

final Query query = QueryFactory.create(readQueryFromFile(...));
QueryExecution execution = null;
QueryExecution memExecution = null;  
    try {
       execution = QueryExecutionFactory.sparqlService(SOLRDF_URL, query);
       memExecution = QueryExecutionFactory.create(query, memoryDataset);

       ResultSet rs = execution.execSelect();
       ResultSet mrs = memExecution.execSelect();
       assertTrue(ResultSetCompare.isomorphic(rs, mrs));
    } catch (...) {
       ...
    } finally {
       // Close executions
    }

After that, the RDF store needs to be cleared. Although the Graph Store protocol would come to our aid here, it cannot be implemented in Solr because some HTTP methods (i.e. PUT and DELETE) cannot be used in RequestHandlers. The SolrRequestParsers class, which is the first handler of incoming requests, allows those methods only for /schema and /config requests. So, while a clean-up could easily be done using something like this:

dataset.deleteDefault();

Or, in HTTP:

DELETE /rdf-graph-store?default HTTP/1.1
Host: example.com

I cannot implement such behaviour in Solr. After checking the SolrConfig and SolrRequestParsers classes, I believe the simplest (non-RDF) way to clean up the store is:

SolrServer solr = new HttpSolrServer(SOLRDF_URI);
solr.deleteByQuery("*:*");
solr.commit();

I know, that has nothing to do with RDF or the Graph Store protocol, but I don't want to change the Solr core, and at the moment that represents a good compromise. After all, that code resides only in my unit tests.

That's all! I just merged all that stuff into master, so feel free to have a look. If you want to run the integration test suite you can do that from the command line:

# cd $SOLRDF_HOME
# mvn clean install

or in Eclipse, using the predefined Maven launch configuration solrdf/src/dev/eclipse/run-integration-test-suite.launch. Just right-click on that file and choose "Run as..."

Regardless of how you run it, you will see these messages:

(build section)

[INFO] -----------------------------------------------------------------
[INFO] Building Solr RDF plugin 1.0
[INFO] -----------------------------------------------------------------


(unit tests section)

-------------------------------------------------------
 T E S T S
-------------------------------------------------------

...
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.691 sec
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0

(cargo section. It starts the embedded Jetty)

[INFO] [beddedLocalContainer] Jetty 7.6.15.v20140411 Embedded starting...
...
[INFO] [beddedLocalContainer] Jetty 7.6.15.v20140411 Embedded started on port [8080]

(integration tests section)

------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.gazzax.labs.solrdf.integration.LearningSparql_ITCase
[INFO]  Running Query with prefixes test...
[INFO]  [store] webapp=/solr path=/rdf-graph-store params={default=} status=0 QTime=712
...

[DEBUG] : Query type 222, incoming Accept header... 

(end)

[INFO]  [store] Closing main searcher on request.
...

[INFO] [beddedLocalContainer] Jetty 7.6.15.v20140411 Embedded is stopped
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 42.302s
[INFO] Finished at: Tue Feb 10 18:19:21 CET 2015
[INFO] Final Memory: 39M/313M
[INFO] --------------------------------------------------


Best,
Andrea 

Sunday, December 21, 2014

A Solr RDF Store and SPARQL endpoint in just 2 minutes

How to store and query RDF data in Solr? Here is a quick guide: just 2 minutes / 5 steps and you will get there ;)

1. All you need

  • A shell  (in case you are on the dark side of the moon, all steps can be easily done in Eclipse or whatever IDE) 
  • Java (7)
  • Apache Maven (3.x)
  • git 

2. Checkout SolRDF code

Open a shell and type the following:

# cd /tmp
# git clone https://github.com/agazzarini/SolRDF.git solrdf-download

3. Build and Run SolRDF


# cd solrdf-download/solrdf
# mvn clean install
# cd solrdf-integration-tests
# mvn clean package cargo:run

The very first time you run this command a lot of things will be downloaded, Solr included. At the end you should see something like this:

[INFO] Jetty 7.6.15.v20140411 Embedded started on port [8080]
[INFO] Press Ctrl-C to stop the container...

SolRDF is up and running! 

4. Add some data

Open another shell and type the following:

# curl -v http://localhost:8080/solr/store/update/bulk?commit=true \
  -H "Content-Type: application/n-triples" \
  --data-binary @/tmp/solrdf-download/solrdf/src/test/resources/sample_data/bsbm-generated-dataset.nt 


Wait a moment...ok! You just added (about) 5000 triples!

5. Execute some query

Open another shell and type the following:

# curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+json"
...  

# curl "http://127.0.0.1:8080/solr/store/sparql" \
  --data-urlencode "q=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  -H "Accept: application/sparql-results+xml"
...


Et voilà! Enjoy! I'm still working on that...any suggestion about this idea is warmly welcome...and if you meet some annoying bug, feel free to give me a shout ;)

Monday, December 01, 2014

Loading RDF (i.e. custom) data in Solr


Update: SolRDF, a working example of the topic discussed in this post is here. Just 2 minutes and you will be able to index and query RDF data in Solr.

The Solr built-in UpdateRequestHandler supports several formats of input data. It delegates the actual data loading to a specific ContentStreamLoader, depending on the content type of the incoming request (i.e. the Content-type header of the HTTP request). Currently, these are the available content types declared in the UpdateRequestHandler class:
  • application/xml or text/xml
  • application/json or text/json
  • application/csv or text/csv
  • application/javabin
So, a client has several options for sending its data to Solr; all it needs to do is prepare the data in a specific format and call the UpdateRequestHandler (usually located at the /update endpoint), specifying the corresponding content type:

> curl http://localhost:8080/solr/update -H "Content-Type: text/json" --data-binary @/home/agazzarini/data.json

The UpdateRequestHandler can be extended, customized, and replaced; so we can write our own UpdateRequestHandler that accepts a custom format, adding a new content type or overriding the default set of supported content types.

In this brief post, I will describe how to use Jena to load RDF data in Solr, in any format supported by Jena IO API.
This is a quick and easy task mainly because:
  • the UpdateRequestHandler already has the logic to index data
  • the UpdateRequestHandler can be easily extended
  • Jena already provides all the parsers we need
So doing that is just a matter of subclassing UpdateRequestHandler in order to override the content type registry:

public class RdfDataUpdateRequestHandler extends UpdateRequestHandler {
    ...
    protected Map<String, ContentStreamLoader> createDefaultLoaders(final NamedList parameters) {
        final Map<String, ContentStreamLoader> registry =
                new HashMap<String, ContentStreamLoader>();
        final ContentStreamLoader loader = new RdfDataLoader();
        for (final Lang language : RDFLanguages.getRegisteredLanguages()) {
            registry.put(language.getContentType().toHeaderString(), loader);
        }
        return registry;
    }


As you can see, the registry is a simple Map that associates a content type (e.g. "application/xml") with an instance of ContentStreamLoader. For our example, since the different content types will always map to RDF data, we create a single instance of a dedicated ContentStreamLoader (RdfDataLoader); that instance is associated with all the content types registered in Jena. That means each time an incoming request has a content type like
  • text/turtle
  • application/turtle
  • application/x-turtle
  • application/rdf+xml
  • application/rdf+json
  • application/ld+json
  • text/plain (for n-triple)
  • application/n-triples
  • (others)
Our RdfDataLoader will be in charge of parsing and loading the data. Note that the above list is not exhaustive: there are a lot of other content types registered in Jena (see the RDFLanguages class).
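For instance, once the handler is registered (see the end of this post), indexing a Turtle file is just a matter of changing the Content-Type header; the file path below is, of course, only a placeholder:

> curl http://localhost:8080/solr/store/update -H "Content-Type: text/turtle" --data-binary @/path-to-your-data/data.ttl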

So, what about the format of the data? Of course, it still depends on the content type of your RDF data and, most importantly, it has nothing to do with the data we are used to sending to Solr (i.e. SolrInputDocuments serialized in some format).

The RdfDataLoader is a subclass of ContentStreamLoader

public class RdfDataLoader extends ContentStreamLoader

and, not surprisingly, it overrides the load() method:

public void load(
        final SolrQueryRequest request,
        final SolrQueryResponse response,
        final ContentStream stream,
        final UpdateRequestProcessor processor) throws Exception {

    final PipedRDFIterator<Triple> iterator = new PipedRDFIterator<Triple>();
    final PipedRDFStream<Triple> inputStream = new PipedTriplesStream(iterator);

    // We use an executor for running the parser in a separate thread
    final ExecutorService executor = Executors.newSingleThreadExecutor();

    final Runnable parser = new Runnable() {
        public void run() {
            try {
                RDFDataMgr.parse(
                    inputStream,
                    stream.getStream(),
                    RDFLanguages.contentTypeToLang(stream.getContentType()));
            } catch (final IOException exception) {
                ...
            }
        }
    };

    executor.submit(parser);

    while (iterator.hasNext()) {
        final Triple triple = iterator.next();

        // Create and populate the Solr input document
        final SolrInputDocument document = new SolrInputDocument();
        ...

        // Create the update command
        final AddUpdateCommand command = new AddUpdateCommand(request);

        // Populate it with the input document we just created
        command.solrDoc = document;

        // Add the document to the index
        processor.processAdd(command);
    }
}

That's all...now, once the request handler has been registered within Solr (i.e. in solrconfig.xml), we can send Solr a file containing RDF data in n-triples format with a command like this:

> curl http://localhost:8080/solr/store/update -H "Content-Type: application/n-triples" --data-binary @/home/agazzarini/triples_dogfood.nt
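As a side note, the registration mentioned above is the usual requestHandler declaration in solrconfig.xml. It could look more or less like the following sketch, where the endpoint name and the class package (a.b.c) are just placeholders for the actual ones:

<requestHandler name="/update" class="a.b.c.RdfDataUpdateRequestHandler"/>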


Monday, November 10, 2014

Preloading data at Solr startup

Yesterday, I was explaining something about Solr to a friend of mine. So I had several cores with different configurations and some sample data.

While switching from one example to another I realized that each (first) time I had to load and index the sample data manually. That was fine the very first time, but after that it was just repetitive work. So I started looking for some helpful built-in thing...and I found it (at least I think so): the SolrEventListener.

SolrEventListener is an interface that defines a set of callbacks on several Solr lifecycle events:
  • void postCommit()
  • void postSoftCommit()
  • void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher)
For this example, I'm not interested in the first two callbacks because the corresponding invocations happen, as their names suggest, after hard and soft commit events.
The interesting method is instead newSearcher(...), which allows me to register a custom event listener associated with two events:
  • firstSearcher
  • newSearcher
In Solr, the index searcher that serves requests at a given time is called the current searcher. At startup time there's no current searcher yet, because the first one is being created; hence we are in the "firstSearcher" event, which is exactly what I was looking for ;)

When another (i.e new) searcher is opened, it is prepared (i.e. auto-warmed) while the current one still serves the incoming requests. When the new searcher is ready, it will become the current searcher, it will handle any new search requests, and the old searcher will be closed (as soon as all requests it was servicing finish). This scenario is where the "newSearcher" callback is invoked.

As you can see, the callback method for those two events is the same: there's no separate "firstSearcher" or "newSearcher" method. The difference resides in the input arguments: for "firstSearcher" events there's no current searcher, so the second argument is null; this is obviously not true for "newSearcher" callbacks, where both arguments contain a valid searcher reference.
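In code, the distinction therefore looks more or less like this (just a minimal sketch of the callback described above):

public void newSearcher(final SolrIndexSearcher newSearcher, final SolrIndexSearcher currentSearcher) {
    if (currentSearcher == null) {
        // "firstSearcher" event: the very first searcher is being created (e.g. at startup time)
    } else {
        // "newSearcher" event: a new searcher is being warmed while the current one serves requests
    }
}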

Returning to my scenario, all I need is
  • to declare that listener in solrconfig.xml
  • a concrete implementation of SolrEventListener
In solrconfig.xml, within the <updateHandler> section, I can declare my listener:

<listener event="firstSearcher" class="a.b.c.SolrStartupListener">
    <str name="datafile">${solr.solr.home}/sample/data.xml</str>
</listener>

The listener will be initialized with just one parameter, the file that contains the sample data. Using the "event" attribute I can inform Solr about the kind of event I'm interested in (i.e. firstSearcher).

The implementation class is quite simple: it implements SolrEventListener:

public class SolrStartupListener implements SolrEventListener

in the init(...) method it retrieves the input argument:

@Override
public void init(final NamedList args) {
    this.datafile = (String) args.get("datafile");
}

last, the newSearcher method preloads the data:

    LocalSolrQueryRequest request = null;
    try {
        // 1. Create the arguments map for the update request
        final NamedList args = new SimpleOrderedMap();
        args.add(UpdateParams.ASSUME_CONTENT_TYPE, "text/xml");
        addEventParms(currentSearcher, args);

        // 2. Create a new Solr (update) request
        request = new LocalSolrQueryRequest(newSearcher.getCore(), args);

        // 3. Fill the request with the (datafile) input stream
        final List<ContentStream> streams = new ArrayList<ContentStream>();
        streams.add(new ContentStreamBase() {
            @Override
            public InputStream getStream() throws IOException {
                return new FileInputStream(datafile);
            }
        });
        request.setContentStreams(streams);

        // 4. Create a new Solr response
        final SolrQueryResponse response = new SolrQueryResponse();

        // 5. And finally invoke the update handler
        SolrRequestInfo.setRequestInfo(new SolrRequestInfo(request, response));

        newSearcher
            .getCore()
            .getRequestHandler("/update")
            .handleRequest(request, response);
    } finally {
        request.close();
    }


Et voilĂ , if you start Solr you will see sample data loaded. Other than avoiding me a lot of repetitive tasks, this could be useful when you're using a SolrCore as a NoSql storage, like for example if you are storing SKOS vocabularies for synonyns, translations and broader / narrower searches.