
Sunday, April 05, 2015

RDF Faceting: Query Facets + Range Facets = Object Ranges Queries Facets

Faceting on semi-structured data like RDF is definitely (at least for me) an interesting topic.

Issues #28 and #47 track the progress of that feature in SolRDF: RDF faceting.
I just committed a stable version of one of those kinds of faceting: Object Ranges Queries Facets (issue #28). A draft documentation about how faceting works in SolRDF can be found here.

In a preceding article I described how plain and basic SPOC faceting works; here I introduce this new type of faceting: Object Ranges Queries Facets.

Range faceting is a built-in feature of Solr: you can get these facets on all fields that support range queries (e.g. dates and numerics). For instance, asking for something like this:

facet.range=year
facet.range.start=2000
facet.range.end=2010
facet.range.gap=1

you will get the following response:

<lst name="facet_ranges">
    <lst name="year">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...  
       </lst>      
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    ...
Plain range faceting on an RDF schema? Mmm...

SolRDF indexes semi-structured data, so we don't have arbitrary fields like year, creation_date, price and so on... we always have the same fields:
  • s(ubject)
  • p(redicate)
  • o(bject) 
  • and optionally a c(ontext)
So here comes the question: how can I get the right domain values for my range facets? I don't have an explicit "year" or "price" or whatever attribute.
See the following data, which is a simple RDF representation of two projects (#xyz and #kyj):

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix abc: <http://a.b.c#> .
@prefix cde: <http://c.d.e#> .

<#xyz> 
    abc:start_year "2001"^^xsd:integer ;
    abc:end_year "2003"^^xsd:integer ;
    cde:first_prototype_date "2001-06-15"^^xsd:date ;
    cde:last_prototype_date "2002-06-30"^^xsd:date ;
    cde:release_date  "2003-10-10"^^xsd:date .

<#kyj> 
    abc:start_year "2002"^^xsd:integer ;
    abc:end_year "2007"^^xsd:integer ;
    cde:first_prototype_date "2003-09-27"^^xsd:date ;
    cde:last_prototype_date "2005-08-24"^^xsd:date ;
    cde:release_date  "2007-03-10"^^xsd:date .

The following table illustrates how the same data is indexed within Solr:
S(ubject)    P(redicate)                          O(bject)
#xyz         http://a.b.c#start_year              "2001"^^xsd:integer
#xyz         http://a.b.c#end_year                "2003"^^xsd:integer
#xyz         http://c.d.e#first_prototype_date    "2001-06-15"^^xsd:date
...

As you can see, the "logical" name of the attribute that each triple represents is in the P column, while the value of that attribute is in the O column. This is the main reason plain Solr range faceting wouldn't work here: a request like this:

facet.range=o

would mix apples and bananas. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date, datetime), how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Requesting the same thing for the s or p attributes doesn't make any sense at all, because their datatype (string) doesn't support this kind of faceting.

Object Ranges Queries Facets

In order to enable range faceting that makes sense for SolRDF, I replaced the default FacetComponent with a custom subclass that does something I called Object Ranges Queries Facets, which is actually a mix between range faceting and query faceting.
  • Object because the target field is the o(bject)
  • Facet because, of course, the final results are facets
  • Range because what we are going to compute are facet ranges
  • Queries because, instead of indicating the target attribute in the request (by means of the facet.range parameter), this kind of faceting requires a facet.range.q parameter: a query (parsed by default with the Solr query parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances); the ranges are then computed on top of them.
Returning to our example, we could issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
facet.range.start=2000
facet.range.end=2010
facet.range.gap=1

or like this:

facet.range.q=p:<http://c.d.e#release_date>
facet.range.start=2000-01-10T17:00:00Z
facet.range.end=2010-01-10T17:00:00Z
facet.range.gap=+1YEAR
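
As a side note, the same requests can also be built programmatically. The following SolrJ sketch reproduces the first example above (the core URL, the catch-all *:* main query and the SolrJ 4.x HttpSolrServer API are assumptions of mine; facet.range.q is set through the generic parameter API, being a SolRDF-specific parameter):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ObjectRangeFacetExample {
    public static void main(final String[] args) throws Exception {
        // Endpoint assumed: adjust host, port and core name to your install.
        final HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/store");
        final SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        // SolRDF-specific parameter: the query that selects the target objects.
        query.set("facet.range.q", "p:<http://a.b.c#start_year>");
        query.set("facet.range.start", "2000");
        query.set("facet.range.end", "2010");
        query.set("facet.range.gap", "1");
        final QueryResponse response = solr.query(query);
        // The computed ranges come back within the usual facet_counts section.
        System.out.println(response.getResponse().get("facet_counts"));
    }
}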

You can have more than one facet.range.q parameter; in that case, the facet response will look like this:

<lst name="facet_ranges">
    <lst name="p:<http://a.b.c#start_year>">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="p:<http://c.d.e#release_date>">
      <lst name="counts">
          <int name="2000-01-10T17:00:00Z">2516</int>
          <int name="2001-01-10T17:00:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...

You can do more with request parameters, query aliasing and shared parameters; please have a look at the SolRDF Wiki.
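
Just to give the idea, a request with two aliased range queries might look like the following sketch (the numbered suffixes and the alias parameter names below are my assumptions; please double-check the exact names on the Wiki):

facet.range.q.1=p:<http://a.b.c#start_year>
facet.range.q.alias.1=start_year
facet.range.q.2=p:<http://c.d.e#release_date>
facet.range.q.alias.2=release_date

With aliases in place, each facet would be keyed in the response by its alias (e.g. "start_year") instead of the raw query string.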

As usual, feedback is warmly welcome ;)

Monday, December 01, 2014

Loading RDF (i.e. custom) data in Solr


Update: SolRDF, a working example of the topic discussed in this post, is here. In just two minutes you will be able to index and query RDF data in Solr.

The Solr built-in UpdateRequestHandler supports several formats of input data. It delegates the actual data loading to a specific ContentStreamLoader, depending on the content type of the incoming request (i.e. the Content-Type header of the HTTP request). Currently, these are the available content types declared in the UpdateRequestHandler class:
  • application/xml or text/xml
  • application/json or text/json
  • application/csv or text/csv
  • application/javabin
So a client has several options for sending its data to Solr; all it needs to do is prepare the data in one of those formats and call the UpdateRequestHandler (usually located at the /update endpoint), specifying the corresponding content type:

> curl http://localhost:8080/solr/update -H "Content-Type: text/json" --data-binary @/home/agazzarini/data.json

The UpdateRequestHandler can be extended, customized, and replaced; so we can write our own UpdateRequestHandler that accepts a custom format, by adding a new content type or overriding the default set of supported content types.

In this brief post, I will describe how to use Jena to load RDF data into Solr, in any format supported by the Jena IO API.
This is a quick and easy task, mainly because:
  • the UpdateRequestHandler already has the logic to index data
  • the UpdateRequestHandler can be easily extended
  • Jena already provides all the parsers we need
So doing that is just a matter of subclassing UpdateRequestHandler in order to override the content type registry:

public class RdfDataUpdateRequestHandler extends UpdateRequestHandler {
    ...
    @Override
    protected Map<String, ContentStreamLoader> createDefaultLoaders(final NamedList parameters) {
        final Map<String, ContentStreamLoader> registry = new HashMap<String, ContentStreamLoader>();
        // A single loader instance is shared among all RDF content types.
        final ContentStreamLoader loader = new RdfDataLoader();
        for (final Lang language : RDFLanguages.getRegisteredLanguages()) {
            registry.put(language.getContentType().toHeaderString(), loader);
        }
        return registry;
    }
}


As you can see, the registry is a simple Map that associates a content type (e.g. "application/xml") with an instance of ContentStreamLoader. For our example, since the different content types will always map to RDF data, we create a single instance of a dedicated ContentStreamLoader (RdfDataLoader); that instance is associated with all content types registered in Jena. That means each time an incoming request has a content type like
  • text/turtle
  • application/turtle
  • application/x-turtle
  • application/rdf+xml
  • application/rdf+json
  • application/ld+json
  • text/plain (for N-Triples)
  • application/n-triples
  • (others)
our RdfDataLoader will be in charge of parsing and loading the data. Note that the above list is not exhaustive: there are a lot of other content types registered in Jena (see the RDFLanguages class).

So, what about the format of the data? Of course, it still depends on the content type of your RDF data and, most importantly, it has nothing to do with the data we are used to sending to Solr (i.e. SolrInputDocuments serialized in some format).
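
Just to be concrete, a file in N-Triples format (content type application/n-triples) is simply a list of triples, one per line, like this (both the URIs and the values below are made up):

<http://e.org#xyz> <http://a.b.c#start_year> "2001"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://e.org#xyz> <http://a.b.c#end_year> "2003"^^<http://www.w3.org/2001/XMLSchema#integer> .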

The RdfDataLoader is a subclass of ContentStreamLoader

public class RdfDataLoader extends ContentStreamLoader

and, not surprisingly, it overrides the load() method:

@Override
public void load(
        final SolrQueryRequest request,
        final SolrQueryResponse response,
        final ContentStream stream,
        final UpdateRequestProcessor processor) throws Exception {

    // The iterator and the stream are the two ends of a pipe: the parser
    // pushes triples into the stream, while this thread consumes them.
    final PipedRDFIterator<Triple> iterator = new PipedRDFIterator<Triple>();
    final PipedRDFStream<Triple> inputStream = new PipedTriplesStream(iterator);

    // We use an executor for running the parser in a separate thread.
    final ExecutorService executor = Executors.newSingleThreadExecutor();

    final Runnable parser = new Runnable() {
        public void run() {
            try {
                RDFDataMgr.parse(
                    inputStream,
                    stream.getStream(),
                    RDFLanguages.contentTypeToLang(stream.getContentType()));
            } catch (final IOException exception) {
                ...
            }
        }
    };

    executor.submit(parser);

    while (iterator.hasNext()) {
        final Triple triple = iterator.next();

        // Create and populate the Solr input document.
        final SolrInputDocument document = new SolrInputDocument();
        ...

        // Create the update command.
        final AddUpdateCommand command = new AddUpdateCommand(request);

        // Populate it with the input document we just created.
        command.solrDoc = document;

        // Add the document to the index.
        processor.processAdd(command);
    }
}
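
The population of the SolrInputDocument has been intentionally elided above. Just to give the idea, a minimal sketch could map each member of the triple to the s, p and o fields (this is my sketch, not necessarily the actual SolRDF mapping; NodeFmtLib is the Jena helper that serializes nodes):

// Within the loop above, in place of the first ellipsis: each member
// of the triple goes into the corresponding SPO field.
document.setField("s", NodeFmtLib.str(triple.getSubject()));
document.setField("p", NodeFmtLib.str(triple.getPredicate()));
document.setField("o", NodeFmtLib.str(triple.getObject()));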

That's all! Now the request handler needs to be registered within Solr (i.e. in solrconfig.xml).
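
A minimal sketch of that registration could look like this (the package name below is just a placeholder, use your own):

<requestHandler name="/update" class="org.example.solr.RdfDataUpdateRequestHandler"/>

Once the handler is registered, with a file containing RDF data in N-Triples format, we can send Solr a command like this: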

> curl http://localhost:8080/solr/store/update -H "Content-Type: application/n-triples" --data-binary @/home/agazzarini/triples_dogfood.nt


Monday, August 18, 2014

Jena-nosql: A NoSQL adapter for Apache Jena

A few days ago I started this project on GitHub.


The overall design revolves around the Abstract Factory design pattern [1].
As you can see from the following diagram, the StorageLayerFactory class plays the role of the Abstract Factory and therefore defines the contract that each concrete implementor (i.e. family) must fulfil in order to create concrete products for a specific kind of storage.



On top of that, each binding module defines the "concrete" layer that is in charge of providing:
  • an implementation of the StorageLayerFactory (i.e. the Concrete Factory);
  • a concrete implementation of each (abstract) product defined in the diagram above (i.e. the Concrete Products); see the sketch after this list.
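
Just to give a concrete flavour of the pattern, here is a minimal, self-contained sketch of the factory contract. All names below are illustrative assumptions of mine, not necessarily jena-nosql's actual API:

// An abstract product: the data access object towards the triple store.
interface TripleIndexDAO {
    void insertTriple(String subject, String predicate, String object);
}

// The Abstract Factory: the contract each storage family must fulfil.
abstract class StorageLayerFactory {
    public abstract TripleIndexDAO getTripleIndexDAO();
}

// A Concrete Product and its Concrete Factory for the "Cassandra" family.
class CassandraTripleIndexDAO implements TripleIndexDAO {
    @Override
    public void insertTriple(final String subject, final String predicate, final String object) {
        // Cassandra-specific persistence code would go here.
    }
}

class CassandraStorageLayerFactory extends StorageLayerFactory {
    @Override
    public TripleIndexDAO getTripleIndexDAO() {
        return new CassandraTripleIndexDAO();
    }
}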
Here you can see the same diagram as above, but with the "Cassandra" family members (note that only 4 members are shown, in order to simplify the diagram):

 
Feel free to take a look and let me know your thoughts.
Gazza