Sunday, April 19, 2015

RDF Faceting with Apache Solr: SolRDF

"Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters."
(Source: Wikipedia)

Apache Solr's built-in faceting capabilities are nicely described in the official Solr Reference Guide [1] and in the Solr Wiki [2].
In SolRDF, due to the nature of the underlying data, faceted search assumes a shape which is a bit different from traditional faceting over structured data. For instance, while in a traditional Solr schema we could have something like this:

<field name="title" .../>
<field name="author" .../>
<field name="publisher" .../>
<field name="publication_year" .../>
<field name="isbn" .../>
<field name="subject" .../>
...


In SolRDF, data is always represented as a set of triples, that is, assertions (aka statements) that describe the state of a given entity by means of three or four members: a subject, a predicate, an object and an optional context. The underlying schema, which is described in more detail in a dedicated section of the SolRDF Wiki [3], is, simplifying, something like this:

<!-- Subject -->
<field name="s" .../>

<!-- Predicate -->
<field name="p" .../>
 
<!-- Object -->
<field name="o" .../>

A "book" entity would be represented, in RDF, in the following way:

<#xyz>
    dc:title "La Divina Commedia" ;  
    dc:creator "Dante Alighieri" ;
    dc:publisher "ABCD Publishing";
    ...
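
To make the mapping between the two representations concrete, here is a minimal Python sketch that flattens the book above into one s/p/o document per statement. The function name and the plain-string values are assumptions made for illustration; how SolRDF actually encodes and stores each member is not shown in this post.

```python
# Dublin Core namespace: the "dc:" prefix in the example above expands to this.
DC = "http://purl.org/dc/elements/1.1/"


def to_documents(subject, properties):
    """Turn one entity into a list of s/p/o documents, one per statement."""
    return [
        {"s": subject, "p": DC + name, "o": value}
        for name, value in properties
    ]


# The "book" entity from the RDF snippet above (only the statements shown there).
book = to_documents(
    "#xyz",
    [
        ("title", "La Divina Commedia"),
        ("creator", "Dante Alighieri"),
        ("publisher", "ABCD Publishing"),
    ],
)

for document in book:
    print(document)
```

Each statement becomes its own document, so the "author" of the book is no longer a field name: it is the value of p in one of the documents.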


A faceted search makes sense only when the target aggregation field or criteria leads to a literal value, a number, something that can be aggregated. That's the reason you will see, in a traditional Solr that indexes books, a request like this: 

facet=true 
&facet.field=year 
&facet.field=subject  
&facet.field=author
 
In the example above, we are requesting facets for three fields: year, subject and author.

In SolRDF we don't have "dedicated" fields like year or author; we always have s, p, o and an optional c. Faceting on s, p and c, although perfectly possible using plain Solr facet fields (e.g. facet.field=s&facet.field=p), doesn't make much sense because their values are always URIs or blank nodes.

Instead, the field where faceting reveals its power is the object. But again, asking for plain faceting on the o field (i.e. facet.field=o) will result in a facet that aggregates apples and bananas: each object has a different meaning, and it could have a different domain and data type. We need a way to identify a given range of objects.

In RDF, what determines the range of the object of a given triple is the second member, the predicate. So instead of indicating the target field of a given facet, we indicate a query that selects a given range of object values. An example will make this clearer.
Solr (field) faceting:

facet=true&facet.field=author

SolRDF (field) faceting:

facet=true&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>

The query selects all triples whose predicate is dc:creator, and faceting is then computed on the objects of those triples, that is, on the author names. The same concept can be applied to range faceting.
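
The difference between a plain facet on o and a facet restricted by a predicate can be sketched in a few lines of Python. This is only a conceptual, in-memory illustration: the sample triples are invented, and the real computation happens inside Solr, not in client code.

```python
from collections import Counter

DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample triples: (subject, predicate, object).
triples = [
    ("#b1", DC + "creator", "Dante Alighieri"),
    ("#b1", DC + "title", "La Divina Commedia"),
    ("#b2", DC + "creator", "Dante Alighieri"),
    ("#b2", DC + "title", "Vita Nuova"),
    ("#b3", DC + "creator", "Giovanni Boccaccio"),
]


def facet_all_objects(triples):
    """Plain facet on o: every object is counted, whatever it means."""
    return Counter(o for _, _, o in triples)


def facet_objects(triples, predicate):
    """Facet on o, restricted to the triples that have the given predicate."""
    return Counter(o for _, p, o in triples if p == predicate)


mixed = facet_all_objects(triples)
authors = facet_objects(triples, DC + "creator")

print(mixed.most_common())    # titles and authors end up in the same facet
print(authors.most_common())  # only author names
```

The first facet mixes titles and authors; the second one contains only the values that share the same meaning, because the predicate acts as the selector.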

Facet Fields


Traditional field faceting is supported on SolRDF: any field (remember: s, p, o or c) can be treated as a facet by means of the facet.field parameter. All other parameters described in the Solr Reference Guide [1] are supported. Some examples:

Ex #1: field faceting on predicates with a minimum count of 1

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.mincount=1   

Ex #2: field faceting on subjects and predicates with a different minimum count


q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.field=p
&facet.field=s
&f.s.facet.mincount=1
&f.p.facet.mincount=10

Ex #3: field faceting on predicates with a prefix (Dublin Core namespace) and minimum count constraints

 

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.field=p 
&facet.prefix=<http://purl.org/dc
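
When these requests are sent by a program, the parameters must be URL-encoded, and repeated parameters such as facet.field must be preserved. The following Python sketch builds the query string of Ex #2 with the standard library; the helper name and its arguments are my own assumptions, and the SolRDF endpoint is deliberately left out.

```python
from urllib.parse import parse_qs, urlencode


def facet_field_query(sparql, fields, mincounts=None, prefix=None):
    """Build the query string for a field-faceting request.

    mincounts maps a field name to its own minimum count, which is sent as
    the per-field override f.<field>.facet.mincount.
    """
    params = [("q", sparql), ("facet", "true")]
    params += [("facet.field", field) for field in fields]
    for field, count in (mincounts or {}).items():
        params.append((f"f.{field}.facet.mincount", str(count)))
    if prefix is not None:
        params.append(("facet.prefix", prefix))
    return urlencode(params)


# Ex #2 above: subjects and predicates with different minimum counts.
query_string = facet_field_query(
    "SELECT * WHERE { ?s ?p ?o }",
    fields=["p", "s"],
    mincounts={"s": 1, "p": 10},
)
print(query_string)

# Decoding the string again shows that repeated parameters are preserved.
decoded = parse_qs(query_string)
print(decoded["facet.field"])
```

A list of pairs is used instead of a dictionary because a dictionary could hold only one facet.field entry.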
   

Object Queries Faceting


Object queries faceting has basically the same meaning as field faceting: the only difference is that, instead of indicating a target field, faceting is always done on the o(bject) field, and you indicate, with a query, which objects will be faceted. Some examples:

Ex #1: faceting on publishers


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/publisher>


Ex #2: faceting on names (creators or collaborators)


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator> p:<http://purl.org/dc/elements/1.1/collaborator>


Ex #3: faceting on relationships of a given resource


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=s:<http://example.org#xyz> p:<http://purl.org/dc/elements/1.1/relation>


The facet.object.q parameter can be repeated, using an optional progressive number as a suffix in the parameter name:

Ex #4: faceting on creators and languages


q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q=p:<http://purl.org/dc/elements/1.1/language>


or

q=SELECT * WHERE { ?s ?p ?o } 
&facet=true 
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
 
In this case you will get a facet for each query, keyed using the query itself:

<lst name="facet_counts">
 <lst name="facet_fields">
   <lst name="p:<http://purl.org/dc/elements/1.1/creator>">
     <int name="Ross, Karlint">12</int>
     <int name="Earl, James">9</int>
     <int name="Foo, John">9</int>
     ...
   </lst>
   <lst name="p:<http://purl.org/dc/elements/1.1/language>">
     <int name="en">3445</int>
     <int name="de">2958</int>
     <int name="it">2865</int>
     ...
   </lst>
 </lst>
</lst>

The suffix in the parameter name is not required, but it is useful to indicate an alias for each query:

q=SELECT * WHERE { ?s ?p ?o }
&facet=true
&facet.object.q.1=p:<http://purl.org/dc/elements/1.1/creator>
&facet.object.q.2=p:<http://purl.org/dc/elements/1.1/language>
&facet.object.q.alias.1=author
&facet.object.q.alias.2=language

The response in this case will be (note that each facet is now associated with the alias):

<lst name="facet_counts">
 <lst name="facet_fields">
   <lst name="author">
     <int name="Ross, Karlint">12</int>
     <int name="Earl, James">9</int>
     <int name="Foo, John">9</int>
     ...
   </lst>
   <lst name="language">
     <int name="en">3445</int>
     <int name="de">2958</int>
     <int name="it">2865</int>
     ...
   </lst>
 </lst>
</lst>
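
On the client side, an aliased response like the one above is easy to turn into plain dictionaries. The sketch below uses Python's standard ElementTree module on a copy of the sample response (without the elided "..." entries); the function name is an assumption, and the XML is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# The aliased response shown above, limited to the entries that are listed.
SAMPLE = """
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="author">
      <int name="Ross, Karlint">12</int>
      <int name="Earl, James">9</int>
      <int name="Foo, John">9</int>
    </lst>
    <lst name="language">
      <int name="en">3445</int>
      <int name="de">2958</int>
      <int name="it">2865</int>
    </lst>
  </lst>
</lst>
"""


def parse_facet_fields(xml_text):
    """Return {facet name: {value: count}} for every facet in facet_fields."""
    root = ET.fromstring(xml_text)
    facets = {}
    # ".//" finds facet_fields wherever it is nested below the root element.
    for facet in root.findall(".//lst[@name='facet_fields']/lst"):
        facets[facet.get("name")] = {
            entry.get("name"): int(entry.text)
            for entry in facet.findall("int")
        }
    return facets


facets = parse_facet_fields(SAMPLE)
for name, counts in facets.items():
    print(name, counts)
```

With aliases the dictionary keys are short and stable ("author", "language"), which is the practical reason for using them.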



Object Range Queries Faceting


Range faceting is described in the Solr Reference Guide [1] or in the Solr Wiki [2]. You can get this kind of facet on all fields that support range queries (e.g. dates and numerics).
A request like this:

facet.range=year
&facet.range.start=2000
&facet.range.end=2015
&facet.range.gap=1

will produce a response like this:


<lst name="facet_ranges">
    <lst name="year">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2015</int>
    </lst>
    ...

 

As briefly explained before, with semi-structured data like RDF we don't have a field strictly dedicated to a given concept, such as "year" or "price"; we always have 3 or 4 fields:
  • a s(ubject)
  • a p(redicate)
  • an o(bject)
  • and optionally a c(ontext)
Requesting something like this:

facet.range=o

wouldn't work: we would again mix apples and bananas. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date), how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Range faceting for s or p or c attributes doesn't make any sense at all because the corresponding URI datatype (i.e. string) doesn't support range queries.

In order to enable range faceting on SolRDF, the default FacetComponent has been replaced with a custom subclass that does something I called Object Range Queries Faceting, which is actually a mix between facet ranges and facet queries.
  • Facet because, of course, the final results are a set of facets
  • Object because faceting uses the o(bject) field
  • Range because what we are going to compute are facet ranges
  • Queries because instead of indicating the target attribute in request (by means of facet.range parameter), this kind of faceting requires a facet.range.q which is a query (by default parsed by the Solr Query Parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on them.
In this way, we can issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
&facet.range.start=2000
&facet.range.end=2010
&facet.range.gap=1

or like this

facet.range.q=p:<http://c.d.e#release_date>  
&facet.range.start=2000-01-10T17:00:00Z 
&facet.range.end=2010-01-10T17:00:00Z
&facet.range.gap=+1MONTH
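
Conceptually, the first request above does something like the following Python sketch: select the triples that match the query, then count their objects per range. The sample triples are invented, the function name is mine, and each range is assumed to include its lower bound and exclude its upper bound; this is not the code of the SolRDF component.

```python
START_YEAR = "http://a.b.c#start_year"
END_YEAR = "http://a.b.c#end_year"

# Hypothetical sample triples: (subject, predicate, object).
triples = [
    ("#a", START_YEAR, "2001"),
    ("#b", START_YEAR, "2002"),
    ("#c", START_YEAR, "2001"),
    ("#a", END_YEAR, "2003"),    # different predicate: ignored
    ("#d", START_YEAR, "2015"),  # outside [start, end): ignored
]


def object_range_facet(triples, predicate, start, end, gap):
    """Count the integer objects of the matching triples per range.

    Each range is [lower, lower + gap): lower bound included, upper excluded.
    """
    counts = {lower: 0 for lower in range(start, end, gap)}
    for _, p, o in triples:
        if p != predicate:
            continue
        value = int(o)
        if start <= value < end:
            lower = start + ((value - start) // gap) * gap
            counts[lower] += 1
    return counts


counts = object_range_facet(triples, START_YEAR, start=2000, end=2010, gap=1)
for lower, count in counts.items():
    print(lower, count)
```

The predicate plays the role that the field name plays in plain Solr range faceting: it guarantees that all the values being bucketed belong to the same domain.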

You can also have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
    <lst name="p:<http://a.b.c#start_year>">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="p:<http://c.d.e#release_date>">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...


Aliasing is supported in the same way described for Object Queries Faceting. The same request above with aliases would be:

facet.range.q.1=p:<http://a.b.c#start_year>
&facet.range.q.alias.1=start_year_alias
&facet.range.q.hint.1=num    <-- optional, as num(eric) is the default value
&facet.range.start.1=2000
&facet.range.end.1=2010
&facet.range.gap.1=1
&facet.range.q.2=p:<http://c.d.e#release_date>
&facet.range.q.alias.2=release_date_alias
&facet.range.q.hint.2=date
&facet.range.start.2=2000-01-10T17:00:00Z
&facet.range.end.2=2010-01-10T17:00:00Z
&facet.range.gap.2=+1MONTH


Note in the response the aliases instead of the full queries:


<lst name="facet_ranges">
    <lst name="start_year_alias">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="release_date_alias">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...
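
The numbered parameters of the aliased request shown earlier in this section are tedious to write by hand. As a sketch, they can be generated from a list of facet descriptions; the dictionary keys used here (q, alias, hint, start, end, gap) are my own convention for the example, not a SolRDF API.

```python
from urllib.parse import parse_qsl, urlencode


def range_facet_params(facets):
    """Build the numbered facet.range.* parameters, one group per facet."""
    params = []
    for index, facet in enumerate(facets, start=1):
        params.append((f"facet.range.q.{index}", facet["q"]))
        params.append((f"facet.range.q.alias.{index}", facet["alias"]))
        if "hint" in facet:  # optional in the request shown earlier
            params.append((f"facet.range.q.hint.{index}", facet["hint"]))
        for name in ("start", "end", "gap"):
            params.append((f"facet.range.{name}.{index}", str(facet[name])))
    return params


params = range_facet_params([
    {
        "q": "p:<http://a.b.c#start_year>",
        "alias": "start_year_alias",
        "start": 2000, "end": 2010, "gap": 1,
    },
    {
        "q": "p:<http://c.d.e#release_date>",
        "alias": "release_date_alias",
        "hint": "date",
        "start": "2000-01-10T17:00:00Z",
        "end": "2010-01-10T17:00:00Z",
        "gap": "+1MONTH",
    },
])

query_string = urlencode(params)
print(query_string)

# The "+" of the gap must be percent-encoded, otherwise it decodes as a space.
round_trip = parse_qsl(query_string)
print(round_trip == params)
```

The encoding step matters for date gaps in particular: a literal "+" in a query string is read as a space, so "+1MONTH" has to travel as "%2B1MONTH".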

Here you can find a sample response containing all facets described above.

You can find the same content of this post in the SolRDF Wiki [3]. As usual any feedback is warmly welcome!

-------------------------------------
[1] https://cwiki.apache.org/confluence/display/solr/Faceting
[2] https://wiki.apache.org/solr/SolrFacetingOverview
[3] https://github.com/agazzarini/SolRDF/wiki

Sunday, April 05, 2015

RDF Faceting: Query Facets + Range facets = Object Ranges Queries Facets

Faceting on semi-structured data like RDF is definitely (at least for me) an interesting topic.

Issues #28 and #47 track the progress of that feature on SolRDF: RDF Faceting.
I just committed a stable version of one of those kinds of faceting: Object Ranges Queries Facets (issue #28). You can find here draft documentation about how faceting works in SolRDF.

In a preceding article I described how a plain and basic SPOC faceting works; here I introduce this new type of faceting: Object Ranges Queries Facets.

Range faceting is already a built-in feature in Solr: you can get these facets on all fields that support range queries (e.g. dates and numerics). For instance, asking for something like this:

facet.range=year
facet.range.start=2000
facet.range.end=2015
facet.range.gap=1

you will get the following response:

<lst name="facet_ranges">
    <lst name="year">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...  
       </lst>      
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2015</int>
    </lst>
    ...

Plain range faceting on RDF schema? mmm....

SolRDF indexes semi-structured data, so we don't have arbitrary fields like year, creation_date, price and so on; we always have these fields:
  • s(ubject)
  • p(redicate)
  • o(bject) 
  • and optionally a c(ontext)
So here comes the question: how can I get the right domain values for my range facets? I don't have an explicit "year" or "price" or whatever attribute.
See the following data, which is a simple RDF representation of two projects (#xyz and #kyj):

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix abc: <http://a.b.c#> .
@prefix cde: <http://c.d.e#> .

<#xyz>
    abc:start_year "2001"^^xsd:integer ;
    abc:end_year "2003"^^xsd:integer ;
    cde:first_prototype_date "2001-06-15"^^xsd:date ;
    cde:last_prototype_date "2002-06-30"^^xsd:date ;
    cde:release_date  "2003-10-10"^^xsd:date .

<#kyj>
    abc:start_year "2002"^^xsd:integer ;
    abc:end_year "2007"^^xsd:integer ;
    cde:first_prototype_date "2003-09-27"^^xsd:date ;
    cde:last_prototype_date "2005-08-24"^^xsd:date ;
    cde:release_date  "2007-03-10"^^xsd:date .

The following table illustrates how the same data is indexed within Solr:

S(ubject)   P(redicate)                          O(bject)
#xyz        http://a.b.c#start_year              "2001"^^xsd:integer
#xyz        http://a.b.c#end_year                "2003"^^xsd:integer
#xyz        http://c.d.e#first_prototype_date    "2001-06-15"^^xsd:date
...

As you can see, the "logical" name of the attribute that each triple represents is in the P column, while the value of that attribute is in the O cell. This is the main reason the plain Solr range faceting here wouldn't work: a request like this:

facet.range=o

would mix apples and bananas. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date, datetime) how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Requesting the same thing for s or p attributes doesn't make any sense at all because the datatype (string) doesn't support this kind of faceting. 

Object Ranges Queries Facets

In order to enable a range faceting that makes sense on SolRDF, I replaced the default FacetComponent with a custom subclass that does something I called Object Ranges Queries Facets, which is actually a mix between facet ranges and facet queries.
  • Object because the target field is the o(bject)
  • Facet because, of course, the final results are facets
  • Range because what we are going to compute are facet ranges
  • Queries because instead of indicating the target attribute in request (by means of facet.range parameter), this kind of faceting requires a facet.range.q which is a query (by default parsed by the Solr Query Parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on top of them.
Returning to our example, we could issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
facet.range.start=2000
facet.range.end=2010
facet.range.gap=1

or like this

facet.range.q=p:<http://c.d.e#release_date>
facet.range.start=2000-01-10T17:00:00Z
facet.range.end=2010-01-10T17:00:00Z
facet.range.gap=+1MONTH
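
To see what a date-based facet of this kind computes, here is a small Python sketch that applies it to the release dates of the two projects above. It works on plain dates instead of full timestamps and expresses the gap as a number of months (12 for a yearly gap), so it is a simplification of Solr's date math, and all names are mine.

```python
from datetime import date

RELEASE_DATE = "http://c.d.e#release_date"

# Some of the triples of the two projects shown above.
triples = [
    ("#xyz", "http://a.b.c#start_year", "2001"),
    ("#xyz", RELEASE_DATE, "2003-10-10"),
    ("#kyj", "http://a.b.c#start_year", "2002"),
    ("#kyj", RELEASE_DATE, "2007-03-10"),
]


def add_months(day, months):
    """Shift a date by whole months (the day must exist in the target month)."""
    total = day.year * 12 + (day.month - 1) + months
    return date(total // 12, total % 12 + 1, day.day)


def date_range_facet(triples, predicate, start, end, gap_months):
    """Count the date objects of the matching triples per range."""
    bounds = []
    lower = start
    while lower < end:
        bounds.append(lower)
        lower = add_months(lower, gap_months)
    counts = {bound: 0 for bound in bounds}
    for _, p, o in triples:
        if p != predicate:
            continue
        value = date.fromisoformat(o)
        if start <= value < end:
            # The range of a value starts at the latest bound not after it.
            counts[max(b for b in bounds if b <= value)] += 1
    return counts


counts = date_range_facet(
    triples, RELEASE_DATE, date(2000, 1, 10), date(2010, 1, 10), gap_months=12
)
for lower, count in counts.items():
    print(lower.isoformat(), count)
```

The start_year triples are skipped because their predicate doesn't match, which is exactly why the query is needed: without it, years and dates would land in the same facet.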

You can have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
    <lst name="p:<http://a.b.c#start_year>">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="p:<http://c.d.e#release_date>">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...

You can do more with request parameters, query aliasing and shared parameters. Please have a look at the SolRDF Wiki.

As usual, feedback is warmly welcome ;)