Update: SolRDF, a working example of the topic discussed in this post, is here. In just 2 minutes you will be able to index and query RDF data in Solr.
The Solr built-in UpdateRequestHandler supports several formats of input data. It delegates the actual data loading to a specific ContentStreamLoader, depending on the content type of the incoming request (i.e. the Content-Type header of the HTTP request). Currently, these are the available content types declared in the UpdateRequestHandler class:
- application/xml or text/xml
- application/json or text/json
- application/csv or text/csv
- application/javabin
So a client has several options for sending its data to Solr; all it needs to do is prepare the data in a specific format and call the UpdateRequestHandler (usually located at the /update endpoint), specifying the corresponding content type:
> curl http://localhost:8080/solr/update -H "Content-Type: text/json" --data-binary @/home/agazzarini/data.json
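For clients that are not shell scripts, the same call can be sketched in Java. This is a minimal sketch, assuming Java 11 or later and its built-in java.net.http client; the class name SolrUpdateClient and the helper buildUpdateRequest are hypothetical names introduced only for illustration, and the endpoint is simply whatever URL you would pass to curl.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class (not part of Solr): posts a payload to an update endpoint.
class SolrUpdateClient {

    /** Builds a POST request whose Content-Type header tells Solr how to read the body. */
    static HttpRequest buildUpdateRequest(final URI endpoint, final String contentType, final byte[] payload) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofByteArray(payload))
                .build();
    }

    public static void main(final String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: SolrUpdateClient <update-url> <content-type> <file>");
            return;
        }
        final HttpRequest request = buildUpdateRequest(
                URI.create(args[0]),
                args[1],
                Files.readAllBytes(Path.of(args[2])));

        // Send the request and print whatever the server answers.
        final HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Invoked with the URL, content type and file from the curl command above, it should behave like that command, reading the whole file into memory before sending it.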
The UpdateRequestHandler can be extended, customized, and replaced, so we can write our own UpdateRequestHandler that accepts a custom format, adding a new content type or overriding the default set of supported content types.
In this brief post, I will describe how to use Jena to load RDF data into Solr, in any format supported by the Jena IO API.
This is a quick and easy task mainly because:
- the UpdateRequestHandler already has the logic to index data
- the UpdateRequestHandler can be easily extended
- Jena already provides all the parsers we need
So doing that is just a matter of subclassing UpdateRequestHandler in order to override the content type registry:
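import java.util.HashMap;
import java.util.Map;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFLanguages;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.UpdateRequestHandler;
import org.apache.solr.handler.loader.ContentStreamLoader;

public class RdfDataUpdateRequestHandler extends UpdateRequestHandler {
    ...
    protected Map<String, ContentStreamLoader> createDefaultLoaders(NamedList parameters) {
        final Map<String, ContentStreamLoader> registry
            = new HashMap<String, ContentStreamLoader>();
        // A single loader instance is shared by every RDF content type
        final ContentStreamLoader loader = new RdfDataLoader();
        for (final Lang language : RDFLanguages.getRegisteredLanguages()) {
            registry.put(language.getContentType().toHeaderString(), loader);
        }
        return registry;
    }
}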
As you can see, the registry is a simple Map that associates a content type (e.g. "application/xml") with an instance of ContentStreamLoader. For our example, since the different content types will always map to RDF data, we create an instance of a dedicated ContentStreamLoader (RdfDataLoader) once; that instance will be associated with all built-in content types in Jena. That means that each time an incoming request has a content type such as
- text/turtle
- application/turtle
- application/x-turtle
- application/rdf+xml
- application/rdf+json
- application/ld+json
- text/plain (for N-Triples)
- application/n-triples
- (others)
our RdfDataLoader will be in charge of parsing and loading the data. Note that the above list is not exhaustive: there are a lot of other content types registered in Jena (see the RDFLanguages class).
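The "one loader, many content types" idea can be tried out without Solr or Jena on the classpath. The following is only a sketch of the lookup mechanism: LoaderRegistry and its nested Loader interface are hypothetical stand-ins, not Solr classes, and stripping parameters such as "; charset=utf-8" before the lookup is an assumption of this sketch rather than a statement about what Solr does.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the content type registry described above.
class LoaderRegistry {

    /** Hypothetical stand-in for Solr's ContentStreamLoader. */
    interface Loader {
        String load(String body);
    }

    private final Map<String, Loader> registry = new HashMap<>();

    /** Associates the same loader instance with every given content type. */
    void registerAll(final List<String> contentTypes, final Loader loader) {
        for (final String contentType : contentTypes) {
            registry.put(normalize(contentType), loader);
        }
    }

    /** Returns the loader registered for the header value, or null if there is none. */
    Loader lookup(final String contentTypeHeader) {
        return contentTypeHeader == null ? null : registry.get(normalize(contentTypeHeader));
    }

    /** Keeps only the media type: drops parameters, surrounding spaces and case differences. */
    static String normalize(final String header) {
        final int semicolon = header.indexOf(';');
        final String mediaType = semicolon >= 0 ? header.substring(0, semicolon) : header;
        return mediaType.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        final Loader rdfLoader = body -> "loaded " + body.length() + " characters of RDF";
        final LoaderRegistry registry = new LoaderRegistry();
        registry.registerAll(
                List.of("text/turtle", "application/rdf+xml", "application/n-triples"),
                rdfLoader);

        // Different RDF content types resolve to the very same loader instance.
        System.out.println(registry.lookup("text/turtle") == registry.lookup("application/n-triples"));
        // An unregistered content type resolves to no loader at all.
        System.out.println(registry.lookup("application/json") == null);
    }
}
```

Both lines print true: the first because a single instance was registered under every key, the second because nothing was registered for JSON.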
So, what about the format of the data? Of course, it still depends on the content type of your RDF data and, most importantly, it has nothing to do with the data we usually send to Solr (i.e. SolrInputDocuments serialized in some format).
The RdfDataLoader is a subclass of ContentStreamLoader:
public class RdfDataLoader extends ContentStreamLoader
and, not surprisingly, it overrides the load() method:
public void load(
        final SolrQueryRequest request,
        final SolrQueryResponse response,
        final ContentStream stream,
        final UpdateRequestProcessor processor) throws Exception {
    // The parser writes triples into the stream; we read them back from the iterator
    final PipedRDFIterator<Triple> iterator = new PipedRDFIterator<Triple>();
    final PipedRDFStream<Triple> inputStream = new PipedTriplesStream(iterator);
    // We use an executor for running the parser in a separate thread
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    final Runnable parser = new Runnable() {
        public void run() {
            try {
                RDFDataMgr.parse(
                    inputStream,
                    stream.getStream(),
                    RDFLanguages.contentTypeToLang(stream.getContentType()));
            } catch (final IOException exception) {
                ...
            }
        }
    };
    executor.submit(parser);
    // No further tasks will be submitted: the worker thread ends once parsing completes
    executor.shutdown();
    while (iterator.hasNext()) {
        final Triple triple = iterator.next();
        // create and populate the Solr input document
        final SolrInputDocument document = new SolrInputDocument();
        ...
        // create the update command
        final AddUpdateCommand command = new AddUpdateCommand(request);
        // populate it with the input document we just created
        command.solrDoc = document;
        // add the document to index
        processor.processAdd(command);
    }
}
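The method above relies on a producer/consumer handoff: the parser runs in its own thread and pushes triples, while the request thread pulls them one by one and indexes them. The sketch below reproduces that pattern with JDK classes only, so it can be run without Solr or Jena. It is not meant to mirror how Jena implements its piped iterator; PipedParsingSketch, the queue capacity of 2 and the end-of-stream marker are all assumptions made for illustration, and plain strings stand in for triples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the "parse in one thread, consume in another" pattern.
class PipedParsingSketch {

    // Unique marker object the producer sends when it has nothing more to emit.
    private static final String END_OF_STREAM = new String("END");

    /** Parses lines on a worker thread while the calling thread consumes the results. */
    static List<String> parseAndConsume(final List<String> lines) throws InterruptedException {
        // Small bounded buffer: the parser blocks whenever the consumer falls behind.
        final BlockingQueue<String> pipe = new ArrayBlockingQueue<>(2);
        final ExecutorService executor = Executors.newSingleThreadExecutor();

        executor.submit(() -> {
            try {
                for (final String line : lines) {
                    if (!line.isBlank()) {
                        pipe.put(line.trim()); // stand-in for "a triple has been parsed"
                    }
                }
                pipe.put(END_OF_STREAM);
            } catch (final InterruptedException exception) {
                Thread.currentThread().interrupt();
            }
        });
        executor.shutdown(); // no further tasks; the worker ends when parsing ends

        final List<String> consumed = new ArrayList<>();
        try {
            // Identity comparison on purpose: only the marker object ends the loop.
            for (String item = pipe.take(); item != END_OF_STREAM; item = pipe.take()) {
                consumed.add(item); // stand-in for "build a document and index it"
            }
        } finally {
            executor.shutdownNow(); // stop the parser if the consumer bailed out early
        }
        return consumed;
    }

    public static void main(final String[] args) throws InterruptedException {
        System.out.println(parseAndConsume(List.of("<s1> <p1> <o1> .", "", "<s2> <p2> <o2> .")));
    }
}
```

The bounded queue is the point of the design: the consumer can start indexing before parsing has finished, and memory use stays flat even for large inputs because the parser can never run far ahead.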
That's all. Now, once the request handler has been registered within Solr (i.e. in solrconfig.xml), and given a file containing RDF data in N-Triples format, we can send Solr a command like this:
> curl http://localhost:8080/solr/store/update -H "Content-Type: application/n-triples" --data-binary @/home/agazzarini/triples_dogfood.nt