Monday, March 14, 2011

SOLR: implementing an "Exact Match" BATH profile compliant field

If you want to expose a Z39.50 interface for your IR system, sooner or later you will probably meet the BATH profile, which is basically a set of rules that promotes standard behaviour across Z39.50 servers.
The aim of the specification is to define a list of searches (fields, attributes) that a Z39.50 server endpoint should support.
I won't discuss here how to set up a Z39.50 endpoint with SOLR behind the scenes, because there are plenty of places where you can find that information.
Instead, I will write down a brief note about the so-called "Exact Match" search, which is one of the most interesting parts of the story...
Straight to the problem. This is the specification of the "Exact Match" search:
  • Position: first in field
  • Truncation: do not truncate
  • Completeness: complete field
  • Structure: phrase
This is a pretty simple scenario with one small open question (which may still be open after you read this, because what you are reading is just my interpretation of the story, not the absolute truth). In general the definition is quite clear:
  1. you (as the server endpoint) must search the given index treating the user-entered terms (i.e. the search string) as the start of the field value (first in field)
  2. the user-entered terms are supposed to be complete words (do not truncate)
  3. what the user entered is a complete field (e.g. a complete title or author name)
So if in our index there's a document with Alessandro Manzoni as author, an exact match query will find this document if and only if the user-entered value is
Alessandro Manzoni, alessandro manzoni (assuming a minimal text analysis is done with lowercasing)
For example, the following queries won't match that document:
manzoni alessandro, Manzoni alessandro
because proximity search is not mentioned in the specification (OK, it's useful in real life, but that's another story).
Another type of text analysis that, in my opinion, could be applied without violating the specification is removing intra-word delimiters. Let's take an example. If I have
Manzoni, Alessandro.
It's hard to imagine that a user will manage an exact match query by typing exactly what is written in the index. Instead, I think it would be better to remove the intra-word delimiters (including trailing punctuation), both at index and query time, and make life easier...in this way these queries:
manzoni alessandro, manzoni,alessandro., manzoni alessandro..,manzoni.alessandro,
will find this document. So what manipulations does a SOLR field need in order to accommodate that "Exact Match" requirement?
If I had to adhere strictly to the BATH requirements, my field would be a simple string like this:
<field name="author" type="string" indexed="true"/>
but referring to the last example, only a search for
Manzoni, Alessandro. (with exact punctuation)
will match the corresponding document, and that's not exactly what we want.
Another approach would be to assign a solr.TextField type to our field. On top of that, a lowercase filter and a word tokenizer will do the rest. Let's take another example. This time the author is
Contessa Serbelloni Mazzanti Viendalmare
Following the mentioned approach this input value will be transformed in this way:
Contessa Serbelloni Mazzanti Viendalmare (original)
contessa serbelloni mazzanti viendalmare (lowercase)
contessa, serbelloni, mazzanti, viendalmare (word tokenizer)
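As a rough sketch, a field type along these lines would produce that chain. The type name text_author and the choice of solr.WhitespaceTokenizerFactory are assumptions made for illustration (any word tokenizer would do). Note that Solr always runs the tokenizer before the filters, so the lowercasing actually happens per token, with the same end result.

```xml
<!-- Sketch only: "text_author" is a hypothetical type name -->
<fieldType name="text_author" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the value into words on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercase each token -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="author" type="text_author" indexed="true"/>
```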
If a user enters the following query
"contessa Serbelloni MAZZANTI viendalmare"
a match will be found. Now, the reason why this is not sufficient for BATH profile compliance can be found in the specification of the "Exact Match" search:
  • Position: first in field
  • Truncation: do not truncate
  • Completeness: complete field
  • Structure: phrase
First in field means that, in order to match a document, the indexed field must contain the user-entered terms (in the given order, because it is a "phrase" search) in the first position. In addition, complete field means that what the user entered is supposed to be the complete value of the target field. The indexing approach followed above violates both preconditions. How? Here it is: if the user enters the following:
"MAZZANTI viendalmare"
a match with the same document will still be found, because the terms are in the target indexed field in the given order. But they aren't at the start of the field and they don't represent the whole literal value. So, even if it is a little bit better, this approach doesn't work.

Briefly, here is what I did in order to satisfy the "Exact Match" requirement. Both at index and query time, starting with


Contessa Serbèlloni, Mazzànti Viendalmarè.

a) Keyword Tokenizer

At this point I don't want to tokenize my input value, so this tokenizer basically does nothing: it leaves the value as is and emits the whole string as a single token.

b) Lowercase

As mentioned above, this is the first filter that we will apply. That will result in the following transformation:
contessa serbèlloni, mazzànti viendalmarè.

c) Diacritics replacement

Another important normalization that we will apply is diacritics replacement, both at index and query time. This will ensure that a value like àndréà gàzzarìnì is replaced with andrea gazzarini. In our example:
contessa serbelloni, mazzanti viendalmare.

d) Intra-word delimiter removal

As briefly mentioned above, this last filter removes the intra-word delimiters, including spaces. Note, because this is the important thing, that after applying this filter I'm not splitting the original input value into several tokens: there is always exactly 1 token, and it will be:
contessaserbellonimazzantiviendalmare
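Put together, steps a) to d) could be declared along these lines. This is a minimal sketch under a few assumptions: the type and field names (exact_match, author_exact) are hypothetical, solr.ASCIIFoldingFilterFactory is just one possible way to replace diacritics, and the regular expression (drop everything that is not a letter or a digit) is a guess at what should count as a delimiter.

```xml
<!-- Sketch only: "exact_match" and "author_exact" are hypothetical names -->
<fieldType name="exact_match" class="solr.TextField">
  <!-- a single analyzer without a type attribute applies at index and query time -->
  <analyzer>
    <!-- a) keep the whole value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- b) lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- c) fold diacritics, so that è becomes e and à becomes a -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- d) remove spaces and punctuation without splitting the token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^\p{L}\p{N}]+" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<field name="author_exact" type="exact_match" indexed="true"/>
```

One thing to watch: the search string has to reach this analyzer as a single value. As far as I know the default query parser splits on whitespace before analysis, so the terms should be sent as a quoted phrase or through something like the field query parser, e.g. q={!field f=author_exact}Contessa Serbelloni Mazzanti Viendalmare.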
That's all. As a last note, remember that the described chain should be applied both at index and query time. Running some examples shows that the requirements are fully satisfied.
  • Searching Contessa serbelloni mazzanti viendalmare will produce 1 result;
  • Searching serbelloni mazzanti will produce no result;
  • Searching Contessa serbelloni mazzanti will produce no result;
  • Searching Contessa serbel loni maz zanti vien dal mare will produce 1 result; OK, this could be seen as a violation and I agree with you...I'm thinking about that...in the meantime let's say that this "bug" is very useful, because you (the searcher) might not know exactly how the author name is written;
Your comments are much appreciated and of course, if you have suggestions or enhancements, give me a shout.
Gazza