When we’re out there at conferences or visiting other libraries, we’re sometimes asked why we bother “developing our own search engine”. It is a really good question, and one that we ask ourselves from time to time to make sure we’re on the right track.
One of the reasons we’ve been rolling our own instead of taking the obvious Solr road is historical: Solr didn’t meet our needs when we started. It still doesn’t in its current form, but it is getting close. Another reason is that Summa’s scope differs from Solr’s: we’ve invested a lot of energy in handling data flow through a multi-stage transformation pipeline, backed by a persistent storage service that handles parent-child relations and keeps track of updates.
The interesting thing here is that Summa and Solr are not necessarily in competition. Solr has a lot of great analyzers and has gained a lot of momentum over the last few years. We would therefore like to use Solr at the core of Summa instead of the custom Lucene searcher we currently have. At Statsbiblioteket we have done a fair amount of work on low-memory index lookup, sorting and hierarchical faceting which we’d like to take with us, so this calls for participation in Solr development.
Relevance correlation between systems
Statsbiblioteket has recently begun a collaboration with the Summon team from Serials Solutions. Due to Danish legislation we are not allowed to let Summon index our own data, so the upcoming search setup needs to merge results from Summon with our own. As both systems sort their results by relevance ranking, proper merging is quite hard, because relevance scores are not comparable between different systems. One solution is to perform a local comparison of the returned fields in the results, but as this completely ignores the term-statistics-based ranking of Lucene/Solr, we would like something better.
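To illustrate why naive merging is unsatisfying, here is a minimal sketch of interleaving two relevance-ranked result lists by min-max-normalizing their scores. All document IDs and score values are invented, and this approach shares the weakness described above: it rescales the scores but knows nothing about the term statistics behind them.

```python
def normalize(results):
    """Rescale scores within one result list to the range [0, 1]."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return [(doc, (score - lo) / span) for doc, score in results]

def merge(local_results, remote_results):
    """Normalize each list independently, then sort the union by score."""
    merged = normalize(local_results) + normalize(remote_results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores: Lucene-style floats locally, a 0-1 scale remotely.
local = [("doc_a", 12.4), ("doc_b", 7.1)]
remote = [("rec_x", 0.92), ("rec_y", 0.85)]
print(merge(local, remote))
```

The problem is visible immediately: each list’s top hit is forced to 1.0 regardless of how good it actually is, so a mediocre best match from one system outranks a strong second match from the other.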
Up until now we have been spoiled by relevance-ranked integrated search across different sources where we controlled the full indexing process. SOLR-1632 seems to be the right way to provide similar functionality for Solr, but it is not mature yet and is quite a large thing to put into production. Better to walk before running. As part of our collaborative agreement, Serials Solutions has agreed to deliver statistics about their metadata to us on an experimental basis. With these we can hopefully do better merging and issue boosted queries to get the hits that are relevant to us, thus approximating the results we would have gotten if all metadata had been in a single index.
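As a rough sketch of what such boosting could look like, assuming the delivered statistics include per-term document frequencies and a total document count (an assumption on our part), one could compute an IDF over the union of both collections and attach it as a per-term boost in a Solr-style query string. All numbers, term names and the exact boost formula below are illustrative, not part of the agreement.

```python
import math

def combined_idf(term, local_df, remote_df, local_docs, remote_docs):
    """Lucene-style IDF computed as if both collections were one index."""
    df = local_df.get(term, 0) + remote_df.get(term, 0)
    n = local_docs + remote_docs
    return 1.0 + math.log(n / (df + 1.0))

def boosted_query(terms, local_df, remote_df, local_docs, remote_docs):
    """Render a query string with a per-term boost, e.g. 'foo^14.31 bar^6.78'."""
    return " ".join(
        f"{term}^{combined_idf(term, local_df, remote_df, local_docs, remote_docs):.2f}"
        for term in terms
    )

# Invented statistics: a locally common term vs. a globally common one.
local_df = {"grundtvig": 120, "hymn": 3000}
remote_df = {"grundtvig": 15, "hymn": 250000}
print(boosted_query(["grundtvig", "hymn"], local_df, remote_df, 2_000_000, 80_000_000))
```

The point of the exercise is that a term which is rare across the combined collections gets a higher boost than one that merely looks rare in a single index, which is exactly the information a purely local merge throws away.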