Virtual Integrated Search

For a while it seemed that Integrated Search with a nice Discovery Interface coupled with a large Data Well was the answer to how libraries were going to let users find and use the multitudes of material they have access to.

Many institutions have tried building their own data well (sometimes a large national one), but most have given up. Why? Primarily because publishers are unwilling to hand over data to every single data well, but also because the very concept of a very large local data well has been made (at least somewhat) obsolete by the new “Web Scale Discovery” tools – such as Summon, Primo Central and EBSCOHost.

Some libraries are fine with using the standard discovery interfaces that these services provide and some would rather use their APIs and build their own interface on top, perhaps adding tight integration with their library system and other locally developed systems.

However, this area is fast moving. Just as it looked as if pretty much all metadata would eventually be available in all these systems, EBSCO decided not to allow their metadata to be indexed outside of their own system. They will, however, allow “just-in-time” searches. It appears as if the market is fragmenting back into federated search – and the problems with federated search are well known; they are what made us all pursue integrated search to begin with.

But can we do better this time around? Fortunately the answer is yes – if we can get a little help from our friends.

There are two main problems with federated search:

  • Response times: The entire federated system is only as fast as the slowest search node.
  • Merging: There is no meaningful way to merge different result sets as they can have vastly different sorting criteria.

We can’t do a lot about the problems with response times, but fortunately the new systems are _a lot_ faster than the old ones, so hopefully it won’t be that big an issue.

Merging has, however, gotten easier, strangely enough as a side effect of making the sorting of individual results more complicated. The magic here lies in relevancy ranking and the fact that pretty much all new systems are based on the same principles and code base (i.e. Lucene/Solr).

So how does this work? The relevancy ranking of a given document for a query is based on several factors, but a major contributor is the term frequency–inverse document frequency (tf-idf) weight of the query terms.
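To make the tf-idf point concrete, here is a minimal sketch using the textbook formula (term frequency times the logarithm of collection size over document frequency). It is not the exact scoring of Lucene/Solr or of any particular discovery service, and all collection sizes and frequencies below are made-up numbers. It shows why raw scores from two indexes are not directly comparable: the same document gets a different weight depending on the statistics of the collection it sits in. Computing the weight from shared statistics is one possible way to put the scores on a common scale, which is presumably where help from the providers would be needed.

```python
import math


def tf_idf(term_freq, doc_freq, num_docs):
    """Textbook tf-idf weight of one term in one document."""
    return term_freq * math.log(num_docs / doc_freq)


# Hypothetical collection statistics for a single query term.
local_well = {"num_docs": 1_000, "doc_freq": 100}         # term is common here
remote_index = {"num_docs": 1_000_000, "doc_freq": 100}   # term is rare here

term_freq = 3  # the term occurs three times in the document being scored

# The same document scored against each collection's own statistics.
local_score = tf_idf(term_freq, local_well["doc_freq"], local_well["num_docs"])
remote_score = tf_idf(term_freq, remote_index["doc_freq"], remote_index["num_docs"])

# Scored against the combined statistics, as if both collections
# lived in one large data well.
combined_docs = local_well["num_docs"] + remote_index["num_docs"]
combined_freq = local_well["doc_freq"] + remote_index["doc_freq"]
shared_score = tf_idf(term_freq, combined_freq, combined_docs)

print(f"local: {local_score:.2f}  remote: {remote_score:.2f}  shared: {shared_score:.2f}")
```

With these numbers the remote index rates the document much higher than the local well does, purely because of collection statistics, while the shared score falls between the two.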

We have chosen to call this concept Virtual Integrated Search, as the end result is (potentially) virtually indistinguishable from having a large local data well coupled with a true integrated search. For the Well11 and code4lib 2011 conferences we have prepared a first stab at an implementation and integration with our existing Search and Summa system. This is not much more than an internal beta version where a primary focus of the frontend has been to make it possible to tweak the way merging is performed.

What will it look like? The main purpose for us is to make it look to the user as if all the data is retrieved from a single source. Currently we present results from Summon in a box of its own but the plan is to simply go back and present all results in a single list.
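As a rough illustration of the single-list idea, the sketch below interleaves two already-ranked result lists by score, with an optional per-source weight as one conceivable knob for tweaking the merge. This is not our actual implementation: the function name `merge_results`, the source labels, the scores and the titles are all hypothetical, and it assumes the scores from the sources are already on a comparable scale.

```python
import heapq


def merge_results(result_lists, weights=None):
    """Merge per-source hit lists into one list ordered by weighted score.

    result_lists maps a source name to a list of (score, title) tuples,
    each list already sorted by descending score. weights maps a source
    name to a positive multiplier (default 1.0).
    """
    weights = weights or {}
    adjusted = []
    for source, hits in result_lists.items():
        weight = weights.get(source, 1.0)
        # A positive weight keeps each list in descending order.
        adjusted.append([(score * weight, source, title) for score, title in hits])
    # reverse=True tells heapq.merge that the inputs are sorted descending.
    return list(heapq.merge(*adjusted, key=lambda hit: hit[0], reverse=True))


# Hypothetical hits from a local index and from a remote service.
results = {
    "local": [(0.9, "Local record A"), (0.4, "Local record B")],
    "summon": [(0.8, "Remote article C"), (0.5, "Remote article D")],
}

merged = merge_results(results)
damped = merge_results(results, weights={"summon": 0.6})

for score, source, title in merged:
    print(f"{score:.2f}  [{source}]  {title}")
```

Lowering the weight for one source pushes its hits down the combined list without changing their internal order.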

This is something we will be working on and experimenting with in the coming months and we are quite excited about the possibilities.

