What did I mean?

It has been a long-standing wish to get a good did-you-mean service shipped with Summa. By “did-you-mean service” I mean the helpful little tip that shows up underneath the text entry when you mistype something in a search. Note that I say “mistype” and not “misspell”, because a good did-you-mean service is a lot more complex than a spell checker.

Consider what happens when I read my wish list aloud to my mom over the phone and try to explain that I badly want “Heroes of Might and Magic” for Christmas. The phrase being completely meaningless to her, she types this in the search field:

heroes of light and magic

Notice that this is indeed a correctly spelled phrase, but nonetheless not what she (or I) wanted. A good search engine would ask: Did you mean: “heroes of might and magic”?

On the other hand, if the search engine runs on a database of bad monochrome underworld games that doesn’t contain “Heroes of Might and Magic” but does contain a game called “Heroes of Fight and Magic”, it should of course suggest Did you mean: “heroes of fight and magic”? instead.

So we’ve identified two things we want that a normal spellchecker doesn’t provide:

  • Consider each word in a query in the context of the whole phrase it appears in
  • Only suggest stuff that is actually in the index
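
To make those two points concrete, here is a minimal Java sketch. It is purely illustrative: the class name PhraseSuggester, the maxEdits threshold and the plain in-memory list of phrases are all made up for this example, and a real service would consult the search index rather than loop over a list. It scores the query word by word against each indexed phrase with Levenshtein edit distance, so each word is judged in the context of the whole phrase, and it can only ever return a phrase that is actually in the list.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch of a phrase-level did-you-mean; not a real Lucene or Summa API. */
class PhraseSuggester {
    private final List<String> indexedPhrases; // stand-in for "what is in the index"
    private final int maxEdits;                // assumed cut-off for the whole phrase

    PhraseSuggester(List<String> indexedPhrases, int maxEdits) {
        this.indexedPhrases = indexedPhrases;
        this.maxEdits = maxEdits;
    }

    /** Closest indexed phrase, or empty if the query is already indexed or nothing is close. */
    Optional<String> suggest(String query) {
        String[] q = tokenize(query);
        String best = null;
        int bestCost = Integer.MAX_VALUE;
        for (String phrase : indexedPhrases) {
            String[] p = tokenize(phrase);
            if (p.length != q.length) {
                continue; // simplification: only compare phrases with the same word count
            }
            int cost = 0;
            for (int i = 0; i < q.length; i++) {
                cost += editDistance(q[i], p[i]); // sum of per-word distances
            }
            if (cost == 0) {
                return Optional.empty(); // exact hit, nothing to suggest
            }
            if (cost <= maxEdits && cost < bestCost) {
                best = String.join(" ", p);
                bestCost = cost;
            }
        }
        return Optional.ofNullable(best);
    }

    private static String[] tokenize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    /** Plain Levenshtein distance, keeping only two rows of the table. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With a list containing “heroes of might and magic”, the query “heroes of light and magic” is one edit away and gets that phrase back; swap the list entry for “heroes of fight and magic” and the same query yields the fight version instead. A plain spell checker would have accepted “light” and said nothing.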

The Code

After surveying what was available on the open source market, we realized that none of the solutions out there did what we wanted. I was pointed at Karl Wettin’s work on LUCENE-626. Although Karl’s work is great, it’s not compatible with the new API in Lucene >= 3.0, and it has a hardwired dependency on Berkeley DB that we could not accept. So I branched his work in order to bring it into 2010, and I am proud to say that I’ve now reached an almost-works state. You can find the code on GitHub: github.com/mkamstrup/lucene-didyoumean

What is also new here is that we are now engaging in upstream Lucene work, rather than staying in our own Summa backyard. Quite exciting, and a very rewarding experience for a software developer. Toke has some more news in this regard as well: he’s doing some upstream stuff that has far bigger implications than my odd-job did-you-mean hacking. But I’ll leave you hanging there and let Toke talk about this himself.

