Collator sorted facet results

The concept is quite simple: Instead of ordering the facet result by tag count, it is ordered by sorting the terms. Where a classic facet result might be

  • steak (102)
  • tofu (80)
  • ice cream (47)
  • pasta (47)
  • ratatouile (32)

the same search could return the term-sorted facet

  • broccoli (5)
  • crème brûlée (10)
  • ice cream (47)
  • lard (1)
  • pasta (47)

At Statsbiblioteket, we use this to perform lookups in indexed titles and author names.

What seems to be the problem?

If we just perform a faceting on the indexed terms and sort by natural order (which is ~Unicode for Lucene), this will work just fine. Provided you stick to ASCII! This constraint makes it absolutely useless for anything but a very controlled vocabulary: If we sort by Unicode, the terms crab, crème & crow will be ordered crab, crow & crème. Removing diacritics does not work for us as some of the characters in Danish, notably æ, ø & å, does not have diacritics.

What seems to be the solution?

To get proper sorting in Java, one would normally use a Collator (we prefer the ICU Collator, but Java has one build-in). So somehow we want to apply Collator-sorting to our faceting.

Collator based sorting at index open

For a long time our faceting system worked by Collator sorting all the terms when the index were (re)opened. This gave the desired result, but as both author- and title-fields tend to have a lot of unique terms (about 18 million in our main index), it took us a about 250 seconds to prepare that part of the faceting structure upon an index update.

Collator keys to the rescue?

The current prevalent sorting solution in Lucene/Solr is to index ICU collator keys. ICU collator keys are byte-representations of terms with the property that a simple byte-by-byte comparison will result in the correct order. The terms crab, crème & crow will be converted to the bytes [2b 49 27 29 01 08 01 c1 00], [2b 49 2f 3f 2f 01 84 8f 06 01 bf 00] & [2b 49 43 53 01 08 01 c1 00] with the default ICU collator for danish. As can be seen, byte number 3 in each of those keys are the first byte that is not equal and those bytes (27, 2f & 43) does indeed signify the correct order for the input.

This works very well for sorting of documents, but does not work for faceting. The problem is that term->collationKey is a one-way transformation. We get the tags in the right order, but they are not human readable.

Extended collator keys to the rescue!

The solution is deceptively simple: Just store an UTF-8 representation of the original terms after their keys. Thus the terms crab, crème & crow will be converted to the bytes [2b 49 27 29 01 08 01 c1 00 63 72 61 62], [2b 49 2f 3f 2f 01 84 8f 06 01 bf 00 63 72 c3 a8 6d 65] & [2b 49 43 53 01 08 01 c1 00 63 72 6f 77].

The natural sort order is preserved and we can now extract the original term. This all works due to the nice property for ICU collator keys that they are always terminated with a 00. The 00 acts as terminator for comparison and as delimiter between the key and the term.

It actually works

There is a slight index-time penalty as the keys needs to be created and a slight increase in index size, but those are both academical. The change in the time used for creating the facet structures for title and author is substantial: That time fell from 250 seconds to 86 seconds.

Yes, the code is open source. Right now it must be digged out from the Exposed module in the Summa project, but a proper Lucene patch will be generated Real Soon Now.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.
This entry was posted in Faceting, Hacking, Lucene. Bookmark the permalink.

4 Responses to Collator sorted facet results

  1. jrochkind says:

    I’ve been looking at this too, for exactly facets too — and I had been thinking of storing original representation after the sort key too. I hadn’t tried it out yet, just had been thinking it through, very helpful to have your report.

    I would be thrilled if this got into Lucene (and eventually usable from Solr too?). Me, not being much of a Java expert, I was thinking of just converting to ICU sort key _outside_ of Solr in my own indexing routines, and then just submitting the sort key to Solr.

    An additional wrinkle though, is that I’d really like the ability to use facet.prefix on these properly sorted facet values too — I _think_ this sort key approach ruins that. Can you think of any way around it? It seems possible that converting the facet.prefix _query_ to an ICU sort key before running it might do the right thing… but it’s confusing to think about, and I don’t know if it can be counted upon.

  2. Toke Eskildsen says:

    I know that facet prefix will work, as we use the same functionality for index lookups (which are really just facet prefix queries, with the addition that we allow for a negative offset in order to provide the entries prior to the prefix). As you suggest, the trick is to generate a collator key for the lookup term.

    For pure faceting, the collatorKey+term looks really promising. I also like it for sort fields as it allows the sort value to be delivered with the search result, which is extremely handy when doing federated search.

    One downside is that a field with collatorKey+term works bad for regular queries. Plain term based searches are okay, as the input is automatically converted to collatorKey+term, but prefix- and range-queries does not work at all, as they are not analyzed. I guess they could be handled explicitly, but it seems to me that it would require a major amount of special purpose hooks. And fuzzy searches… I don’t even know where to begin.

    I talked to Robert Muir (okay, discussed on the mailing list) about using the collatorKey+term trick in Solr and he pointed out that while the idea is sound, the current Solr api does not allow for user-specified pre-processing of the input before the collatorKey is generated. This lack of normalization means that e.g. CD, cd and Cd will be represented as different entries, which makes it unusable for faceting. As we’re using a custom Analyzer, we made it normalize the way we wanted it, but that does not work as a general solution.

    …I really should open a JIRA issue for this.

  3. jrochkind says:

    Interesting, thanks.

    Yeah, it doesn’t seem like a problem that you can’t use these normalized fields for searching — I usually create seperate fields for searching vs facetting anyway, de-normalization is kind of the order of the day with lucene (or other similar products).

  4. jrochkind says:

    Have you gone anywhere else with this idea in the meantime?

    I’ve been thinking about it and exploring it more — and, initially, the unicode collation algorithm does not, surprisingly, seem to have the right properties to create the intended results for my use case, especially with regard to facet prefix searches based on generating a collator key for the lookup term.

    But your use case may be somewhat different. Sorry for being vague, this stuff is difficult to talk about comprehensibly.

    Have you been using this technique in production, using Unicode collation algorithm? Is it working out for you? I’d love some more details about the case you mention of “index lookups (which are really just facet prefix queries, with the addition that we allow for a negative offset in order to provide the entries prior to the prefix).”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s