The concept is quite simple: Instead of ordering the facet result by tag count, it is ordered by sorting the terms. Where a classic facet result might be
- steak (102)
- tofu (80)
- ice cream (47)
- pasta (47)
- ratatouile (32)
the same search could return the term-sorted facet
- broccoli (5)
- crème brûlée (10)
- ice cream (47)
- lard (1)
- pasta (47)
At Statsbiblioteket, we use this to perform lookups in indexed titles and author names.
What seems to be the problem?
If we just perform a faceting on the indexed terms and sort by natural order (which is ~Unicode for Lucene), this will work just fine. Provided you stick to ASCII! This constraint makes it absolutely useless for anything but a very controlled vocabulary: If we sort by Unicode, the terms crab, crème & crow will be ordered crab, crow & crème. Removing diacritics does not work for us as some of the characters in Danish, notably æ, ø & å, does not have diacritics.
What seems to be the solution?
To get proper sorting in Java, one would normally use a Collator (we prefer the ICU Collator, but Java has one build-in). So somehow we want to apply Collator-sorting to our faceting.
Collator based sorting at index open
For a long time our faceting system worked by Collator sorting all the terms when the index were (re)opened. This gave the desired result, but as both author- and title-fields tend to have a lot of unique terms (about 18 million in our main index), it took us a about 250 seconds to prepare that part of the faceting structure upon an index update.
Collator keys to the rescue?
The current prevalent sorting solution in Lucene/Solr is to index ICU collator keys. ICU collator keys are byte-representations of terms with the property that a simple byte-by-byte comparison will result in the correct order. The terms crab, crème & crow will be converted to the bytes [2b 49 27 29 01 08 01 c1 00], [2b 49 2f 3f 2f 01 84 8f 06 01 bf 00] & [2b 49 43 53 01 08 01 c1 00] with the default ICU collator for danish. As can be seen, byte number 3 in each of those keys are the first byte that is not equal and those bytes (27, 2f & 43) does indeed signify the correct order for the input.
This works very well for sorting of documents, but does not work for faceting. The problem is that term->collationKey is a one-way transformation. We get the tags in the right order, but they are not human readable.
Extended collator keys to the rescue!
The solution is deceptively simple: Just store an UTF-8 representation of the original terms after their keys. Thus the terms crab, crème & crow will be converted to the bytes [2b 49 27 29 01 08 01 c1 00 63 72 61 62], [2b 49 2f 3f 2f 01 84 8f 06 01 bf 00 63 72 c3 a8 6d 65] & [2b 49 43 53 01 08 01 c1 00 63 72 6f 77].
The natural sort order is preserved and we can now extract the original term. This all works due to the nice property for ICU collator keys that they are always terminated with a 00. The 00 acts as terminator for comparison and as delimiter between the key and the term.
It actually works
There is a slight index-time penalty as the keys needs to be created and a slight increase in index size, but those are both academical. The change in the time used for creating the facet structures for title and author is substantial: That time fell from 250 seconds to 86 seconds.
Yes, the code is open source. Right now it must be digged out from the Exposed module in the Summa project, but a proper Lucene patch will be generated Real Soon Now.