Caveat lector: This is highly experimental. Don't raise your arms too high before my findings have been verified by more competent people than me.
Secondary warning: Geek level is high.
Collator (read: Locale) based String sorting in Lucene is normally handled by keeping the Strings from a given field in memory. We know that this is quite costly (~50*#strings + 2*#chars bytes, or something like 100MB for 1 million Strings with an average length of 25 characters). One known workaround is preprocessing with a parallel persistent structure, but this introduces a more complicated workflow and slows indexing speed.
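A back-of-the-envelope helper illustrating that estimate (a hypothetical sketch; the constants are the rough per-String object overhead and Java's 2 bytes per char):

    // Rough cost of keeping the sort Strings in memory: ~50 bytes of
    // per-String overhead plus 2 bytes for each character.
    static long inMemorySortBytes(long strings, long avgChars) {
        return 50L * strings + 2L * strings * avgChars;
    }
    // inMemorySortBytes(1000000, 25) == 100000000 bytes, i.e. the ~100MB above.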
A solution and a new problem
When we look at the SegmentReader from Lucene 3.0 trunk (20100318), we see that the terms can be accessed by ordinal by tweaking the code just a little bit. All that is needed to make a segment-based, memory-efficient sort is to make an array sortedTerms with the ordinals and do a Collections.sort with a comparator that looks up the Term for each ordinal. When we have the sorted terms, we run through them to determine the docIDs and create an array where sortedIDs[docID] == sortedTermIndex. When a sorted search is started, the order of two docIDs can be determined by comparing sortedIDs[docIDA] to sortedIDs[docIDB]. Delivering the actual String for a hit is done by looking up the term at ordinal sortedTerms[sortedIDs[docID]].
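In code, the idea looks roughly like this. This is a sketch only: termForOrdinal(int) and docIDsForOrdinal(int) stand in for the tweaked SegmentReader access and are not existing Lucene methods.

    import java.text.Collator;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    int[] buildSortedIDs(int termCount, int docCount, final Collator collator) {
        // Sort the term ordinals in Collator order. Each comparison looks
        // up the two terms through the (hypothetical) ordinal access.
        List<Integer> sortedTerms = new ArrayList<Integer>(termCount);
        for (int ordinal = 0; ordinal < termCount; ordinal++) {
            sortedTerms.add(ordinal);
        }
        Collections.sort(sortedTerms, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return collator.compare(termForOrdinal(a), termForOrdinal(b));
            }
        });

        // Invert the mapping: sortedIDs[docID] == position of the
        // document's term in the sorted order.
        int[] sortedIDs = new int[docCount];
        for (int pos = 0; pos < termCount; pos++) {
            for (int docID : docIDsForOrdinal(sortedTerms.get(pos))) {
                sortedIDs[docID] = pos;
            }
        }
        return sortedIDs;
    }
    // Comparing two hits during a sorted search is then just
    // sortedIDs[docIDA] < sortedIDs[docIDB]; the display String for a hit
    // is termForOrdinal(sortedTerms.get(sortedIDs[docID])).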
Using the PackedInts structure from LUCENE-1990, we arrive at a memory overhead for the structures of (#terms*log2(#terms) + #documents*log2(#terms)) / 8 bytes. For an index of 1 million documents with a unique Term of 25 characters for each document, this is 5MB. For 10 million, it is 55MB, where holding the Strings fully in memory would take 1GB. Temporarily we also need something like 2 * #terms * 4 bytes for sorting plus a few MB for caching.
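As a sanity check of that arithmetic (a sketch; both packed structures store term ordinals, so each value needs ceil(log2(#terms)) bits, and the real structures round up a little further):

    static long packedOverheadBytes(long terms, long documents) {
        int bitsPerValue = (int) Math.ceil(Math.log(terms) / Math.log(2));
        return (terms + documents) * bitsPerValue / 8;
    }
    // packedOverheadBytes(1000000, 1000000) == 5000000 bytes: the 5MB above,
    // at 20 bits per value.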
Nice theory, but… I tried implementing this and it failed miserably. Even with a 20,000 term cache, it took 20 minutes to open an index with 10 million unique terms, on a machine with an i7 processor and a solid state drive. The operation was a success, but the patient died.
The problem, of course, is that random access to the Terms is not very fast due to the half-streaming approach that Lucene uses. This is not a critique, as it saves a lot of space. The term sorter was Java's merge sort, and although it has high spatial locality, 20% of the lookups were cache misses. More than enough to confine usage of the method to cases where the index is rarely re-opened.
A solution to the new problem
So, how do we go about replacing random-access sorting with sequential-access sorting? It turns out that this is really simple. Okay, simple as balancing an egg on its pointy end, I guess.
- Sort the array of ordinals in chunks that fit the cache size exactly. The Java merge sort is beautiful here, as it never (okay, this is just a guess – I need to verify it) requests the term at position x before it has requested the term at position x-1. This means perfect sequential access and very fast sorting. We now have #terms / cacheSize chunks, each sorted in Collator order.
- Merge all the chunks at once, using a heap to determine which chunk to take the next term from (a sketch follows after this list). In theory this means a lot of random access. However, the terms in a Lucene index are already sorted in Unicode order, so in reality there is very little switching between chunks and a lot of sequential access.
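Phase 1 is just the Collator-based sort from before, applied to each cache-sized slice of the ordinal array. Phase 2 could look roughly like this (a sketch: Chunk is a hypothetical cursor over one sorted chunk, and termForOrdinal is again the hypothetical SegmentReader lookup):

    import java.text.Collator;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Hypothetical cursor over one chunk of ordinals, already sorted in
    // Collator order by phase 1.
    final class Chunk {
        final int[] ordinals;
        int pos = 0;
        Chunk(int[] ordinals) { this.ordinals = ordinals; }
        int current() { return ordinals[pos]; }
        boolean advance() { return ++pos < ordinals.length; }
    }

    // Phase 2: merge the sorted chunks. The heap is ordered by each
    // chunk's current term, so polling always yields the globally next
    // term in Collator order.
    static int[] mergeChunks(List<Chunk> chunks, final Collator collator,
                             int totalTerms) {
        PriorityQueue<Chunk> heap = new PriorityQueue<Chunk>(chunks.size(),
            new Comparator<Chunk>() {
                public int compare(Chunk a, Chunk b) {
                    return collator.compare(termForOrdinal(a.current()),
                                            termForOrdinal(b.current()));
                }
            });
        heap.addAll(chunks);
        int[] merged = new int[totalTerms];
        for (int i = 0; i < totalTerms; i++) {
            Chunk top = heap.poll();
            merged[i] = top.current();
            if (top.advance()) {
                // As the input is already near-sorted, the re-inserted
                // chunk rarely sinks far: mostly sequential term reads.
                heap.add(top);
            }
        }
        return merged;
    }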
Does it work?
Initializing the sort structure for 10 million unique terms takes about 6 minutes in the current three-day-hack implementation, using the laptop's conventional 7200 RPM hard disk.
This is still a non-trivial amount of time, but here's the real eye opener: Loading all the Strings into RAM and calling Arrays.sort(myStrings, myCollator) takes 10 minutes.
Storage-based sorting faster than in-memory? How is that possible? First of all, retrieving the Strings from storage takes comparatively little time (think a few seconds) as long as we do it sequentially. Second, when we merge the chunks using the heap, we observe that values pushed onto the heap very rarely go deeper than the first or second level: The Strings are already in Unicode order, which is fairly well in sync with locale-based sorting (for Danish at least, but I'll bet a cake that it holds for most locales). The heap-sorting step is in reality very close to O(n). By exploiting the existing order, we gain 40% speed over the generic sort (YMMV).
Where’s the catch?
As the terms are not in-memory, a high number of requests for the actual Strings takes its toll on IO. For a simple search, where we just want to return the top 20 hits, there are no problems at all. For a faceting system delivering thousands of scattered terms from different fields, this hurts on a conventional hard disk. But hey, we're all using SSDs from now on, right?
A secondary catch is that the processing-time payment is up front. However, if we don't do it at startup, the users have to pay the price when they search. Keeping in mind that Lucene also uses a heap for collecting hits (and not the 10-minute mergesort mentioned above), this means we're talking about ~600 ms for doing a sorted search with 1 million hits on my machine. Using the low-memory approach, it takes ~60 ms. Slow sorted searches or long startup time? You decide.
Don’t tell, show!
I've created a Lucene fork at http://github.com/tokee/lucene/ with the tag lucene_3_0_exposed. Check it out, do an ant and run java -cp /home/te/projects/lucene/build/lucene-core-3.1-dev.jar org.apache.lucene.index.ExposedPOC. Or you could just download the jar and skip the compiling.
The POC stands for Proof Of Concept. It is guaranteed to be buggy and only supports optimized indexes. The code is ugly, with obvious optimizations not implemented. [Insert more warnings at will]. It should be able to demonstrate the point, though. Try it on an index with millions of terms to sort on to really see the action.
The sky is the limit
The implications of the sorted-ordinal-array approach are not limited to sorting. Doing a little dance and hitting the collector right there means that we can use it for faceting and index lookup too. But that's a story for another blog post.
Moving towards the real world
It'll take a lot of work to expose this way of sorting to the uppermost API. Before I venture further on this project, I want to be doubly sure that it is feasible and that I haven't missed something obvious. Thus this project is on the back-burner until the greater minds at Lucene-dev have had their say.
Thanks to Mads Villadsen and other co-workers for listening to my ramblings on the subject and giving valuable input. Thanks to the State and University Library for allowing me to make contributions to Open Source projects in work-time.
Toke Eskildsen, IT-developer at the State and University Library, Denmark