Each time a user issues a search in our primary corpus, we perform faceting on 19 different fields. Some of those fields have a low amount of unique values (year, language, classification), some of them are quite heavy (author, title, semi-free keywords). We have a grand total of 38 million unique tags and 160 million references from 11 documents to the tags displayed as part of the faceting.
The way our faceting works is simple in principle: When a search is performed, an array of counters (a basic
int) is allocated. The array contains a counter for each unique tag (38 million in our case). All the internal IDs for the documents from the full search result are extracted and iterated. For each document ID, the reference structure provides the tag IDs, which corresponds exactly to entries in the counter array. For each tag ID, the counter for said tag is incremented. When all document IDs has been iterated, the counters are iterated and the ones with the highest counts are extracted.
If you followed that, congratulations. So, performance-wise, what is wrong with that approach? Yeah, the title was kind of a giveaway. Even if the search results in just a single hit, with just a single tag, we still need to iterate the full 38 million counters in order to extract the facet result. We need to clear it too, before the next run, so we're really talking 2 complete iterations, one of them involving sorting logic. Ouch. Or to be more precise: About 300ms of ouch.
So what do we do? Well, if we know that our result set is small, we could use a simple HashMap to hold our counters; with tag-IDs as keys and the counters themselves as values. We tried that some years ago. It did sorta-kinda work, but that approach had significant drawbacks:
- HashMaps are memory pigs and they tax the garbage collector. We do not want to insert hundreds of thousands of objects into them, when they are just used for highly temporary counting.
- We need to guess the size of the result from the start. Not just the number of hits in the search result, but the number of tags that they refer to collectively so as not to get unvieldy HashMaps. If we guess wrong, we need to start over or copy the values from the map into our full counting structure.
We abandoned our HashMap based sparse counter approach as our experiments showed that the dumb "just iterate everything all the time" performed better for most of our searches.
Summing up the requirements for a faceting system where tag extraction performance is dependent on the number of found tags, rather than using a fixed amount of time:
- It should work without knowing the number of tags in advance.
- It should not tax the heap nor the garbage collector excessively
- Extraction time should be linear (we accept
for sorted extraction) to the number of marked tags
Mikkel Kamstrup Erlandsen kindly pointed me to the article Using Uninitialized Memory for Fun and Profit. With a little tweaking, a simplified version should satisfy all the requirements. We will build a counting structure that can switch seamlessly from sparse to non-sparse representation.
For this, we introduce yet another array: The tag count tracker. It holds pointers into the tag count array from before. Its length is the cutoff for when to use sparse counting and when to use full counting and must be determined by experiments.
When the count for a tag for a document needs to be incremented, we start by loading the old count from our tag count array (we need to do this anyway in order to add 1 to the count). If the count is 0, the position of the counter is added to the tag count tracker. If this overflows the tag count tracker, we switch to standard counting and completely ignore the tag count tracker hereafter. Either way, the value from the tag count array is incremented and stored back into the array as we would normally do.
When all the tags for all the documents has been iterated, the tag count tracker (if not overflowed) contains a complete list of all the tags that has a count of 1 or more. The tag extracter needs only iterate those and, just as important, needs only clear those. If the tag count tracker was overflowed, the old iterate-everything approach is used. As for clearing the tag count tracker for next use, it is simply a matter of setting its logical length (a single int) to 0. Presto!
Now it just needs to be implemented.