Our home brew Sparse Faceting for Solr is all about counting: When calculating a traditional String facet such as
Author
- H.C.Andersen (120)
- Brothers Grimm (90)
- Arabian Nights (7)
- Carlo Collodi (5)
- Ronald Reagan (2)
the core principle is to have a counter for each potential term (author name in this example) and increment that counter for each document with that author. There are different ways of handling such counters.
Level 0: int
Stock Solr uses an int to keep track of the term counts, meaning that each unique term takes up 32 bits or 4 bytes of memory for counting. Normally that is not an issue, but with 6 billion terms (divided between 25 shards) in our Net Archive Search, this means a whopping 24GB of memory for each concurrent search request.
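In sketch form, the principle is simply one int per unique term (names here are illustrative, not Solr’s actual implementation):

  /** Sketch: plain int counting, 32 bits per counter no matter the maxCount. */
  static int[] countFacets(int[] termOrdinalPerDoc, int uniqueTerms) {
    int[] counters = new int[uniqueTerms];
    for (int ordinal : termOrdinalPerDoc) {
      counters[ordinal]++; // +1 for each document with that term
    }
    return counters;
  }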
Level 1: PackedInts
Sparse Faceting tries to be clever about this. An int can count from 0 to 2 billion, but if the maximum number of documents for any term is 3000, there will be a lot of wasted space. 2^12 = 4096, so in the case of maxCount=3000, we only need 12 bits/term to keep track of it all. Currently this is handled by using Lucene’s PackedInts to hold the counters. With the 6 billion terms, this means 9GB of memory. Quite an improvement on the 24GB from before.
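A minimal sketch of the allocation, assuming Lucene’s org.apache.lucene.util.packed.PackedInts API (the real Sparse Faceting code does considerably more):

  import org.apache.lucene.util.packed.PackedInts;

  /** Sketch: counters sized to the actual maxCount instead of a full int. */
  static PackedInts.Mutable allocateCounters(int uniqueTerms, long maxCount) {
    int bitsPerCounter = PackedInts.bitsRequired(maxCount); // 12 for maxCount=3000
    return PackedInts.getMutable(uniqueTerms, bitsPerCounter, PackedInts.COMPACT);
  }

As PackedInts.Mutable has no increment method, counting becomes counters.set(ordinal, counters.get(ordinal) + 1).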
Level 2: Long tail PackedInts with fallback
Packing counters has a problem: The size of all the counters is dictated by the maxCount. Just a single heavily used term can nullify the gains: If all documents share one common term, the size of the individual counters will be log2(docCount) bits. With a few hundred million documents, that puts the size at 27-29 bits/term, very close to the int representation’s 32 bits.
Looking at the Author example at the top of this page, it seems clear that the counts for the authors are far from equal: The top-2 authors have counts a lot higher than the bottom-3. This is called a long tail and it is a very common pattern. It means that the overall maxCount for the terms is likely to be a lot higher than the maxCount for the vast majority of the terms.
While I was loudly lamenting all the wasted bits, Mads Villadsen came by and solved the problem: What if we keep track of the terms with high maxCount in one structure and the ones with a lower maxCount in another? Easy enough to do with a lot of processing overhead, but tricky to do efficiently. Fortunately Mads also solved that (my role as primary bit-fiddler is in serious jeopardy). The numbers in the following explanation are just illustrative and should not be seen as final.
The counter structure
We have 200M unique terms in a shard. The terms are long tail-distributed, with the most common ones having maxCount in the thousands and the vast majority with maxCount below 100.
We locate the top-128 terms and see that their maxCounts range from 2921 to 87. We create an array of 128 ints to keep track of their counts and call it head.
The maxCount for the terms below the 128 largest ones is 85.
2^7 = 128, so we need 7 bits to hold each of those. We allocate a PackedInts structure with 200M entries of 7+1 = 8 bits and call it tail.
The tail has an entry for all terms, including those in head. For each of the large terms in head, we locate its position in tail. At that tail counter, we set the value to the term’s index in the head counter structure and set the highest bit to 1.
Let’s say that head entry term h_0 is located at position 1 in tail, h_126 is located at position 199,999,998 and h_127 is located at position 199,999,999. After marking the head entries in the tail structure, it would look like this:
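In sketch form, assuming Lucene’s PackedInts API and the illustrative positions above:

  // tail after marking the head entries:
  // index:   0          1            ...  199,999,998    199,999,999
  // value:   0_0000000  1_0000000    ...  1_1111110      1_1111111
  //          (count 0)  (-> head[0])      (-> head[126]) (-> head[127])

  static final int HEAD_BIT = 1 << 7; // bit 7 flags "this entry points into head"

  /** Sketch: mark the entries of the head terms in tail. */
  static PackedInts.Mutable buildTail(int uniqueTerms, int[] headOrdinals) {
    // 7 bits for counts up to 127, plus the head flag bit = 8 bits/term
    PackedInts.Mutable tail = PackedInts.getMutable(uniqueTerms, 8, PackedInts.COMPACT);
    for (int headIndex = 0; headIndex < headOrdinals.length; headIndex++) {
      tail.set(headOrdinals[headIndex], HEAD_BIT | headIndex);
    }
    return tail;
  }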
Hanging on so far? Good.
Incrementing the counters
- Read the counter value from the tail structure: count = tail.get(ordinal)
- Check if bit 7 is set: if ((count & 128) != 0)
- If bit 7 is set, increment the head counter: head.inc(count & 127)
- If bit 7 is not set, increment the tail counter: tail.set(ordinal, count + 1)
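Putting the steps together, with head as the int array and HEAD_BIT = 1 << 7 as in the setup sketch (PackedInts.Mutable has no inc method, so the tail update is a get + set):

  /** Sketch: increment the counter for a term ordinal. */
  static void increment(int ordinal, PackedInts.Mutable tail, int[] head) {
    long count = tail.get(ordinal);
    if ((count & HEAD_BIT) != 0) {  // bit 7 set: the entry points into head
      head[(int) (count & 127)]++;  // the lower 7 bits index into head
    } else {                        // plain term: the count lives in tail
      tail.set(ordinal, count + 1);
    }
  }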
Pros and cons
In this example, the counters take up 6 billion * 8 bits + 25 * 128 * 32 bits = 6GB. The performance overhead, compared to the PackedInts version, is tiny: Whenever a head bit is encountered, there will be an extra read to get the old head value before writing the value+1. As head will statistically be heavily used, it is likely to be in Level 2/3 cache.
This is just an example, but it should be quite realistic, as approximate values from the URL field in our Net Archive Index have been used. Nevertheless, it must be stressed that the memory gains from long tail PackedInts are highly dictated by the shape of the long tail curve.
It is possible to avoid the extra bit in tail by treating the large terms as any other term, until their tail counters reach the maximum (127 in the example above). When a counter’s max has been reached, the matching head counter can then be located using a lookup mechanism, such as a small HashMap or maybe just a linear scan through a short array with the ordinals and counts for the large terms. This would reduce the memory requirements to approximately 6 billion * 7 bits = 5.3GB. Whether this memory/speed trade-off is better or worse is hard to guess and depends on result set size.
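A sketch of that variant, assuming a saturated tail counter simply stays at its maximum and further increments are routed to head via a HashMap (the read-time convention of 127 + head value is one possible choice, not a settled design):

  import java.util.Map;

  /** Sketch: flag-free variant where tail counters saturate at 127. */
  static void incrementSaturating(int ordinal, PackedInts.Mutable tail, int[] head,
                                  Map<Integer, Integer> headIndexByOrdinal) {
    long count = tail.get(ordinal);
    if (count < 127) {  // room left in the 7-bit tail counter
      tail.set(ordinal, count + 1);
    } else {            // saturated: must be one of the large terms
      head[headIndexByOrdinal.get(ordinal)]++;
      // At read time, the total for such a term would be 127 + its head value.
    }
  }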
The long tail PackedInts could implement the PackedInts interface itself, making it usable elsewhere. Its constructor could take another PackedInts filled with maxCounts, or a histogram of max-bit requirements.
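One possible, purely hypothetical shape:

  /** Hypothetical skeleton: a long tail structure behind a plain get/set interface. */
  class LongTailMutable {
    LongTailMutable(PackedInts.Reader maxCounts) {
      // Derive head size and tail bits/value from the maxCount distribution.
    }
    // get(index) and set(index, value) would dispatch to head or tail
    // as in the increment sketch above.
  }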