Warning: Bit-fiddling ahead.
The initial driver for implementing Sparse Faceting was to have extraction-time scale with the result set size, instead of with the total number of unique values in the index. From a performance point of view, this works very well in the current implementation: A side car structure keeps track of which counters has been updated. It is a simple array of pointers (integers really) and sampling indicates that a size of about 8% of the total amount of counters works okay.
So 8%, expressed as Java integers? That is
Less abstract, we have a use case with 600M values/shard. Rounding each counter up to nearest power of 2, the theoretical lower bound of memory needed for representing the counters are all the bits that can change during updates. For a sample shard that is about 140MB.
N-plane counters for the same shard takes up 144MB + 157MB, where the first 144MB are shared between all counter instances. So they are pretty close to the theoretical lower bound – at least the extra instances. Note: They can be forced to be practically at the lower bound, but that impacts performance.
Back to the tracker, because N-plane needs one to scale its performance with result set size. Tracker size was
4*0.08*#values bytes, which for the test shard is 192MB. For those extra counter instances, the tracker ends up using more memory that the counters it tracks.
What we need is something better to efficiently keep track of the touched values, where efficient is measured both in time and space. In fancy talk, that is a succinct data structure.
With n-plane counters, each bit of a given counter is stored on a separate plane (at least conceptually): If a counter has a max of 5, it needs 3 bits to store the count and is thus spread across the lower 3 planes.
Suppose we treated the lowest plane separately from the rest of the planes: Instead of supplying the bit at position 0 for the usual binary representation, it simply states if 1 should be added to the number or not. Confused? Let’s represent the numbers from 0 to 8 in standard binary as well as our new system:
Decimal 0 1 2 3 4 5 6 7 8 Binary 0 1 10 11 100 101 110 111 1000 Binary+ 0 1 11 101 111 1001 1011 1101 1111
Note that the representation is not quite as compact as true binary; e.g. the tipping point for going from 3 to 4 planes is from 7 to 8 for binary, but from 4 to 5 for binary+. Loosely estimated, it will cost 1 extra bit/value to use the new representation. However, there is zero extra cost for counters with maximum value 1. It turns out that most of the outgoing links in our net archive corpus are unique, so about 400M of the 600M satisfies this. So the extra memory cost of the extended binary will be 200M / 8 bits/byte = 25MB.
Having a not-zero flag solves a performance issue with the current n-plane counters, but it also opens up for…
Bitmap based sparse tracking
The goal here is to quickly track & locate the counters that have a non-zero value. Underneath the hood, a plane in a n-plane counter is just an array of longs. With implicit markers, checking whether there is one or more counters with non-zero values in a group of 64 counters is a simple matter of checking whether a long is equal to 0.
With our 600M counters, that means about 10M checks, reading 75MB data sequentially from heap. Not bad, but still a substantial fixed overhead.
So what if we create a bitmap with a single bit representing each long in the backing structure of the first plane. If the backing long has 0 updated counters, the bitmap entry is 0, for everything else it is 1. That would require
#counters/64 bits or 1.2MB. Locating the non-zero counters then becomes a matter of reading 18K longs sequentially and checking for 0. Now the fixed overhead seems manageable.
The bitmap would of course have to be maintained during updates, introducing an extra get-set to the act of incrementing a counter. However, the old tracker used a single set (and simpler logic), so the difference is not that great.
New idea vs. tried and true
Memory wise the new tracker would need about 26.2MB for 600M counters, where the old one needs 192MB. An obvious improvement.
For very low hit counts, the new tracker would loose: A fixed overhead of 18K checks is still more work than loading a few values directly from an array.
The complexity of the new tracker is far higher, but it might turn out to be fastest in most cases as iteration of the tracked counters are guaranteed to be in sequential order, where the existing tracker is random order.