N-plane packed counters for faceting

Faceting in Solr works well out of the box up to some millions of unique values in the facet field. There is a small performance penalty linear to the number of unique values, which begins to show after some point. Sparse faceting solves this and makes faceting performance linear to the result size. Sparse faceting also reduces memory requirements by packing the values tighter. As seen in the blog post Long tail packed counters for faceting, this packing can be improved, depending on the concrete term distribution in the facet field.

An implementation of Long tail packed counters has been created and renamed to Dual-plane counters. During development, a new idea sprung to life in the form of N-plane counters. In this blog post the different counters will be explained and performance measurements from a micro-benchmark (yes, we know, they are not very reliable) based on real-world data will be presented.

Quick re-cap from the long tail blog post:

  • We have hundreds of millions of counters, referenced by ordinal
  • Each counter corresponds to a unique term in a facet field
  • We know the maximum for each individual counter
  • The overall distribution of the maxima is long tail
  • We want the overall counter structure to be small
  • We want the overall counter structure to be fast

Instead of viewing the maxima as plain numbers, they can be viewed as required bits. If a counter has a maximum of 15, all possible values can be expressed with 4 bits, as 2⁴-1 = 15. Similarly a counter with a maximum of 100 needs 7 bits, as 2⁷-1 = 127. The collection of counters can be visualized as

Term counters as bits

The counter for term A needs 1 bit, term B needs 3 bits, term C needs 1 bit and term D needs 11 bits. If we sorted the terms according to maxima instead of alphanumeric, the distribution would look like

Shape of maxima sorted by maxima

However, as the order of the terms is dictated by Solr, sorting the terms for use with faceting would require a mapping from Solr’s term ordinals to the position in the sorted list. This option has not been explored yet, so for now we’ll stick with the original order.

Test data

We extracted the maxima for the 640M unique links in a test-shard from our net archive and grouped them by the number of bits needed for expressing those maxima. The long tail distribution is evident. The theoretically optimal worst-case representation of the counters is the collective number of bits needed (the white squares seen above). For the sample below that is 138MB.

Bits #terms
1 425,799,733
2 85,835,129
3 52,695,663
4 33,153,759
5 18,864,935
6 10,245,205
7 5,691,412
8 3,223,077
9 1,981,279
10 1,240,879
11 714,595
12 429,129
13 225,416
14 114,271
15 45,521
16 12,966
17 4,005
18 1,764
19 805
20 789
21 123
22 77
23 1

The contestants

Counter-representation with atomic numeric

Stock Solr: Counter-representation with atomic numeric (y goes to 32, only bit 1-16 shown)

Stock Solr faceting uses an int[] to hold the counts. It is simple and fast. In the visualization above, the white squares are bits holding counter values, while the red area represents bits that are always 0. There is a lot of red with this counter structure. The sample counters takes up 2442MB with Stock Solr faceting. Pro: speed, con: space.

Sparse Faceting: Counter-representation with PackedInts

Sparse Faceting: Counter-representation with PackedInts

Sparse Faceting has the option of using PackedInts for counters. It lowers the overhead ceiling, but there is still a lot of wasted space. With Sparse Faceting, 1755MB is needed for the sample counters. Pro: speed, con: space + needs to know overall maximum.

Dual-plane: Counter-representation in low-bits with references to high-bit

Dual-plane: Counter-representation in low-bits with references to high-bit

Holding low counts in one structure and switching high-counts to another structure requires 1 extra bit/value for signalling which counters has a high value (see Long tail packed counters for faceting for details). Even with the extra bit, this gives an overall size reduction. Dual-plane uses 1221MB for the sample counters. Pro: speed, con: needs histogram of maxima for initialization.

N-plane: Horizontal slicing of counters

N-plane: Horizontal slicing of counters

Extending Dual-plane to an arbitrary number of planes and replacing the explicit pointers with counting logic, N-plane worst-case representation takes up just 2x the size of the raw bits. In the illustration above, the blue boxes are bits representing overflow up to the next plane. The overflow bits are counted to calculate the offset in the higher planes.
To avoid horrible performance, a rank structure is needed, which adds a bit of overhead. Likewise, it might be advisable to limit the number of planes, to avoid too many random memory requests. Most promising space/performance setup for the sample data takes up 341MB for N-plane, with the smallest & slowest taking up 275MB. Pro: space, con: speed + needs all counter maxima for initialization.

Bonus: All the logistic parts of the structure (the blue squares and the rank-cache) are static and can thus be shared between counters. If there are more than 1 concurrent faceted search at a time, each extra counting structure only holds the theoretically smallest worst case of bits, which for this sample is 138MB.

Testing

The micro-benchmark creates 640M sample maxima, randomly generated to follow the bit-histogram from the links-field in our sample shard, and initializes the different counter structures based on this. A list of ordinals is created with random selection, checked to ensure that each unique ordinal only occurs a number of times that is within the corresponding counter’s maximum. Caveat: Random updates are not optimal as the distribution of updates is not random for real searches: It should follow the maxima-distribution. The updates are applied to each counter implementation and performance measured as the median from 9 test-runs.

N-plane is special as it can be configured in various ways. In the schema below, N-Plane(#4, 1/100) means that there are 4 planes and overflow-bits are cached for every 100 bits.

Median updates/ms of 640M counters with max bit 23, for different amounts of updates
Implementation MB 1M upd 5M upd 10M upd 50M upd 100M upd 500M upd
N-plane(#2, 1/1K) 718 20853 13340 13059 15752 12590 4000
N-plane(#4, 1/1K) 311 20887 13339 13058 15749 12554 3969
N-plane(#6, 1/1K) 275 13375 20129 19492 11187 9482 4451
N-plane(#2, 1/100) 745 20938 13548 13460 19146 17420 7674
N-plane(#4, 1/100) 341 20929 13526 13447 18874 17035 7787
N-plane(#6, 1/100) 306 20444 13317 13222 18948 17435 7212
N-plane(#2, 1/20) 865 14027 18900 18773 13261 12380 10512
N-plane(#4, 1/20) 473 13086 20533 20386 12388 11605 11554
N-plane(#6, 1/20) 440 13423 21050 20909 12769 12003 11936
Dual-plane 1221 18734 29952 29940 18797 18794 29911
PackedInts.COMPACT 1755 13518 15324 15346 13562 13565 15296
PackedInts.FAST 2442 17995 18955 19109 19123 19142 19233

Observations

  • Performance of N-plane gets worse with the number of updates, which is to be expected: As the counts gets higher, it needs to visit more planes.
  • Dual-plane has suspicious values for 50M and 100M updates, which mirrors suspicious values for COMPACT at the same amunts of updates. As COMPACT should have the same updates/ms performance independent of the number of updates, this indicates a testing problem.
  • PackedInts.FAST is a lot faster than PackedInts.COMPACT. This might be due to chance: The number of significant bits is 23, which aligns very badly with the 64 bit longs used by the COMPACT implementation. Had the number of significant bits been 21 or 24, things might have looked different.

Conclusion

The two new players Dual-plane and N-plane are bot very interesting for high-cardinality faceting: Dual-plane as a direct replacement of PackedInts.COMPACT, N-plane as a very low-memory alternative when speed is not of the essence.

The drawback of Dual-plane is that is needs a histogram of maxima. Fortunately this is not costly: It is about the same work as is needed by PackedInts.COMPACT, which requires the overall maximum.

The drawback of N-plane is a lot higher as it needs the full maxima-set for initializing. Performance-wise this is not much of a change from the PackedInts.COMPACT maximum calculation, but it does require temporary memory in the order of 4*term_count, which in this case is 2.5GB.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.
This entry was posted in eskildsen, Faceting, Hacking, Low-level, Performance, Solr. Bookmark the permalink.

One Response to N-plane packed counters for faceting

  1. Pingback: Measuring N-plane time & space | Software Development at Statsbiblioteket

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s