Faceting in Solr works well out of the box up to some millions of unique values in the facet field. There is a small performance penalty linear to the number of unique values, which begins to show after some point. Sparse faceting solves this and makes faceting performance linear to the result size. Sparse faceting also reduces memory requirements by packing the values tighter. As seen in the blog post Long tail packed counters for faceting, this packing can be improved, depending on the concrete term distribution in the facet field.

An implementation of Long tail packed counters has been created and renamed to *Dual-plane* counters. During development, a new idea sprung to life in the form of *N-plane* counters. In this blog post the different counters will be explained and performance measurements from a micro-benchmark (yes, we know, they are not very reliable) based on real-world data will be presented.

Quick re-cap from the long tail blog post:

- We have hundreds of millions of counters, referenced by ordinal
- Each counter corresponds to a unique term in a facet field
- We know the maximum for each individual counter
- The overall distribution of the maxima is long tail
- We want the overall counter structure to be small
- We want the overall counter structure to be fast

Instead of viewing the maxima as plain numbers, they can be viewed as required bits. If a counter has a maximum of 15, all possible values can be expressed with 4 bits, as 2⁴-1 = 15. Similarly a counter with a maximum of 100 needs 7 bits, as 2⁷-1 = 127. The collection of counters can be visualized as

The counter for term *A* needs 1 bit, term *B *needs 3 bits, term *C* needs 1 bit and term *D* needs 11 bits. If we sorted the terms according to maxima instead of alphanumeric, the distribution would look like

However, as the order of the terms is dictated by Solr, sorting the terms for use with faceting would require a mapping from Solr’s term ordinals to the position in the sorted list. This option has not been explored yet, so for now we’ll stick with the original order.

## Test data

We extracted the maxima for the 640M unique links in a test-shard from our net archive and grouped them by the number of bits needed for expressing those maxima. The long tail distribution is evident. The theoretically optimal worst-case representation of the counters is the collective number of bits needed (the white squares seen above). For the sample below that is **138MB**.

Bits | #terms |
---|---|

1 | 425,799,733 |

2 | 85,835,129 |

3 | 52,695,663 |

4 | 33,153,759 |

5 | 18,864,935 |

6 | 10,245,205 |

7 | 5,691,412 |

8 | 3,223,077 |

9 | 1,981,279 |

10 | 1,240,879 |

11 | 714,595 |

12 | 429,129 |

13 | 225,416 |

14 | 114,271 |

15 | 45,521 |

16 | 12,966 |

17 | 4,005 |

18 | 1,764 |

19 | 805 |

20 | 789 |

21 | 123 |

22 | 77 |

23 | 1 |

## The contestants

**Stock Solr faceting** uses an int[] to hold the counts. It is simple and fast. In the visualization above, the white squares are bits holding counter values, while the red area represents bits that are always 0. There is a lot of red with this counter structure. The sample counters takes up **2442MB** with Stock Solr faceting. Pro: speed, con: space.

**Sparse Faceting** has the option of using PackedInts for counters. It lowers the overhead ceiling, but there is still a lot of wasted space. With Sparse Faceting, **1755MB** is needed for the sample counters. Pro: speed, con: space + needs to know overall maximum.

Holding low counts in one structure and switching high-counts to another structure requires 1 extra bit/value for signalling which counters has a high value (see Long tail packed counters for faceting for details). Even with the extra bit, this gives an overall size reduction. **Dual-plane** uses **1221MB** for the sample counters. Pro: speed, con: needs histogram of maxima for initialization.

Extending Dual-plane to an arbitrary number of planes and replacing the explicit pointers with counting logic, **N-plane** worst-case representation takes up just 2x the size of the raw bits. In the illustration above, the blue boxes are bits representing overflow up to the next plane. The overflow bits are counted to calculate the offset in the higher planes.

To avoid horrible performance, a rank structure is needed, which adds a bit of overhead. Likewise, it might be advisable to limit the number of planes, to avoid too many random memory requests. Most promising space/performance setup for the sample data takes up **341MB** for N-plane, with the smallest & slowest taking up 275MB. Pro: space, con: speed + needs all counter maxima for initialization.

Bonus: All the logistic parts of the structure (the blue squares and the rank-cache) are static and can thus be shared between counters. If there are more than 1 concurrent faceted search at a time, each extra counting structure only holds the theoretically smallest worst case of bits, which for this sample is **138MB**.

## Testing

The micro-benchmark creates 640M sample maxima, randomly generated to follow the bit-histogram from the links-field in our sample shard, and initializes the different counter structures based on this. A list of ordinals is created with random selection, checked to ensure that each unique ordinal only occurs a number of times that is within the corresponding counter’s maximum. Caveat: Random updates are not optimal as the distribution of updates is not random for real searches: It should follow the maxima-distribution. The updates are applied to each counter implementation and performance measured as the median from 9 test-runs.

N-plane is special as it can be configured in various ways. In the schema below, N-Plane(#4, 1/100) means that there are 4 planes and overflow-bits are cached for every 100 bits.

Implementation | MB | 1M upd | 5M upd | 10M upd | 50M upd | 100M upd | 500M upd |
---|---|---|---|---|---|---|---|

N-plane(#2, 1/1K) | 718 | 20853 | 13340 | 13059 | 15752 | 12590 | 4000 |

N-plane(#4, 1/1K) | 311 | 20887 | 13339 | 13058 | 15749 | 12554 | 3969 |

N-plane(#6, 1/1K) | 275 | 13375 | 20129 | 19492 | 11187 | 9482 | 4451 |

N-plane(#2, 1/100) | 745 | 20938 | 13548 | 13460 | 19146 | 17420 | 7674 |

N-plane(#4, 1/100) | 341 | 20929 | 13526 | 13447 | 18874 | 17035 | 7787 |

N-plane(#6, 1/100) | 306 | 20444 | 13317 | 13222 | 18948 | 17435 | 7212 |

N-plane(#2, 1/20) | 865 | 14027 | 18900 | 18773 | 13261 | 12380 | 10512 |

N-plane(#4, 1/20) | 473 | 13086 | 20533 | 20386 | 12388 | 11605 | 11554 |

N-plane(#6, 1/20) | 440 | 13423 | 21050 | 20909 | 12769 | 12003 | 11936 |

Dual-plane | 1221 | 18734 | 29952 | 29940 | 18797 | 18794 | 29911 |

PackedInts.COMPACT | 1755 | 13518 | 15324 | 15346 | 13562 | 13565 | 15296 |

PackedInts.FAST | 2442 | 17995 | 18955 | 19109 | 19123 | 19142 | 19233 |

## Observations

- Performance of N-plane gets worse with the number of updates, which is to be expected: As the counts gets higher, it needs to visit more planes.
- Dual-plane has suspicious values for 50M and 100M updates, which mirrors suspicious values for COMPACT at the same amunts of updates. As COMPACT should have the same updates/ms performance independent of the number of updates, this indicates a testing problem.
- PackedInts.FAST is a lot faster than PackedInts.COMPACT. This might be due to chance: The number of significant bits is 23, which aligns very badly with the 64 bit longs used by the COMPACT implementation. Had the number of significant bits been 21 or 24, things might have looked different.

## Conclusion

The two new players Dual-plane and N-plane are bot very interesting for high-cardinality faceting: Dual-plane as a direct replacement of PackedInts.COMPACT, N-plane as a very low-memory alternative when speed is not of the essence.

The drawback of Dual-plane is that is needs a histogram of maxima. Fortunately this is not costly: It is about the same work as is needed by PackedInts.COMPACT, which requires the overall maximum.

The drawback of N-plane is a lot higher as it needs the full maxima-set for initializing. Performance-wise this is not much of a change from the PackedInts.COMPACT maximum calculation, but it does require temporary memory in the order of 4*term_count, which in this case is 2.5GB.