The SOLR-5894 issue “Speed up high-cardinality facets with sparse counters” is close to being functionally complete (facet.method=fcs and facet.sort=index still pending). This post explains the different tricks used in the implementation and their impact on performance.
Most of the different Solr faceting methods (fc & fcs; with and without doc-values; single- and multi-value) uses the same overall principle for counting tag occurrences in facets:
- Allocate one or more counters of total size #unique_tags
- Fill the counters by iterating a hit queue (normally a bitmap) and getting corresponding counter indexes from a mapper
- Extract top-x tags with highest count by iterating all counters
There are 3 problems with this 3 step process: Allocation of a (potentially large) structure from memory, iteration of a bitmap with #total_documents entries and iteration of a counter with #unique_tags. Ideally this would be no allocation, iteration of just the IDs of the matched documents and iteration of just the tag counters that were updated. Sparse facet counting solves 2 out of the 3 problems.
In this context sparse is seen as performance-enhancing, not space-reducing. SOLR-5894 solves the extraction time problem by keeping track of which counters are updated. With this information, the extraction process no longer needs to visit all counters. A detailed explanation can be found at fast-faceting-with-high-cardinality-and-small-result-set. However, there are some peculiarities to sparse tracking that must be considered.
The black line is Solr field faceting on a multi-valued field (3 values/document), the red line is the sparse implementation on the same field. When the result set is small, sparse processing time is markedly lower than standard, but when the result set is > 10% of all documents, it becomes slower. When the result set reaches 50%, sparse takes twice as long as standard.
This makes sense when one consider that both updating and extraction of a single counter has more processing overhead for sparse: When the number of hits rises, the accumulated overhead gets bad.
Maximum sparse size
Okay, so tracking does not make much sense past a given point. Besides, having a tracker the size of the counters themselves (100% overhead) seems a bit wasteful. Fixing the tracker size to the cross-over-point is the way to go. We choose 8% here. Thanks to the beauty of the tracking mechanism, exceeding the tracker capacity does not invalidate the collected results, it just means a logical switch to non-track-mode.
No doubt about where the sparse counter switches to non-sparse mode. Note how the distance from Solr standard (black line) to sparse with tracker-overflow (red line past 8%) is near-constant: Up until 8% there is an overhead for updating the tracker. When the tracker has overflowed that overhead disappears for the rest of the counter updates, but the cycles used for tracking up to that point are wasted.
So memory overhead was reduced to 8% and performance was better for the very high hit counts, but still quite a bit worse than Solr standard. If only we could foresee if the sparse tracker would be overflowed or not.
We cannot determine 100% whether the tracker will be blown or not (at least not in the general case), but we can guess. Under the assumption that the references from documents to tags are fairly uniformly distributed, we can use the hit count (which we know when we start facet calculation) to guess whether the number of updated tag counts will exceed the tracker capacity.
The chart demonstrates how bad guessing of the result size affects performance. The conservative guess (red line) means that many of the faceting calls are done by falling back to standard Solr and that the sparse speed-up is wasted. The optimistic guess (cyan line) has a higher risk of failed sparse-attempts, which means bad performance. In this example, the bad guess was around 10%. Still, even with such hiccups, the overall positive effect of using sparse counting is clear.
The best cut-off point for sparse/non-sparse attempt depends on the corpus and the searches, as well as the willingness to risk increased response times. Properly tuned and with a corpus without extreme outliers (such as a single very popular document referencing 10% of all tags), the result will be very satisfying.
For the low price of 8% memory overhead we get much better performance for small result sets and no penalty for larger result sets (under the assumption of correct guessing).
Doing a little instrumentation it becomes clear that it is by no means free just to allocate a new counter structure with each facet call and throw it away after use. In the example above, 5M*3*4byte = 60MB are used for a counter. With a 2GB heap and otherwise non-tweaked execution of Solr’s start.jar, the average time used to allocate the 60MB was 13ms!
An alternative strategy is to keep a pool of counters and re-use them. This means that counters must be cleared after use, but this is markedly faster than allocating new ones. Furthermore this can be done by a background thread, so that the client can get the response immediately after the extraction phase. Enabling this, the picture gets even better.
For very small result sets there is virtually no performance penalty for faceting.