This article explores the boundaries of the experimental sparse faceting code, both in terms of processing time and in terms of space requirements. The code base has just been updated and the new features are available as a Solr 4.8 compatible binary download, as well as source.
Faceting overview (you can skip this)
The core performance-critical part of sparse faceting as well as vanilla String fc-faceting for Solr can be divided in three phases:
- Count the occurrences of the terms in the facet field
- Find the X highest counters
- Resolve the Strings for the counters
Sparse faceting started with the idea of improving on #2 by only considering the counters that were actually touched by #1. Later on it improved on #3 when running distributed in SolrCloud. Now #1 is being handled by introducing multi-threaded counting.
Memory-wise there are three main faceting structures:
- A map from documents to term ordinals
- A counter structure, with 1 counter per unique term
- Structures for resolving ordinals to Strings
Memory overhead is very dependent on how the fields are indexed in Solr. The current recommendation for faceting is to index in a StrField (which stores the value verbatim) as well as enabling DocValues, which lowers memory requirements considerably. There is still a base cost of using DocValues (personal rule of thumb: 1 byte/String) with the current default DocValuesFormat, but that is allegedly hard to avoid without compromising performance significantly.
#1: In the usual case, a Solr index consists of multiple segments. In order to do faceting across these segments, a mapping from segment-ordinals to global ordinals is created. This scales linear to the number of terms in all the segments. In the special case of a fully optimized shards, there is no need for this mapping and thus the memory overhead of #1 will be zero. The Danish Net Archive Search uses fully optimized shards.
#2: Vanilla Solr uses a simple
int-array to hold the counters: Great for low- to medium-cardinality faceting, but quite costly (2.4GB per concurrent call) if you want to facet on 600M unique values. With n-plane counters for sparse faceting, the density of the counting structures is twice the theoretical lower limit for a single counter (pedantic note: It really is a bit more than twice the theoretical lower limit, as we round up to nearest power of two – when somebody implements an efficient arithmetic coding counter, give me a shout). For the shards in the Danish Net Archive, this is about 300MB. Extra counter instances, used for concurrent searches, takes up only the theoretical lower limit, or about 150MB.
To get fast calculation of the top-X counters after the counting step, an extra structure is needed, which brings the total overhead up to about 3-4 times the theoretical lower limit. This can be disabled, but with cardinality in the hundreds of millions, this will impose a very heavy performance penalty.
#3: Having paid the initial cost of using DocValues, the resolving of ordinals to Strings is practically free in terms of memory. With non-DocValued fields, some of the Strings are stored in memory to ensure performing lookups.
Multi threaded faceting
Threading aims to lower latency (response time). It only works when there is a surplus of CPU power, so the typical scenario is setups with heavy but infrequent searches. Infrequent just means that multiple concurrent searches are rare. Vanilla Solr supports multi threaded faceting in two ways:
- Multiple facet fields can be processed in parallel.
- Using facet.method=fcs, multiple segments can be processed in parallel. However, this comes with a non-trivial merge cost that involves resolving of extra Strings.
Recently in-segment threading was added to sparse faceting. Technically this also covers #2 above, with the added benefit of not needing the merge step.
The tricky part of in-segment threaded counting was to get fast thread-safe updates of the packed counter structures, but for once amount of counters helps: The counter updates are lock-free and opportunistic, using AtomicLongArrays compareAndSet, which is nearly as fast as plain array-updates. In the rare case of collisions, the thread waits the lowest amount of time possible, before trying again.
The two term occurrence counters relevant to this experiment are:
- n-plane, which is small but uses complex logic and requires multiple random memory accesses (an average of 2 reads + 2 writes) when a counter is incremented. Its size, including the sparse structure, is 500MB for this experiment.
- packed, which is a variant of Packed64Singleblock using an AtomicLongArray as backing store. It has relatively simple logic and requires only 1 read and 1 write, when a counter is incremented. It saves a bit of space compared to the 2,4GB for vanilla Solr int, but is three times larger (1,5GB) than n-plane.
As we have recently started re-indexing everything in the Danish Net Archive (and since yours truly was a bit too quick on the fingers and accidentally deleted a couple of terabytes of perfectly fine shards), we only have a small sample corpus to test on. Hopefully it will be somewhat representative.
Recap: The Net Archive Search at Statsbiblioteket uses fully optimized shards of 900GB / 250M documents. One of the fields is outgoing links from webpages. Each shard has 600M unique values in the links-field, with 6 billion references from documents to values in that field.
Solr needs 2.2GB of heap just to open 1 such shard and perform a non-faceted search. With n-plane faceting, the minimum heap requirement rises to 3.3GB. In production we use a max heap size of 8GB as we have multiple facet fields and want to have wriggle room.
Due to the aforementioned circumstances, we currently only have 4 such shards in our SolrCloud to test on. They are handled by a 16 core Xeon machine with 256GB of RAM, so the hardware power vs. index size is very favourable, compared to the expected full setup with 23 shards.
Testing was done with random words from a Danish dictionary, one request at a time. The amount of threads used for counting was varied between 1, 2 & 4.
The same graph, with 80 seconds as max for the y-axis:
In the legends, _t1 means 1 thread, _t2 is two threads etc. Some observations:
- There is something strange going on with packed in the very low end (1-1000 hits): Responses should be near-instantaneous in that area. This should be investigated further.
- packed is markedly faster relative to n-plane, although the overall speed means that the absolute difference is only visible from 100K hits and up.
- Threading gives a significant boost from 10M hits and up.
Note that the graphs only cover hit counts up to 100M, although there are 1000M documents in the index. Searching for random words does not really give that many hits: Besides the catch-all-term http, the second most popular term og (Danish for and) only occurs in ⅓ of the corpus.
Getting above 100M hits out of the 1000M documents in the current corpus is certainly possible by issuing or-queries, range-queries or similar, but it it not something that is likely to happen very often during interactive use.
Nevertheless, it must be stressed that worst-case performance is poor. Measuring the catch-all *:*-query (1,066,091,283 hits), the timing was:
|Threads||N-plane ms||Packed ms|
For interactive use, random response times in minutes is unacceptable. Besides adding copious amounts of hardware, some ideas for handling that problem are
- Start by doing a fast non-facet search to get hit count. If that number indicates long faceting time, show a warning to the user. Potentially this could be a rough countdown.
- Add sampling functionality, where the sample-count only uses the first 1M hits. Note that the hits are not processed in order of relevance, so the sample will be made up of hits from the beginning of the overall index structure (most likely the oldest records).
A more advanced sampler could advance through the result set to distribute the sampling points better.
The small amount of shards means that there is a lot of CPU power free on the test machine. As it gets filled, there will be less surplus power and the effect of threading is likely to go down. Our setup might end up not getting any advantage out of threaded counting.
On the other hand, the low heap overhead of faceting with n-plane means that we will probably enable faceting on the 600M unique values field. It makes sense to pay the 1100MB heap overhead per shard to get fine-grained graph analysis; especially when the cost of additional counter instances is only about 350MB.