Ten times faster

One week ago I complained about Solr’s two-phase distributed faceting being slow in the second phase – ten times slower than the first phase. The culprit was the fine-counting of top-X terms, with each term-count being done as an intersection between regular hits and a special search for the term in question.
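
As a quick recap, here is a minimal sketch (plain Java collections, not Solr’s actual classes) of why that per-term fine-counting is expensive: each refined term requires its own intersection between the result set and the documents containing the term, i.e. one extra mini-search per term per shard.

    import java.util.*;

    public class RefineByIntersection {
        // One intersection per refined term: this is the work that makes phase 2 slow.
        static int refineCount(Set<Integer> hits, Set<Integer> docsWithTerm) {
            int count = 0;
            for (int doc : docsWithTerm) {
                if (hits.contains(doc)) {
                    count++;
                }
            }
            return count;
        }

        public static void main(String[] args) {
            Set<Integer> hits = new HashSet<>(Arrays.asList(1, 5, 9, 42));
            // The posting lists below stand in for index lookups of the refined terms.
            Map<String, Set<Integer>> postings = new LinkedHashMap<>();
            postings.put("http://example.org/a", new HashSet<>(Arrays.asList(1, 2, 42)));
            postings.put("http://example.org/b", new HashSet<>(Arrays.asList(5, 7)));
            // Refining the top-X terms means X of these intersections per shard.
            postings.forEach((term, docs) ->
                    System.out.println(term + ": " + refineCount(hits, docs)));
        }
    }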

Let’s have a quick look at the numbers from last week (note that the scales are logarithmic):

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests]

Imprecise facet counting aka cheating

The simple way to get fast distributed faceting for high-cardinality fields is to skip the second phase and accept that the term counts for faceting might be a bit off, where “a bit” is highly dependent on corpus and query. An extremely quick ad-hoc test with our corpus suggested around 10% deviation from the correct counts. The skipping requires just 3 lines of code, strategically placed.
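
As an illustration of what skipping the refinement amounts to (a sketch, not the three actual lines in Solr), the first-phase counts from the shards are simply merged and the top-X returned as-is; the merged counts can be too low when a term happens to be missing from a shard’s top list.

    import java.util.*;
    import java.util.stream.*;

    public class SkipRefinement {
        // Merge per-shard first-phase counts and return the top-X without refining.
        static List<Map.Entry<String, Long>> mergeTopX(List<Map<String, Long>> shardCounts, int topX) {
            Map<String, Long> merged = new HashMap<>();
            for (Map<String, Long> shard : shardCounts) {
                shard.forEach((term, count) -> merged.merge(term, count, Long::sum));
            }
            return merged.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .limit(topX)
                    .collect(Collectors.toList());
        }
    }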

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting]

Apologies for the colors not matching between the charts. The median for plain Solr faceting is 1608 ms. For imprecise facet counting, using sparse faceting for the first phase, it is 168 ms. Quite a speed-up! Read Whale hunting with Solr for an explanation of the weird response times below 100 hits.

Since we are already cheating and getting imprecise counts, we might as well limit the maximum count for each term. In our 12 shards, the maximum count for a single URL in any shard is a little below 5000, and the long tail drops off very quickly: most counts are below 10 in a single shard. With a maximum count of 5000, we need 13 bits to hold each counter, meaning 3.6 billion terms × 13 bits/term ≈ 5.44 GB for all counter structures, or about 0.45 GB / shard / thread. If we lower this maximum count to 255 / shard, so that a single counter fits in a byte, we get faster faceting and reduce the memory overhead to 3.6 GB total or 300 MB / shard / thread.
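
A sketch of the byte-counter variant (class and method names are made up; per shard, the roughly 300 million term ordinals fit comfortably in a Java array): each counter saturates at 255 instead of wrapping around.

    public class SaturatingByteCounters {
        private final byte[] counts; // one byte per term ordinal in this shard

        public SaturatingByteCounters(int uniqueTermsInShard) {
            counts = new byte[uniqueTermsInShard];
        }

        public void increment(int termOrdinal) {
            int current = counts[termOrdinal] & 0xFF; // read the byte as unsigned
            if (current < 255) {                      // saturate instead of overflowing
                counts[termOrdinal] = (byte) (current + 1);
            }
        }

        public int get(int termOrdinal) {
            return counts[termOrdinal] & 0xFF;
        }
    }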

Alas, some of us think that all this cheating is getting out of hand…

Once more, with feeling!

It was possible to speed up the first phase of Solr faceting by doing sparse counting, so let’s try that again: For the second phase, we do a near-complete repetition of the first phase, so the counts for all terms in the facet field are calculated. However, instead of calculating and extracting the top-X terms, only the counts for the requested terms are extracted from the counting structure. Extracting a count from the structure requires resolving the ordinal for the term in question. This does take some time, but the hope was that it would not add too much overhead. So did it help?
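
A minimal sketch of that lookup (the termToOrdinal map stands in for the term-dictionary lookup the index would actually perform): fill the counters exactly as in the first phase, then read off only the requested terms instead of extracting a top-X.

    import java.util.*;

    public class SparseTermLookup {
        // counts: filled by the same counting as in the first phase.
        // termToOrdinal: stand-in for resolving a term to its ordinal in the field's dictionary.
        static Map<String, Integer> refine(int[] counts,
                                           Map<String, Integer> termToOrdinal,
                                           List<String> requestedTerms) {
            Map<String, Integer> refined = new LinkedHashMap<>();
            for (String term : requestedTerms) {
                Integer ordinal = termToOrdinal.get(term); // the costly part in practice
                refined.put(term, ordinal == null ? 0 : counts[ordinal]);
            }
            return refined;
        }
    }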

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting]

This is getting a bit chaotic and it is hard to see all the cyan dots hiding between the green ones. Trusty old percentile plot to the rescue:

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting with sparse term lookup]

Now we can see! With the second phase of faceting being nearly as fast as the first phase, total faceting time for small result sets is looking quite good. If we lump all the measurements for each method together, we get

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting]

Note how the median for skip_secondary is a lot lower than for the previous test run – it seems that someone else was using the machine at that time. The output has been verified by random inspection: It really only takes a few hundred milliseconds to facet on small result sets out of more than 3 billion documents on a single machine. Just as it should, one might add. It remains to be tested whether smaller setups benefit just as much.

We’re not done yet

The implementation is not ideal, as the exact same work – counting all term occurrences – is very often done twice. It would be a lot better to cache it: when releasing a counter structure to the pool of structures, it could be released with a tag stating whether it will probably be used again (after the first phase of distributed faceting) or probably not be needed any more (after the second phase). Guesstimating, this should shave 30-40% off the total time for faceting with sparse term lookup.
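
A sketch of how such a reuse-tagged pool could look (purely illustrative, not the SOLR-5894 code): counters released after the first phase are kept filled and keyed by query, while counters released after the second phase are cleared and returned to the empty pool.

    import java.util.*;

    public class CounterPool {
        private final Map<String, int[]> filled = new HashMap<>(); // query -> counters kept for reuse
        private final Deque<int[]> empty = new ArrayDeque<>();     // cleared counters ready for reuse

        public synchronized int[] takeFilled(String query) {
            return filled.remove(query); // non-null: the first-phase counts can be reused directly
        }

        public synchronized int[] acquireEmpty(int uniqueTerms) {
            int[] counters = empty.poll();
            return counters != null ? counters : new int[uniqueTerms];
        }

        // likelyReuse = true after the first phase, false after the second.
        public synchronized void release(String query, int[] counters, boolean likelyReuse) {
            if (likelyReuse) {
                filled.put(query, counters); // keep the filled counts around for the expected second phase
            } else {
                Arrays.fill(counters, 0);    // clear before handing back as an empty structure
                empty.push(counters);
            }
        }
    }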

Should anyone want to try sparse faceting for themselves, visit https://github.com/tokee/lucene-solr and check out the branch lucene_solr_4_8_sparse or lucene_solr_4_9_sparse. You will need an existing Solr index for proper testing. Refer to the file SparseKeys.java for options. The defaults work fine if the parameters facet.sparse=true and facet.sparse.termlookup=true are given, the requested facet field has over 10,000 unique values and facet.method=fc is used. To disable the second phase completely, add the parameter facet.sparse.skiprefinements=true. Proper documentation pending.
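
As a usage example, a request could look something like the following (host, collection, query and the url field name are placeholders for whatever your setup uses):

    http://localhost:8983/solr/collection1/select?q=kittens&facet=true&facet.field=url&facet.method=fc&facet.sparse=true&facet.sparse.termlookup=true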

If you want to see this in Solr, visit SOLR-5894 and vote for it.

Update 20140826 22:26

To verify things, the experiment was repeated with 10 minutes of running time for each faceting method (as opposed to 3 minutes). This did not affect the conclusion, but adds a little extra information.

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting]

The first thing to note is how response time rises near-linearly with result set size when the result set is 5M documents or more. It would be interesting to investigate what happens around the 300M mark (8% of the corpus size), which is the limit chosen for sparse faceting for this setup.

The second observation is that all methods except stock Solr (the blue dots) seem to have two clusters, one below 100 ms and one above. As the no_facet method is single-phase (note to self: double-check if this is true), this cannot be explained by the second phase being skipped. Maybe there is some caching effect? The queries should be as good as unique, so it is not just simple request caching.

For an alternative illustration, here’s the same data as above but without the logarithmic y-scale:

[Figure: 256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting]
