What is high cardinality anyway?

October 2, 2014

An attempt to explain sparse faceting and when to use it in not-too-technical terms. Sparse faceting in Solr is all about speeding up faceting on high-cardinality fields for small result sets. That’s a clear statement, right? Of course not. What is high, what is small and what does cardinality mean? Dmitry Kan has spent a lot of time testing sparse faceting with his high-cardinality field, without getting the promised performance increase. Besides unearthing a couple of bugs with sparse faceting, his work made it clear that there is a need for better documentation. Independent testing for the win!

What is faceting?

When we say faceting in this context, it means performing a search and getting a list of terms for a given field. The terms are ordered by their count, which is the number of times they are referenced by the documents in the search result. A classic example is a list of authors:

  The search for "fairy tale" gave 15 results «hits».

  Author «field»
   - H.C. Andersen «term» (12 «count»)
   - Brothers Grimm «term» (5 «count»)
   - Lewis Carroll «term» (3 «count»)

Note how the counts sum to more than the number of documents: A document can have more than one reference to terms in the facet field. It can also have 0 references, all depending on the concrete index. In this case, there are either more terms than are shown or some of the documents have more than one author. There are other forms of faceting, but they will not be discussed here.

Under the hood

At the abstract level, faceting in Solr is quite simple:

  1. A list of counters is initialized. It has one counter for each unique term in the facet field in the full corpus.
  2. All documents in the result set are iterated. For each document, a list of its references to terms is fetched.
    1. The references are iterated and for each one, the counter corresponding to its term is increased by 1.
  3. The counters are iterated and the Top-X terms are located.
  4. The actual Strings for the located terms are resolved from the index.
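
As a schematic sketch of the four steps (hypothetical helper names, not Solr's actual code), where termRefs[doc] holds the term ordinals referenced by a document:

static int[] facetCounts(int[][] termRefs, int[] resultSet, int uniqueTerms) {
  int[] counters = new int[uniqueTerms];      // step 1: one counter per unique term in the corpus
  for (int docID : resultSet) {               // step 2: iterate the documents in the result set
    for (int termOrdinal : termRefs[docID]) { //   step 2.1: each reference increments its term's counter
      counters[termOrdinal]++;
    }
  }
  return counters;                            // steps 3-4 (not shown): scan ALL the counters for the
}                                             // top-X ordinals and resolve their Strings from the index

Standard Solr then scans every counter in step 3, which is exactly the part sparse faceting avoids for small result sets.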

Sparse faceting improves on standard Solr in two ways:

  • Standard Solr allocates a new list of counters in step 1 for each call, while sparse re-uses old lists.
  • Standard Solr iterates all the counters in step 3, while sparse only iterates the ones that were updated in step 2.

Distributed search is different

Distributed faceting in Solr adds a few steps:

  • All shards are issued the same request by a coordinating Solr. They perform steps 1-4 above and return the results to the coordinator.
  • The coordinator merges the shard-responses into one structure and extracts the Top-X terms from that.
  • For each Top-X term, its exact count is requested from the shards that did not deliver it as part of the first step.

Standard Solr handles each exact-count separately by performing a mini-search for the term in the field. Sparse reuses the filled counters from step 2 (or repeats steps 1-2 if the counter has been flushed from the cache) and simply locates the counters corresponding to the terms. Depending on the number of terms, sparse is much faster (think 5-10x) than standard Solr for this task. See Ten times faster for details.

What is cardinality?

Down to earth, cardinality just means how many there are of something. But what thing? The possibilities for faceting are many: Documents, fields, references and terms. To make matters worse, references and terms can be counted for the full corpus as well as just the search result.

  • Performance of standard Solr faceting is linear in the number of unique terms in the corpus in steps 1 & 3 and linear in the number of references in the search result in step 2.
  • Performance of sparse faceting is (nearly) independent of the number of unique terms in the corpus and linear in the number of references in the search result in steps 2 & 3.

Both standard Solr and sparse treat each field individually, so they both scale linearly with the number of facet fields. The documents returned by the base search are themselves represented in a sparse structure (independent of sparse faceting) that scales with the result set size. While it does take time to iterate over these documents, this is normally dwarfed by the other processing steps. Ignoring the devils in the details: Standard Solr facet performance scales with the full corpus size as well as the result size, while sparse faceting scales just with the result size.

Examples please!

  • For faceting on URL in the Danish Web Archive, cardinality is very high for documents (5bn), references (5bn) and terms (5bn) in the corpus. The overhead of performing a standard Solr faceting call is huge (hundreds of milliseconds), due to the high number of terms in the corpus. As the typical search results are quite a lot smaller than the full corpus, sparse faceting is very fast.
  • For faceting on host in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. However, the number of  terms (1.3m) is more modest. The overhead of performing a standard Solr faceting call is quite small (a few milliseconds), due to the modest number of terms; the time used in step 2, which is linear to the references, is often much higher than the overhead. Sparse faceting is still faster in most cases, but only by a few milliseconds. Not much if the total response time is hundreds of milliseconds.
  • For faceting on content_type_norm in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. It is extremely small for the number of unique terms, which is 10. The overhead of performing a standard Solr faceting call is practically zero; the time used in step 2, which is linear to the references, is often much higher than the overhead. Sparse faceting is never faster than Solr for this and as a consequence falls back to standard counting, making it perform at the same speed.
  • For faceting on author at the library index at Statsbiblioteket, cardinality is high for documents (15m), references (40m) and terms (9m) in the corpus. The overhead of performing a standard Solr faceting call is noticeable (tens of milliseconds), due to the 9m terms in the corpus. The typical search result is well below 8% of the full corpus, and sparse faceting is markedly faster than standard Solr. See Small Scale Sparse Faceting for details.

Sparse facet caching

September 19, 2014

As explained in Ten times faster, distributed faceting in standard Solr is two-phase:

  1. Each shard performs standard faceting and returns the top limit*1.5+10 terms. The merger calculates the top limit terms. Standard faceting is a two-step process:
    1. For each term in each hit, update the counter for that term.
    2. Extract the top limit*1.5+10 terms by running through all the counters with a priority queue.
  2. Each shard returns the number of occurrences of each term in the top limit terms, calculated by the merger from phase 1. This is done by performing a mini-search for each term, which takes quite a long time. See Even sparse faceting is limited for details.
    1. Addendum: If the number for a term was returned by a given shard in phase 1, that shard is not asked for that term again.
    2. Addendum: If the shard returned a count of 0 for any term as part of phase 1, that means it has delivered all possible counts to the merger. That shard will not be asked again.
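
As a hedged sketch of how a merger could apply the two addenda above when deciding which terms to send to a shard for fine-counting (hypothetical names and plain java.util collections, not Solr's actual merging code):

Set<String> termsToRefine(List<String> mergedTopTerms, Map<String, Long> phase1CountsFromShard) {
  if (phase1CountsFromShard.containsValue(0L)) {
    return Collections.emptySet();     // addendum 2: a 0-count means the shard already delivered everything it has
  }
  Set<String> missing = new LinkedHashSet<String>();
  for (String term : mergedTopTerms) {
    if (!phase1CountsFromShard.containsKey(term)) {
      missing.add(term);               // addendum 1: only ask for terms this shard did not already count
    }
  }
  return missing;                      // empty result => the shard is not asked again in phase 2
}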

Sparse speedup

Sparse faceting speeds up phase 1 step 2 by only visiting the updated counters. It also speeds up phase 2 by repeating phase 1 step 1, then extracting the counts directly for the wanted terms. Although it sounds heavy to repeat phase 1 step 1, the total time for phase 2 with sparse faceting is a lot lower than with standard Solr. But why repeat phase 1 step 1 at all?

Caching

Today, caching of the counters from phase 1 step 1 was added to Solr sparse faceting. Caching is a tricky business to get just right, especially since the sparse cache must contain a mix of empty counters (to avoid re-allocation of large structures on the Java heap) as well as filled structures (from phase 1, intended for phase 2). But theoretically, it is simple: When phase 1 step 1 is finished, the counter structure is kept and re-used in phase 2. So time for testing:

15TB index / 5B docs / 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed queries

Note that there are no measurements of standard Solr faceting in the graph. See the Ten times faster article for that. What we have here are 4 different types of search:

  • no_facet: Plain searches without faceting, just to establish the baseline.
  • skip: Only phase 1 sparse faceting. This means inaccurate counts for the returned terms, but as can be seen, the overhead is very low for most searches.
  • cache: Sparse faceting with caching, as described above.
  • nocache: Sparse faceting without caching.

Observations

For 1-1000 hits, nocache is actually a bit faster than cache. The peculiar thing about this hit-range is that chances are high that all shards return all possible counts (phase 2 addendum 2), so phase 2 is skipped for a lot of searches. When phase 2 is skipped, the caching of a filled counter structure is wasted: it must either be cleaned for re-use or discarded if the cache is getting too big. This adds a bit of overhead.

For more than 1000 hits, cache wins over nocache. Filter through the graph noise by focusing on the medians. As the difference between cache and nocache is that the base faceting time is skipped with cache, the difference of their medians should be about the same as the difference of the medians for no_facet and skip. Are they? Sorta-kinda. This should be repeated with a larger sample.

Conclusion

Caching with distributed faceting means a small performance hit in some cases and a larger performance gain in others. Nothing Earth-shattering, and since it works best when there is more memory allocated for caching, it is not clear in general whether it is best to use it or not. Download a Solr sparse WAR from GitHub and try for yourself.

Even sparse faceting is limited

September 11, 2014

Recently, Andy Jackson from UK Web Archive discovered a ginormous Pit Of Pain with Solr distributed faceting, where some response times reached 10 minutes. The culprit is facet.limit=100 (the number of returned values for each facet is 100), as the secondary fine-counting of facet terms triggers a mini-search for each term that has to be checked. With the 9 facets UK Web Archive uses, that’s 9*100 searches in the worst-case. Andy has done a great write-up on their setup and his experiments: Historical UKWA Solr Query Performance Benchmarking.

Pit Of Pain by Andy Jackson, UK Web Archive

The shape of the pit can be explained by the probability of the need for fine-counts: When there are less than 1K hits, chances are that all shards have delivered all matching terms with count > 0 and thus need not be queried again (clever merging). When there are more than 1M hits, chances are that the top-100 terms in each facet are nearly the same for all shards, so that only a few of the terms need fine-counting. Between those two numbers, chances are that a lot of the terms are not present in all initial shard results and thus require fine-counting.

While the indexes at Statsbiblioteket and UK Web Archive are quite comparable (12TB vs. 16TB, built with nearly the same analysis chain), the setups differ with regard to hardware as well as facet setup. Still, it would be interesting to see if we can reproduce the Pit Of Pain™ with standard Solr faceting on our 6 facet fields and facet.limit=100.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 100

Sorta, kinda? We do not have the low response times below 100 hits, and 10 minutes of testing only gave 63 searches, but with the right squinting of eyes, the Pit Of Pain (visualized as a hill to trick the enemy) is visible from ~1K to 1M hits. As for the high response times below 100 hits, they are due to a bad programming decision on my side – expect yet another blog post. As for the pit itself, let’s see how it changes when the limit goes down.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 100, 50 & 5

Getting a little crowded with all those dots, so here’s a quartile plot instead.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 100, 50 & 5

Again, please ignore results below 100 hits. I will fix it! Promise! But other than that, it seems pretty straightforward: High limits have a severe performance penalty, which seems to be more or less linear in the requested limit (hand waving here).

The burning question is of course how it looks with sparse faceting. Technically, distributed sparse faceting avoids the mini-searches in the fine-counting phase, but still requires each term to be looked up in order to resolve its ordinal (the ordinal is used as the index into the internal sparse faceting counter structure). Such a lookup does take time, something like 0.5ms on average on our current setup, so sparse faceting is not immune to large facet limits. Let’s keep the y-axis max of 20 seconds for comparison with standard Solr.

12TB index / 4.2B docs / 256GB RAM, sparse faceting on 6 fields, facet limit 100, 50 & 5

There does appear to be a pit too! Switching to quartiles and zooming in:

12TB index / 4.2B docs / 256GB RAM, sparse faceting on 6 fields, facet limit 100, 50 & 5


This could use another round of tests, but it seems that the pit is present from 10K to 1M hits, fairly analogous to Solr fc faceting. The performance penalty of high limits also matches, just an order of magnitude lower. With a worst-case of 6*100 fine-counts (at ~10^5 hits) on each shard and an average lookup time of ½ms, a mean total response time around 1000ms seems reasonable. Everything checks out and we are happy.

Update 20140912

The limit for each test was increased to 1 hour or 1000 searches, whichever came first, and the tests were repeated with facet.limits of 1K, 10K and 100K. The party stopped early with an OutOfMemoryError for 10K, and since raising the JVM heap size would skew all previous results, what we got is what we have.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 1000

Quite similar to the Solr fc faceting test with facet.limit=100 at the beginning of this post, but with the Pit Of Pain moved a bit to the right and a worst-case of 3 minutes. Together with the other tested limits and quartiled, we have

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 1000, 100, 50 & 5

Looking at the Pit Of Pain in isolation, we have the following median response times (in ms):

facet.limit   10^4 hits   10^5 hits   10^6 hits   10^7 hits
       1000       24559       70061      141660       95792
        100        9498       16615       12876       11582
         50        9569        9057        7668        6892
          5        2469        2337        2249        2168

Without cooking the numbers too much, we can see that the worst increase switching from limit 50 to 100 is for 10^5 hits: 9057ms -> 16615ms or 1.83 times, with the expected increase being 2 (50 -> 100). Likewise the worst increase from limit 100 to 1000 is for 10^6 hits: 12876ms -> 141660ms or 11.0 times, with the expected increase being 10 (100 -> 1000). In other words: Worst-case median response times (if such a thing makes sense) for distributed fc faceting with Solr scale linearly with facet.limit.

Repeating with sparse faceting and skipping right to the quartile plot (note that the y-axis dropped by a factor of 10):

12TB index / 4.2B docs / 256GB RAM, sparse faceting on 6 fields, facet limit 1000, 100, 50 & 5

Looking at the Pit Of Pain in isolation, we have the following median response times (in ms):

facet.limit   10^4 hits   10^5 hits   10^6 hits   10^7 hits
       1000         512        2397        3311        2189
        100         609         960         698         939
         50         571         635         395         654
          5         447         215         248         588

The worst increase switching from limit 50 to 100 is for 10^6 hits: 395ms -> 698ms or 1.76 times, with the expected increase being 2. Likewise the worst increase from limit 100 to 1000 is also for 10^6 hits: 698ms -> 3311ms or 4.7 times, with the expected increase being 10. In other words: Worst-case median response times for distributed sparse faceting appear to scale better than linearly with facet.limit.

Re-thinking this, it becomes apparent that there are multiple parts to facet fine-counting: A base overhead and an overhead for each term. Assuming the base overhead is the same for the same number of hits, we calculate it to be about 408ms, with a per-term overhead of 0.48ms for sparse (remember we have 6 facets, so facet.limit=1000 means a worst-case of fine-counting 6000 terms). If that holds, setting facet.limit=10K would give a worst-case median response time of around 30 seconds.
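
For reference, the arithmetic behind that estimate, using the 10^6-hit medians and 6 facets (so facet.limit=1000 means a worst-case of 6000 fine-counted terms):

  base + 6000*t = 3311 ms   (facet.limit=1000)
  base +  600*t =  698 ms   (facet.limit=100)
  =>  5400*t = 2613 ms  =>  t ≈ 0.48 ms/term,  base ≈ 408 ms
  facet.limit=10000  =>  408 + 60000*0.48 ≈ 29400 ms ≈ 30 seconds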

A few Highlights from the Digital Libraries #DL2014 conference in London

September 10, 2014

This post is written at the DL2014 conference in London in September, where it’s warm and sunny :)

My visit started at the Digital Preservation Sustainability on the EU Policy Level workshop on Monday. This workshop had an interesting panel debate and brought together a number of EU projects in the digital preservation area. Jette Junge will probably blog about this workshop on the opf blog following up on this post.

Tuesday morning at the main conference started with a Keynote by Prof. Dieter Fellner from the Fraunhofer Institute, who talked about 3D-Digitization of Cultural Heritage. He told us that 90% of museum artifacts are hidden in archives, and he proposed using 3D-digitization for preservation and restoration, and maybe replication? He talked of his vision of 3D Cultural Heritage Artifacts as First Class Citizens in Digital Libraries, and he showed us an example of Manual 3D Digitization of “paper” from Easter Island. He also showed us some cool 3D scanner gadgets and asked the (rhetorical) question

Is this just implementation or is it research?

And he presented a number of research challenges:

  • Faster processing, Improved Workflows, Cost Reduction
  • Improve methods and standards for Geometry, Texture, Material Properties
  • Automated Markup and ….

There are already some projects working on speeding up the process, namely DOME, ORCAM and CultLab3D.de (industrial), and projects serving the 3D models to the user using 3D internet and X3DOM. There is also the challenge of 3D Color Printing. And there are a number of Open Issues such as

  • Digital Rights
  • Signatures
  • Certification
  • Formats
  • Long Term Preservation

I see a bunch of interesting Research (and implementation) Challenges in this. I just find it difficult to put it into a State and University Library context ;-)

Next I went to (the important coffee break and next to) Session 1 on Preservation Strategies. This session had four interesting presentations. Note to self: acid.matkelly.com might be worth checking out (from the last presentation in this session Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript by Mat Kelly)!

At lunch time I went to the TPDL Steering Committee meeting (back stage in the dressing room for the women’s choir). It was a long meeting, but a good one. I’m confident that the TPDL conference is on the right path and in the right hands – at least for the immediate future.

In the afternoon I went to Session 4 on Building Systems. Also four interesting presentations. The second one was ours, Bridging the Gap Between Real World Repositories and Scalable Preservation Environments, presented by Per Møldrup-Dalum. It had cool crowd-pleasing slides, and the elephant in the library was a success again!

Then there was the Minute Madness: Poster and Demonstration Session Preview with lots of cool presentations and more people on stage than off it seemed. And the actual Poster and Demonstration Reception in the Garden Room.

The Garden

The Room

I think I’ll publish this now. I may write about today and tomorrow some time. Or I may tell some of you about it offline. Especially the State and University Library web guys. You should be here. Lots of presentations involving Solr and crowd sourcing. Will you join us next year?

 

Small scale sparse faceting

September 9, 2014

While sparse faceting has a profound effect on response time in our web archive, we are a bit doubtful about the number of multi-billion document Solr indexes out there. Luckily we also have our core index at Statsbiblioteket, which should be a bit more representative of your everyday Solr installation: Single-shard, 50GB, 14M documents. The bulk of the traffic is user-issued queries, which involve spellcheck, edismax qf & pf on 30+ fields and faceting on 8 fields. In this context, the faceting is of course the focus.

Of the 8 facet fields, 6 are low-cardinality and 2 are high-cardinality. Sparse was very recently enabled for the 2 high-cardinality ones, namely subject (4M unique values, 51M instances (note to self: 51M!? How did it get so high?)) and author (9M unique values, 40M instances).

To get representative measurements, the logged response times were extracted for the hours 07-22; there’s maintenance going on at night and it skews the numbers. Only user-entered searches with faceting were considered. To compare before- and after sparse-enabling, the data for this Tuesday and last Tuesday were used.

50GB / 14M docs, logged timing from production, without (20140902) and with (20140909) sparse faceting

The performance improvement is palpable, with response times being halved compared to non-sparse faceting. Fine-reading the logs, the time spent on faceting the high-cardinality fields is now in the single-digit milliseconds for nearly all queries. We’ll have to do some tests to see what stops the total response time from getting down to that level. I am guessing spellcheck.

As always, sparse faceting is readily available for the adventurous at SOLR-5894.

Update 20140911

To verify that last Tuesday was not a lucky shot, here are the numbers for the last 4 Wednesdays. Note that the number of queries/day is fairly low for the first two weeks. This is due to semester start. Also note that the 10^8 hits (basically the full document set) were removed, as those were all due to the same query being repeated by a dashboard tool.

50GB / 14M docs, logged timing from production. Only 20140909 is with sparse faceting

Ten times faster

August 26, 2014

One week ago I complained about Solr’s two-phase distributed faceting being slow in the second phase – ten times slower than the first phase. The culprit was the fine-counting of top-X terms, with each term-count being done as an intersection between regular hits and a special search for the term in question.

Let’s have a quick look at the numbers from last week (note that the scales are logarithmic):

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

Imprecise facet counting aka cheating

The simple way to get fast distributed faceting for high cardinality fields is to skip the second phase and accept that the term counts for faceting might be a bit off, where “a bit” is highly dependent on corpus & query. An extremely quick ad-hoc test with our corpus suggested around 10% deviation from the correct counts. The skipping requires just 3 lines of code, strategically placed.

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Apologies for the colors not matching between the charts. The median for plain Solr faceting is 1608 ms. For imprecise facet counting, using sparse faceting for the first phase, it is 168 ms. Quite a speed-up! Read Whale hunting with Solr for an explanation of the weird response times below 100 hits.

Since we are already cheating and getting imprecise counts, we might as well limit the maximum count for each term. In our 12 shards, the maximum count for a single URL in any shard is a little below 5000, with the long tail very quickly dropping. Most counts are below 10 in a single shard. With a count of 5000, we need 13 bits to hold the counter, meaning 3.6 billion terms * 13 bits/term ~= 5.44 GB for all counter structures or about 0.45 GB / shard / thread. If we lower this max count to 255 / shard, so that a single counter fits in a byte, we get faster faceting and reduce the memory overhead to 3.6 GB total or 300 MB / shard / thread.
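
As an illustration of that counter-packing math, here is a hedged sketch using Lucene's PackedInts (not necessarily how the sparse patch actually packs its counters):

import org.apache.lucene.util.packed.PackedInts;

final class CappedCounters {
  private final PackedInts.Mutable counts;
  private final int maxCount;

  CappedCounters(int uniqueTerms, int maxCount) {
    int bits = PackedInts.bitsRequired(maxCount);  // 13 bits for a max just below 5000, 8 bits for 255
    this.counts = PackedInts.getMutable(uniqueTerms, bits, PackedInts.COMPACT);
    this.maxCount = maxCount;
  }

  void increment(int termOrdinal) {
    long current = counts.get(termOrdinal);
    if (current < maxCount) {                      // cap the count instead of widening the counter
      counts.set(termOrdinal, current + 1);
    }
  }

  long get(int termOrdinal) {
    return counts.get(termOrdinal);
  }
}

With 300 million unique URLs per shard, 13 bits/term comes to roughly 0.45 GB per counter structure, matching the numbers above.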

Alas, some of us think that all this cheating is getting out of hand…

Once more, with feeling!

It was possible to speed up the first phase of Solr faceting by doing sparse counting, so let’s try that again: For the second phase, we do a near complete repetition of the first phase, so the counts for all terms in the facet field are calculated. However, instead of calculating and extracting the top-X terms, only the counts for the requested terms are extracted from the counting structure. Extraction of a count from the structure requires resolving the ordinal for the term in question. This does take some time, but the hope was that this would not give too much overhead. So did it help?

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

 

This is getting a bit chaotic and it is hard to see all the cyan dots hiding between the green ones. Trusty old percentile plot to the rescue:

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting with sparse term lookup

Now we can see! With the second phase of faceting being nearly as fast as first phase, total faceting time for small result sets is looking quite good. If we lump all the measurements for each method together, we get

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Note how the median for skip_secondary is a lot lower than for the previous test run – it seems that someone else was using the machine at that time. The output has been verified by random inspection: It really only takes a few hundred milliseconds to facet on small result sets out of more than 3 billion documents on a single machine. Just as it should, one might add. It remains to be tested if smaller setups benefit just as much.

We’re not done yet

The implementation is not ideal as the exact same work – counting of all term occurrences – is very often done twice. It would be a lot better to cache it. When releasing a counter structure to the pool of structures, it could be released with a tag stating if it would probably be used again (first phase of distributed faceting) or if it would probably not be needed any more (after the second phase). Guesstimating, this should shave 30-40% off the total time for faceting with sparse term lookup.

Should anyone want to try sparse faceting for themselves, then visit https://github.com/tokee/lucene-solr and check out branch lucene_solr_4_8_sparse or lucene_solr_4_9_sparse. You will need an existing Solr index for proper testing. Refer to the file SparseKeys.java for options. The defaults work fine if the parameters facet.sparse=true and facet.sparse.termlookup=true are given, the requested facet field has over 10,000 unique values and facet.method=fc. To disable the second phase completely, add the parameter facet.sparse.skiprefinements=true. Proper documentation pending.
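
From SolrJ, such a request could look like this hedged sketch (the endpoint and the url field name are placeholders for your own setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SparseFacetDemo {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
    SolrQuery query = new SolrQuery("fairy tale");
    query.setFacet(true);
    query.addFacetField("url");                  // a high-cardinality field (10,000+ unique values)
    query.setFacetLimit(25);
    query.set("facet.method", "fc");             // sparse only applies to facet.method=fc
    query.set("facet.sparse", "true");
    query.set("facet.sparse.termlookup", "true");
    // query.set("facet.sparse.skiprefinements", "true"); // optionally skip the second phase entirely
    QueryResponse response = solr.query(query);
    System.out.println(response.getFacetField("url").getValues());
  }
}

The sparse options are plain request parameters, so they can just as well be appended directly to a normal HTTP request against /select.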

If you want to see this in Solr, visit SOLR-5894 and vote for it.

Update 20140826 22:26

To verify things, the experiment was repeated with 10 minute running time for each faceting method (as opposed to 3 minutes). This did not affect the conclusion, but might add a little bit of information.

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

The first thing to note is how response time rises nearly linearly with result set size when the result set is 5M or more. It would be interesting to investigate what happens around the 300M (8% of the corpus size) mark, which is the limit chosen for sparse faceting for this setup.

The second observation is that all methods except stock Solr (the blue dots) seem to have two clusters, one below 100ms and one above. As the no_facet method is single phase (note to self: Double check if this is true), this cannot be explained by the second phase being skipped. Maybe there is some caching effect? The queries should be as good as unique, so it is not just because of simple request caching.

For an alternative illustration, here’s the same data as above but without the logarithmic y-scale:

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Ten times slower

August 15, 2014

I jumped the gun on our current web index status with our whale hunting safari, but it turns out that there are other fish to kill (sorry, whales are not fish and I’ll stop with the nautical metaphors immediately). This time we’re talking faceting on billions of documents.

At Statsbiblioteket we are building a SolrCloud web index of our harvested material. 9 months ago that was 372 TB or 10 billion entities. It has probably passed 400 TB by now. Our insane plan was to index it, put the indexes on a single machine and to do faceted searches. Because it made sense and maybe a little because it is pure awesome to handle that amount of information from a single box. Read about our plans in the post Webscale in Danish.

It has been about three months since we last checked how things were with our searchable web archive at Statsbiblioteket. Back then it contained 4 shards for a total of 3.6 TB or 1.2 billion documents, on our single 16 core machine with 256 GB of RAM. Simple searches were fast and faceting on URL (nearly unique for every document) equally so, even when we lowered the amount of RAM to 80 GB, which meant that only 1% of the total index data could be cached in RAM. The two graphs below illustrate the state back then.

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

Sparse faceting is markedly better than stock Solr under our test; performance is very satisfactory with most response times well below 1 second. It is important to note that the distributed faceting calls are executed with a single thread in each shard for this setup. This means that only 4 CPUs were fully utilized at a time during a single faceted search.

Yesterday we reached 12 shards for a total of 10.7 TB of index data in 3.6 billion documents. Turning off the artificial RAM limitation left us with 140 GB of RAM for disk cache or 1.3% of the full index size. So more RAM for caching than we had with 4 shards and still plenty of CPU power as the machine has 16 cores (*2 if you count HyperThreading). Of course the merging gets a little heavier, but not too much. In an ideal world this would mean that the performance would be unchanged, right?

No, we did not think so either. So how bad is it?

 

256GB RAM, 1 thread, 12 shards, 10TB, random words, Solr & sparse faceting on URL

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL

Ouch. 2½ second median? That’s bad! But wait a minute… Doesn’t that percentile plot look like a whale? And how come this is so much slower than our 4 shard setup, and how come it is faster to facet on 100,000 hits than it is to facet on 10,000? Time to check the logs.

Distributed Solr faceting is two-phase. First phase is standard faceting (find the top-X facet terms for a given search). The merger then collects the results, sums them and extracts the collective top-X terms. Second phase is to ask the shards for the counts for those terms, to get the correct count as the terms might not have been returned from all shards in the first phase. The merger is smarter than that, but the principle holds.

It seems logical that the second phase is faster than the first phase: It just has to calculate counts for a limited number of specified terms instead of performing full facet processing. Let’s go back to the logs and plot the response times for the first and second phase separately. Note that these numbers are from the 12 Solr logs at shard-request level and not at the full distributed call level.

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

There goes that logical conclusion. The second phase takes more than 10 times as long as the first phase! What is happening? We need to dig deeper and look at the surprisingly simple code for the second phase:

private NamedList getListedTermCounts(String field, String termList, DocSet base) throws IOException {
  FieldType ft = searcher.getSchema().getFieldType(field);
  List<String> terms = StrUtils.splitSmart(termList, ",", true);
  NamedList<Integer> res = new NamedList<Integer>();
  for (String term : terms) {
    String internal = ft.toInternal(term);
    // A mini-search per term: intersect the query result (base) with the docs matching field:term
    int count = searcher.numDocs(new TermQuery(new Term(field, internal)), base);
    res.add(term, count);
  }
  return res;
}

We have the result of the query in the bitset base; with shards of 300 million documents, that is a rather large bitset. For each term in the second phase facet request, a simple search is performed for facet_field:specific_term. This results in another bitset. The number of intersecting bits in these two sets is the count for the term.

The problem here is that we are doing intersections of very large bitsets. Potentially they can be represented by compact hashmaps or other structures, but the log tells us that this still takes quite a lot of time for a corpus of this size. Time that grows as the number of shards grows.

Guessing time: If at least one of the bitsets is a bitmap with 1 bit per document in the index, that takes up about 40 MB of heap, which is accessed when doing the intersection. If there are 20 terms in the request (quite likely as we ask for top-20 on URL), this is done 20 times. So at least 800MB of memory is accessed. With 12 shards doing faceting in parallel, this is just shy of 10 GB of memory. It is hard to measure memory wait time, but it seems a likely culprit that we are simply waiting for main memory.

The shape of the phase 2 plot is easily explainable: With 10 to 10,000 hits, all shards must provide counts for nearly the full set of terms. When we get above 100,000, the chances of any shard having already delivered the count for part of the top X terms rise; when a count has already been delivered, it will not be requested again in the second phase, so the workload gets smaller.

So what to do about that? The easiest thing is to skip the second phase completely. That would give us the great response times instantly, at the cost of precision: The counts for the individual terms in the facet might be a bit off. But maybe there is a way to get the fast speed and the precise counts? The full facet counting out in the shards in the first phase was quite a lot faster, so if we do that again (or cache the old result), we have counts for all terms in the facet. For each specified term, we could get its ordinal (for example by binary search, meaning 36 lookups in our concrete ordinal-term map) and with that, we could get the count directly. Ordinal-term lookups are somewhat heavy as the index data on storage needs to be accessed, so it remains to be seen if this way of handling the second phase is faster than the standard one. Time to code again.
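
A hedged sketch of the idea (hypothetical names): with all counts already present in counters, from repeating or caching the first phase, fine-counting a requested term is an ordinal lookup plus an array read. Here a sorted term array stands in for the ordinal-term map:

NamedList<Integer> sparseListedTermCounts(String[] requestedTerms, String[] sortedTerms, int[] counters) {
  NamedList<Integer> res = new NamedList<Integer>();
  for (String term : requestedTerms) {
    int ordinal = java.util.Arrays.binarySearch(sortedTerms, term); // ~log2(#unique terms) probes
    res.add(term, ordinal < 0 ? 0 : counters[ordinal]);             // term absent in this shard => count 0
  }
  return res;
}

No mini-searches and no large bitset intersections; the remaining cost is the ordinal resolution per requested term.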

Whale hunting with Solr

August 13, 2014

Our web archive index passed the 10TB mark a few days ago, so it was time for new performance measurements. To recap: 12 shards @ 900 GB, a total of 10.7TB or 3.6 billion documents. Served from a single 256GB machine with a 1 TB SSD dedicated for each shard.

We started with simple sequential searches for random words from a Danish dictionary. No faceting or other fancy stuff. The requests were for the top-10 documents with their stored content. We measured the full processing time (i.e. HTTP GET) and got this:

Non-faceting searches on a 10.7TB, 3.6B document Solr index

We call it the whale and we have been a bit obsessed with it, since we discovered it 3 months ago when we saw it with 4 shards. Response times for 100 to 1 million hits are great, but what happens with the response times around 10 hits!? Inspection of the Solr logs showed nothing suspicious: Solr’s reported processing time (QTime) for the individual shards was 1 or 2 ms for the queries in question, while QTime for the merging Solr instance was 20-50 ms. Those are fine numbers.

Some of the queries with 10 hits were “blotlægge gøglertrup”, “eksponeringstid attestering” and “fuldkost hofleverandør” (quite funny in Danish actually; remember the words were selected at random). Those searches all took around 500 ms, measured from the outside of Solr, with reported QTimes below 50 ms. Could it be an HTTP peculiarity, as Mikkel Kamstrup suggested? Diving into the concrete responses illuminated us.

Simple queries with very few hits in a large corpus happen because the query terms rarely occur in the same document. So which documents have a high chance of co-occurrence of random words from the dictionary? A dictionary of course! In a (hopefully vain) attempt at “search engine optimization”, some Danish web pages have embedded a dictionary below the real content (presumably hidden by making the font color the same as the background or something like that). Normally such pages are ranked low due to the magic of Lucene/Solr, but with very few hits, they still become part of the search result.

So, statistically speaking, searches with few results give us huge pages. Internally in Solr they are still processed quite fast (hence the fine QTime numbers), but serializing the result to XML is not a light task when the result is measured in megabytes. Had we just requested a few fields, such as URL and content_type, there would have been no hiccup. But we requested everything stored, including content_text. If we just request 15 limited-length fields for each document and repeat the test, we get this:

Non-faceting searches on a 10.7TB, 3.6B document Solr index, limited fields returned

Now that was strange. We got rid of the humpback, but overall the performance suffered? Does it take more time to ask for specific stored fields instead of all? Still, response times below 100 ms for the majority of searches is quite acceptable. Mystery considered solved!

Terabyte index, search and faceting with Solr

June 17, 2014

Vision, data & index building

Providing responsive freetext search and display with faceting and grouping for the full Danish net archive, for a limited number of researchers. The data in the net archive has been harvested from *.dk-domains using the Heritrix harvester since 2005 and currently amounts to 450TB, with approximately 10 billion entities.

Indexing is done with netarchive/netsearch, developed primarily by Thomas Egense and based on UKWA Webarchive Discovery: A central service keeps track of ARC-files, X controllers request the paths for ARC-files and keep Y workers running. Each worker uses Tika to analyse the given ARC-file and sends the generated Solr documents to a Solr instance, specified by its controller. When the wanted index size is reached (900GB in our instance), the index is optimized down to a single segment and pushed to the search server.

Currently indexing is done on a single 24 core (48 with HT) Linux machine with 256GB RAM and 7TB SSD in RAID 0, running all parts of the indexing workflow. The sweet spot for that machine is 40 workers and 1 Solr, resulting in 90%+ CPU usage, primarily used by the Tika workers. It takes about 8 days to build one 900GB index. As of 2014-06-17, 4 such indexes have been built.

The indexing machine is not very well balanced with way too much RAM: Each worker runs fine with 1GB, Solr takes 32GB in order to handle merging down to a single 900GB index; 80GB would be enough. The SSDs in RAID 0 also have too much free space; 3-4TB would work fine with room for 2 shards. We expect the machine to be used for other jobs when the full indexing has finished and we switch to a much lighter day-to-day index update.

Search architecture

A single rack-mounted Linux server is responsible for handling the full search load. It is a 16 core (32 with HT) machine with 256GB RAM and 2 disk controllers with a total of 24 1TB commodity Samsung 840 SSDs, each mounted individually, each holding a separate index, each handled by a separate Solr instance. Distributed search is done with SolrCloud. The total cost for the search hardware is < €20K.

Search in the web archive is not high-availability – we accept that there can be downtime. Should an SSD fail, search will be unavailable until a new one has been installed and its index restored from backup. We are looking into using the backup files for failed drives directly from the backup storage as a temporary measure until the drives are ready again, but that is only at the investigation stage.

Basic search performance

At the time of testing, the index consists of 4 shards @ 900GB for a total of 3.6TB of index data with 1.2 billion documents. Each Solr instance (one per shard) has 8GB of heap. As the machine is built for 20-24 shards, the index data represents just 1/6th of the expected final size. This leaves the machine vastly overpowered in its current state, with a surplus of CPUs and ~220GB of RAM for disk caching.

How overpowered? We tested back when the index was only 3 shards for a total of 2.7TB: User issued queries are handled with edismax on 6 fields and 1 phrase query on the catch-all field, a max of 20 returned documents with 10-15 stored fields of varying size. We tried hitting the server with just a single thread:

256GB RAM, 1 thread, 3 shards, 2.7TB, random words, no faceting

Response times way below 100ms when the number of hits is below 1 million, better than linear scaling above that? On an unwarmed index? Time to up the ante! What about 20 threads, this time on 4 shards for a total of 3.6TB?

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting

It looks a bit like a whale and with 20K points, it is very hard to make sense of. Time to introduce another way of visualizing the same data:

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting, percentile plot

This is a Box and Whisker plot, showing the quartiles as well as the min and max response times. The measurements are bucketed with 1-9 hits in the first bucket, 10-99 hits in the next and so forth. Again the magic point seems to be around 1M hits before performance begins to drop. The throughput was 66 searches/second. Repeating the search with 40 threads resulted in about the same throughput and about a doubling of the response times, which indicates that the 16 CPUs are the bottleneck.

Now, the Danish web archive is not Google. Due to legislation, the number of concurrent users will normally be 1 and searches will involve statistics and drill-downs, primarily meaning facets. While very impressive, the measurements above are really not representative of the expected use scenario. Time to tweak the knobs again.

Faceting on high-cardinality fields

For the end-scenario, we plan on faceting on 6 fields. One of them is the URL of the harvested resource, with nearly 1 unique value per resource. That means around 300 million unique values per shard, with 1.2 billion in the current 4 shard index and an estimated 7 billion in the final 24 shard index.

Normally it would seem rather unrealistic to facet on 300M+ documents with nearly as many unique values with 8GB of heap (the allocation for each Solr instance), but there are several things that help us here:

  • The URL-field is single value, meaning a smaller and faster internal faceting structure
  • Each index is single segment, so no need to maintain a global-ords-mapping structure, fully skipping this normally costly memory overhead
  • DocValues works really well with high-cardinality fields, meaning low memory overhead

For this experiment we switched back to single threaded requests, but added faceting on the URL field. To make this slightly more representative of the expected final setup we also lowered the amount of RAM to 80GB. This left about 40GB for disk caching of the 3.6TB index data, or about 1%.

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr faceting on URL

500-1200ms for a search on 3.6TB with very high-cardinality faceting. Nice. But, how come the response time never gets below 500ms? This is due to a technicality in Solr faceting where it iterates counters for all potential facet terms (1.2 billion in this case), not just the ones that are actually updated. A more thorough explanation as well as a solution can be found in the blog post on Sparse Faceting. Let’s see a graph with both Solr standard and sparse faceting:

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

Or viewed as a Box and Whiskers plot, for sparse faceting only:

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

A quite peculiar looking development of response times. Still, looking at the whale graph at the beginning of this post, they do seem somewhat familiar. This is definitely an experiment that could benefit from a re-run with more sample points. Anyway, notice how even searches with 10M hits are faceted in less than 800ms.

Conclusion

So far our setup for search in the Danish web archive looks very promising. We have shown that searching with faceting on very high-cardinality fields can be achieved with acceptable (< 1 second in our case) response times on relatively cheap hardware. We will continue testing as the index grows and adjust our hardware, should it prove inadequate for the full corpus.

Update 2014-10-11

We have filled 19 out of the 25 SSDs for a total of 17TB or 5.6 billion documents. We have also isolated and partly solved the performance drop at 1000-100,000 hits that is visible in the graphs above; see ten times faster for details. Overall performance remains well within expectations.

Our biggest miscalculation seems to have been for the total size of the full web archive index: 25 SSDs will not be enough to hold it all. Adding another row of 25 SSDs to the single machine seems unrealistic as the current bottleneck is CPU & RAM speed, not storage I/O. The current plan is to scale out by buying 1 or 2 extra machines with the same specs as the current one.

Sparse facet counting without the downsides

April 4, 2014

The SOLR-5894 issue “Speed up high-cardinality facets with sparse counters” is close to being functionally complete (facet.method=fcs and facet.sort=index still pending). This post explains the different tricks used in the implementation and their impact on performance.

Baseline

Most of the different Solr faceting methods (fc & fcs; with and without doc-values; single- and multi-value) use the same overall principle for counting tag occurrences in facets:

  1. Allocate one or more counters of total size #unique_tags
  2. Fill the counters by iterating a hit queue (normally a bitmap) and getting corresponding counter indexes from a mapper
  3. Extract top-x tags with highest count by iterating all counters

There are 3 problems with this 3 step process: Allocation of a (potentially large) structure from memory, iteration of a bitmap with #total_documents entries and iteration of a counter with #unique_tags. Ideally there would be no allocation, iteration of just the IDs of the matched documents and iteration of just the tag counters that were updated. Sparse facet counting solves 2 out of the 3 problems.

Sparse

In this context sparse is seen as performance-enhancing, not space-reducing. SOLR-5894 solves the extraction time problem by keeping track of which counters are updated. With this information, the extraction process no longer needs to visit all counters.  A detailed explanation can be found at fast-faceting-with-high-cardinality-and-small-result-set. However, there are some peculiarities to sparse tracking that must be considered.
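
A minimal sketch of the tracking idea (hypothetical names, not the actual SOLR-5894 classes): alongside the counters, remember which ordinals have been touched, so extraction only needs to visit those.

final class SparseCounter {
  private final int[] counts;   // one counter per unique tag, as before
  private final int[] touched;  // ordinals of the counters that have been updated
  private int touchedCount = 0;

  SparseCounter(int uniqueTags, int trackerSize) {
    counts = new int[uniqueTags];
    touched = new int[trackerSize];          // e.g. a fixed fraction of uniqueTags, see below
  }

  void increment(int ordinal) {
    if (counts[ordinal]++ == 0 && touchedCount < touched.length) {
      touched[touchedCount++] = ordinal;     // first update of this ordinal: track it
    }
  }

  boolean isSparse() {                       // tracker exhausted => extraction falls back to a full scan,
    return touchedCount < touched.length;    // but the collected counts remain valid
  }

  int[] updatedOrdinals() {
    return java.util.Arrays.copyOf(touched, touchedCount); // only these need visiting at extraction
  }
}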

Processing overhead

Naive sparse faceting

The black line is Solr field faceting on a multi-valued field (3 values/document), the red line is the sparse implementation on the same field. When the result set is small, sparse processing time is markedly lower than standard, but when the result set is > 10% of all documents, it becomes slower. When the result set reaches 50%, sparse takes twice as long as standard.

This makes sense when one considers that both updating and extracting a single counter has more processing overhead with sparse: When the number of hits rises, the accumulated overhead gets bad.

Maximum sparse size

Okay, so tracking does not make much sense past a given point. Besides, having a tracker the size of the counters themselves (100% overhead) seems a bit wasteful. Fixing the tracker size to the cross-over-point is the way to go. We choose 8% here. Thanks to the beauty of the tracking mechanism, exceeding the tracker capacity does not invalidate the collected results, it just means a logical switch to non-track-mode.

8% tracker

No doubt about where the sparse counter switches to non-sparse mode. Note how the distance from Solr standard (black line) to sparse with tracker-overflow (red line past 8%) is near-constant: Up until 8% there is an overhead for updating the tracker. When the tracker has overflowed that overhead disappears for the rest of the counter updates, but the cycles used for tracking up to that point are wasted.

Selection strategy

So memory overhead was reduced to 8% and performance was better for the very high hit counts, but still quite a bit worse than Solr standard. If only we could foresee if the sparse tracker would be overflowed or not.

We cannot determine 100% whether the tracker will be blown or not (at least not in the general case), but we can guess. Under the assumption that the references from documents to tags are fairly uniformly distributed, we can use the hit count (which we know when we start facet calculation) to guess whether the number of updated tag counts will exceed the tracker capacity.
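
In code, the guess could be as simple as this sketch (assuming the uniform distribution mentioned above; avgRefsPerDoc is the average number of tag references per document):

boolean attemptSparse(long hits, double avgRefsPerDoc, int trackerCapacity) {
  long estimatedUpdates = (long) (hits * avgRefsPerDoc); // upper bound: duplicate tags only lower it
  return estimatedUpdates <= trackerCapacity;            // otherwise go straight to standard counting
}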

Sparse bad guesses for cut-off

The chart demonstrates how bad guessing of the result size affects performance. The conservative guess (red line) means that many of the faceting calls are done by falling back to standard Solr and that the sparse speed-up is wasted. The optimistic guess (cyan line) has a higher risk of failed sparse-attempts, which means bad performance. In this example, the bad guess was around 10%. Still, even with such hiccups, the overall positive effect of using sparse counting is clear.

Good guessing

The best cut-off point for sparse/non-sparse attempt depends on the corpus and the searches, as well as the willingness to risk increased response times. Properly tuned and with a corpus without extreme outliers (such as a single very popular document referencing 10% of all tags), the result will be very satisfying.

Sparse good guess

For the low price of 8% memory overhead we get much better performance for small result sets and no penalty for larger result sets (under the assumption of correct guessing).

Counter allocation

Doing a little instrumentation it becomes clear that it is by no means free just to allocate a new counter structure with each facet call and throw it away after use. In the example above, 5M*3*4byte = 60MB are used for a counter. With a 2GB heap and otherwise non-tweaked execution of Solr’s start.jar, the average time used to allocate the 60MB was 13ms!

An alternative strategy is to keep a pool of counters and re-use them. This means that counters must be cleared after use, but this is markedly faster than allocating new ones. Furthermore this can be done by a background thread, so that the client can get the response immediately after the extraction phase. Enabling this, the picture gets even better.

Sparse everything

For very small result sets there is virtually no performance penalty for faceting.
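
As a closing aside, here is a hedged sketch of the pooled counters with background clearing described above (hypothetical names, not the actual SOLR-5894 classes): release hands the used counter to a cleaner thread, so the client gets its response right after extraction while the clearing happens off the request path.

import java.util.Arrays;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CounterPool {
  private final Queue<int[]> ready = new ConcurrentLinkedQueue<int[]>();
  private final ExecutorService cleaner = Executors.newSingleThreadExecutor();
  private final int uniqueTags;

  CounterPool(int uniqueTags) {
    this.uniqueTags = uniqueTags;
  }

  int[] acquire() {
    int[] counter = ready.poll();            // reuse a cleared counter if one is available
    return counter != null ? counter : new int[uniqueTags];
  }

  void release(final int[] counter) {
    cleaner.execute(new Runnable() {         // clear in the background, off the request path
      public void run() {
        Arrays.fill(counter, 0);
        ready.add(counter);
      }
    });
  }
}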

