What is high cardinality anyway?

An attempt to explain sparse faceting and when to use it in not-too-technical terms. Sparse faceting in Solr is all about speeding up faceting on high-cardinality fields for small result sets. That’s a clear statement, right? Of course not. What is high, what is small and what does cardinality mean? Dmitry Kan has spend a lot of time testing sparse faceting with his high-cardinality field, without getting the promised performance increase. Besides unearthing a couple of bugs with sparse faceting, his work made it clear that there is a need for better documentation. Independent testing for the win!

What is faceting?

When we say faceting in this context, it means performing a search and getting a list of terms for a given field. The terms are ordered by their count, which is the number of times they are referenced by the documents in the search result. A classic example is a list of authors:

  The search for "fairy tale" gave 15 results «hits».

  Author «field»
   - H.C. Andersen «term» (12 «count»)
   - Brothers Grimm «term» (5 «count»)
   - Lewis Carroll «term» (3 «count»)

Note how the counts sums up to more than the number of documents: A document can have more than one reference to terms in the facet field. It can also have 0 references, all depending on the concrete index. In this case, there are either more terms than are shown or some of the documents have more than one author . There are other forms of faceting, but they will not be discussed here.

Under the hood

At the abstract level, faceting in Solr is quite simple:

  1. A list of counters is initialized. It has one counter for each unique term in the facet field in the full corpus.
  2. All documents in the result set are iterated. For each document, a list of its references to terms is fetched.
    1. The references are iterated and for each one, the counter corresponding to its term is increased by 1.
  3. The counters are iterated and the Top-X terms are located.
  4. The actual Strings for the located terms are resolved from the index.

Sparse faceting improves on standard Solr in two ways:

  • Standard Solr allocates a new list of counters in step 1 for each call, while sparse re-uses old lists.
  • Standard Solr iterates all the counters in step 3, while sparse only iterates the ones that were updated in step 2.

Distributed search is different

Distributed faceting in Solr adds a few steps:

  • All shards are issued the same request by a coordinating Solr. They perform step 1-4 above and returns the results to the coordinator.
  • The coordinator merges the shard-responses into one structure and extracts the Top-X terms from that.
  • For each Top-X term, its exact count is requested from the shards that did not deliver it as part of step a.

Standard Solr handles each exact-count separately by performing a mini-search for the term in the field. Sparse reuses the filled counters from step 2 (or repeats step 1-2 if the counter has been flushed from the cache) and simply locates the counters corresponding to the terms. Depending on the number of terms, sparse is much faster (think 5-10x) than standard Solr for this task. See Ten times faster for details.

What is cardinality?

Down to earth, cardinality just means how many there are of something. But what thing? The possibilities for faceting are many: Documents, fields, references and terms. To make matters worse, references and terms can be counted for the full corpus as well as just the search result.

  • Performance of standard Solr faceting is linear to the number of unique terms in the corpus in step 1 & 3 and linear to the number of references in the search result in step 2.
  • Performance of sparse faceting is (nearly) independent of the number of unique terms in the corpus and linear to the number of references in the search result in step 2 & 3.

Both standard Solr and sparse treats each field individually, so they both scale linear for that. The documents returned as part of base search are represented in a sparse structure itself (independent of sparse faceting) and scales with result set size. While it does take time to iterate over these documents, this is normally dwarfed by the other processing steps. Ignoring the devils in the details: Standard Solr facet performance scales with the full corpus size as well as the result size, while sparse faceting scales just with the result size.

Examples please!

  • For faceting on URL in the Danish Web Archive, cardinality is very high for documents (5bn), references (5bn) and terms (5bn) in the corpus. The overhead of performing a standard Solr faceting call is huge (hundreds of milliseconds), due to the high number of terms in the corpus. As the typical search results are quite a lot smaller than the full corpus, sparse faceting is very fast.
  • For faceting on host in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. However, the number of  terms (1.3m) is more modest. The overhead of performing a standard Solr faceting call is quite small (a few milliseconds), due to the modest number of terms; the time used in step 2, which is linear to the references, is often much higher than the overhead. Sparse faceting is still faster in most cases, but only by a few milliseconds. Not much if the total response time is hundreds of milliseconds.
  • For faceting on content_type_norm in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. It is extremely small for the number of unique terms, which is 10. The overhead of performing a standard Solr faceting call is practically zero; the time used in step 2, which is linear to the references, is often much higher than the overhead. Sparse faceting is never faster than Solr for this and as a consequence falls back to standard counting, making it perform at the same speed.
  • For faceting on author at the library index at Statsbiblioteket, cardinality is high for documents (15m), references (40m) and terms (9m) in the corpus. The overhead of performing a standard Solr faceting call is noticeable (tens of milliseconds), due to the 9m terms in the corpus. The typical search results is well below 8% of the full corpus, and sparse faceting is markedly faster than standard Solr. See Small Scale Sparse Faceting for details.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.
This entry was posted in eskildsen, Faceting, Low-level, Performance, Solr. Bookmark the permalink.

One Response to What is high cardinality anyway?

  1. Craig says:

    Thanks for this.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s