Sparse facet counting on a real index

It was time for a little (nearly) real-world testing of a sparse facet counter for Lucene/Solr (see Fast faceting with high cardinality and small result set for details). The first results are both very promising and somewhat puzzling.

The corpus was a copy of our production 50GB, 11M document index with library resources. The queries were taken randomly from the daily query log. Faceting was limited to just the author field, which has 7M unique values. The searcher was warmed up with hundreds of queries before testing. The tester ran with 2 threads against a 4-core machine, and 500 searches were performed for each implementation.

Solr vs. exposed vs. sparse

In the graph, solr is standard Solr field faceting, exposed is our home brew (SOLR-2412) and sparse is our experimental home brew with sparse counting for small result sets. The red horizontal lines represent quartiles, with the max being replaced with the 95th percentile for better graph scale. The horizontal black lines are medians.
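For readers who want to build a similar summary from their own timings, here is a minimal sketch. It assumes a plain list of response times in milliseconds; the function name `summarize` is made up for this example and is not part of the test setup used for the graph.

```python
import statistics


def summarize(times_ms):
    """Return min, quartiles and 95th percentile for a list of timings."""
    # n=4 gives the three cut points: first quartile, median, third quartile
    q1, median, q3 = statistics.quantiles(times_ms, n=4)
    # n=20 gives cut points at 5% steps; the last one is the 95th percentile
    p95 = statistics.quantiles(times_ms, n=20)[-1]
    return {"min": min(times_ms), "q1": q1, "median": median,
            "q3": q3, "p95": p95}


if __name__ == "__main__":
    # Synthetic timings, only to show the call
    print(summarize([float(ms) for ms in range(1, 101)]))
```

Using the 95th percentile instead of the maximum keeps a handful of slow queries from squashing the rest of the plot.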

The promising part is that the sparse counter has a much better median (16ms) than both solr (32ms) and exposed (29ms). Looking at the returned results, it seems clear that the vast majority of the queries only hit a fairly small part of the index, which benefits the sparse implementation. As they are real-world queries, this is good news.
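To illustrate why small result sets help, here is a rough sketch of the general idea behind sparse counting: remember which counters were touched, so that extraction and cleanup only visit those instead of all 7M. This is not the actual implementation; the class name, the `max_tracked` threshold and the fallback to a full scan are assumptions made for the example.

```python
class SparseCounter:
    """Counts facet-term ordinals and remembers which counters were touched."""

    def __init__(self, num_terms, max_tracked):
        self.counts = [0] * num_terms   # one counter per unique term
        self.touched = []               # ordinals seen while tracking is on
        self.max_tracked = max_tracked  # hypothetical cut-off for tracking
        self.sparse = True

    def increment(self, ordinal):
        if self.sparse and self.counts[ordinal] == 0:
            if len(self.touched) < self.max_tracked:
                self.touched.append(ordinal)
            else:
                # Too many distinct terms: give up tracking, scan everything
                self.sparse = False
        self.counts[ordinal] += 1

    def non_zero(self):
        """Return (ordinal, count) pairs for all terms with a count above 0."""
        if self.sparse:
            return [(o, self.counts[o]) for o in self.touched]
        return [(o, c) for o, c in enumerate(self.counts) if c > 0]

    def clear(self):
        """Reset for the next query; cheap when only a few counters were hit."""
        if self.sparse:
            for o in self.touched:
                self.counts[o] = 0
        else:
            self.counts = [0] * len(self.counts)
        self.touched = []
        self.sparse = True
```

With a small result set, both `non_zero` and `clear` do work proportional to the number of touched terms. With a large result set, the counter behaves like a plain array plus a little bookkeeping.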

The performance of Solr field faceting and exposed is fairly similar, which is not very surprising as they work quite alike in this scenario. What is puzzling is that the maximum response time for both exposed and sparse is higher than solr’s. The slowest response times, not shown in the graph, are 458ms for “have” with solr, 780ms for “to be or not to be” with exposed and 570ms for “new perspectives on native north america” with sparse. More testing is needed to determine if these are fluke results or if there is a general problem with worse outliers for exposed and sparse.

Update 20140320

Randomizing queries makes for great experimentation but poor comparisons of implementations. Fixing the order and number of queries tested (29086, should anyone wonder) resulted in measurements without the strange outliers. The measurements were done in the order exposed, sparse, packed, solr and nofacet. Maximum response times were a little above 2 seconds for all the facet calls and were in all cases caused by the query ‘a’.
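A fixed-order comparison can be sketched roughly like this. The `search` callables are hypothetical stand-ins for whatever issues a query against each implementation; the names `time_queries` and `compare` are made up for the example and do not come from the actual tester.

```python
import time


def time_queries(search, queries):
    """Run queries in the given order; return (query, milliseconds) pairs."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        timings.append((query, (time.perf_counter() - start) * 1000.0))
    return timings


def compare(implementations, queries):
    """Time every implementation against the same queries in the same order."""
    fixed = list(queries)  # freeze the order once, reuse it for all runs
    return {name: time_queries(search, fixed)
            for name, search in implementations.items()}
```

Since every implementation sees identical input, a slow query such as ‘a’ shows up in all runs instead of randomly hitting only one of them.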

Sparse faceting, fixed query order and count

Update 20140321

Introducing the JIRA issue SOLR-5894, with an accompanying patch for Solr 4.6.1. The patch only handles field cache based faceting on multi-valued fields right now, but that limitation was mainly to keep things simple. The sparse code is non-invasive and fits well within the way Solr performs field faceting. A quick experiment with 1852 fully warmed queries gave this:

author_7M_tags_1852_logged_queries_warmed

Update 20140324

Whoops. Forgot to include the baseline no-facets numbers. This changes the picture quite a bit. With a baseline median of 12 ms, sparse faceting overhead is only (24 ms – 12 ms) = 12 ms and non-sparse is (36 ms – 12 ms) = 24 ms, which suspiciously (I triple checked the numbers) makes the sparse faceting overhead exactly half of non-sparse.
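The arithmetic spelled out, using only the three medians quoted above. The function name `facet_overhead` is made up for this example.

```python
def facet_overhead(median_ms, baseline_ms):
    """Faceting overhead is the median with faceting minus the no-facet median."""
    return median_ms - baseline_ms


baseline_ms = 12    # median without faceting
sparse_ms = 24      # median with sparse faceting
non_sparse_ms = 36  # median with non-sparse faceting

sparse_overhead = facet_overhead(sparse_ms, baseline_ms)
non_sparse_overhead = facet_overhead(non_sparse_ms, baseline_ms)
overhead_ratio = sparse_overhead / non_sparse_overhead

print(sparse_overhead, non_sparse_overhead, overhead_ratio)
```

Comparing the raw medians (24 ms against 36 ms) suggests sparse takes two thirds of the time, while comparing overheads shows the faceting part itself is halved.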

author_7M_tags_1852_logged_queries_warmed_with_base
