It was time for a little (nearly) real-world testing of a sparse facet counter for Lucene/solr (see Fast faceting with high cardinality and small result set for details). The first results are both very promising and somewhat puzzling.
The corpus was a copy of our production 50GB, 11M document index with library resources. The queries were taken randomly from the daily query log. Faceting was limited to just the author field, which has 7M unique values. The searcher was warmed up with hundreds of queries before testing. The tester ran with 2 threads against a 4 core machine and 500 searches were performed for each implementation.
In the graph, solr is standard Solr field faceting, exposed is our home brew (SOLR-2412) and sparse is our experimental home brew with sparse counting for small result sets. The red horizontal lines represents quartiles, with the max being replaced with the 95% for better graph scale. The horizontal black lines are medians.
The promising part is that the sparse counter has a much better median (16ms) than both solr (32ms) and exposed (29ms). Looking at the returned results, it seems clear that the vast majority of the queries only hits a fairly small part of the index, which benefits the sparse implementation. As they are real world queries, this is good news.
The performance of Solr field faceting and exposed is fairly similar, which is not very surprising as they work quite alike in this scenario. What is puzzling is that the maximum response time for both exposed and sparse is higher than solr‘s. The slowest response times not shown are 458ms for “have” with solr, 780ms for “to be or not to be” with exposed and 570ms for “new perspectives on native north america” with sparse. More testing is needed to determine if these are fluke results or if there is a general problem with worse outliers for exposed and sparse.
Randomizing queries make for great experimentation but poor comparisons of implementations. Fixing the order and number of queries tested (29086, should anyone wonder) resulted in measurements without the strange outliers. The measurements were done in order exposed, sparse, packed, solr and nofacet. Maximum response time were a little above 2 seconds for all the facet calls and in all cases caused by the query ‘a‘.
Introducing the JIRA issue SOLR-5894, with an accompanying patch for Solr 4.6.1. The patch only handles field cache based faceting on multi-valued fields right now, but that limitation was mainly to keep things simple. The sparse code is non-invasive and fits well within the way Solr performs field faceting. A quick experiment with 1852 fully warmed queries gave this:
Whoops. Forgot to include the baseline no-facets numbers. This changes the picture quite a bit. With a baseline median of 12 ms, sparse faceting overhead is only (24 ms – 12 ms) 12 ms and non-sparse is (36 ms – 12 ms) = 24 ms, which suspiciously (I triple checked the numbers) makes the sparse faceting overhead exactly half of non-sparse.