Small scale sparse faceting

While sparse faceting has a profound effect on response time in our web archive, we are a bit doubtful about the number of multi-billion document Solr indexes out there. Luckily we also have our core index at Statsbiblioteket, which should be a bit more representative of your everyday Solr installation: Single-shard, 50GB, 14M documents. The bulk of the traffic is user-issued queries, which involve spellcheck, edismax qf & pf on 30+ fields and faceting on 8 fields. In this context, the faceting is of course the focus.
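To make the setup a bit more concrete, here is a minimal Python sketch of what the parameters for such a request could look like. The parameters `defType`, `qf`, `pf`, `spellcheck`, `facet`, `facet.field` and the per-field `f.<field>.facet.limit` are standard Solr. Everything else is an assumption for illustration: the field names and boosts in `qf` and `pf` are made up (the real setup uses 30+ fields), the helper `build_search_params` is hypothetical, and `facet.sparse` is the switch as I understand it from the SOLR-5894 patch, so check the patch before relying on it.

```python
from urllib.parse import urlencode


def build_search_params(user_query, facet_fields, high_cardinality=(), sparse=True):
    """Build a query string for an edismax search with spellcheck and faceting.

    Field names and boosts are illustrative assumptions, not the production setup.
    """
    params = [
        ("q", user_query),
        ("defType", "edismax"),
        ("qf", "title^3 author^2 subject"),  # hypothetical fields and boosts
        ("pf", "title^5"),                   # hypothetical phrase boost
        ("spellcheck", "true"),
        ("facet", "true"),
    ]
    # One facet.field entry per facet field.
    params += [("facet.field", field) for field in facet_fields]
    # Per-field limit of 15 for the high-cardinality fields.
    params += [(f"f.{field}.facet.limit", "15") for field in high_cardinality]
    if sparse:
        # Assumed parameter name from the SOLR-5894 patch, not stock Solr.
        params.append(("facet.sparse", "true"))
    return urlencode(params)
```

Calling `build_search_params("hello world", ["subject", "author"], high_cardinality=["subject", "author"])` gives a query string that can be appended to a `/select?` URL.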

Of the 8 facet fields, 6 are low-cardinality and 2 are high-cardinality. Sparse was very recently enabled for the 2 high-cardinality ones, namely subject (4M unique values, 51M instances (note to self: 51M!? How did it get so high?)) and author (9M unique values, 40M instances).

To get representative measurements, the logged response times were extracted for the hours 07-22; there’s maintenance going on at night and it skews the numbers. Only user-entered searches with faceting were considered. To compare response times before and after enabling sparse faceting, the data for this Tuesday and last Tuesday were used.
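The extraction can be sketched roughly as below. This is not the script that was actually used. It assumes log lines that start with a `YYYY-MM-DD HH:MM:SS` timestamp and contain `params={...}`, `hits=` and `QTime=` in the usual Solr request log style, and it reads "07-22" as 07:00 up to, but not including, 22:00. The function name `daytime_facet_qtimes` is made up, and separating user-entered searches from other traffic is left out, as that depends on how the front end marks its requests.

```python
import re
from datetime import datetime

# Assumed log layout: timestamp first, then params={...}, hits=N and QTime=N.
LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"params=\{(?P<params>[^}]*)\}.*?"
    r"hits=(?P<hits>\d+).*?QTime=(?P<qtime>\d+)"
)


def daytime_facet_qtimes(lines, first_hour=7, last_hour=22):
    """Return (hits, qtime_ms) pairs for faceted requests logged in the daytime."""
    samples = []
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # not a request log line
        hour = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S").hour
        if not first_hour <= hour < last_hour:
            continue  # skip the nightly maintenance window
        if "facet=true" not in match["params"].split("&"):
            continue  # only requests with faceting enabled
        samples.append((int(match["hits"]), int(match["qtime"])))
    return samples
```

Running it once per log file (one for each Tuesday) gives two lists of samples that can be compared directly.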

50GB / 14M docs, logged QTimes from production, without (20140902) and with (20140909) sparse faceting

The performance improvement is palpable, with response time being halved compared to non-sparse faceting. Reading the logs closely, the time spent on faceting the high-cardinality fields is now in the single-digit milliseconds for nearly all queries. We’ll have to do some tests to see what stops the total response time from getting down to that level. I am guessing spellcheck.

As always, sparse faceting is readily available for the adventurous at SOLR-5894.

Update 20140911

To verify that last Tuesday was not a lucky shot, here are the numbers for the last 4 Wednesdays. Note that the number of queries/day is fairly low for the first two weeks. This is due to semester start. Also note that the 10^8 hits (basically the full document set) were removed as those were all due to the same query being repeated by a dashboard tool.
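Grouping by hit count and dropping a bucket can be sketched as follows. This is one possible reading of the above, not the actual chart script: it assumes each bucket is labelled by its upper bound, so that a result set of 14M hits lands in the 10^8 bucket, and it uses the median as the summary. The function names are made up for the example.

```python
from collections import defaultdict
from statistics import median


def bucket_exponent(hits):
    """Exponent of the power of ten just above the hit count.

    Assumed labelling: 950 -> 3 (below 10^3), 14_000_000 -> 8 (below 10^8).
    Zero hits are placed in the lowest bucket together with 1-9 hits.
    """
    return len(str(max(hits, 1)))


def median_qtime_per_bucket(samples, drop_exponents=()):
    """Median QTime per hit-count bucket, skipping the buckets in drop_exponents."""
    buckets = defaultdict(list)
    for hits, qtime in samples:
        exponent = bucket_exponent(hits)
        if exponent not in drop_exponents:
            buckets[exponent].append(qtime)
    return {exponent: median(qtimes) for exponent, qtimes in sorted(buckets.items())}
```

Passing `drop_exponents={8}` leaves out the 10^8 bucket, which here means the repeated dashboard query.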

50GB / 14M docs, logged timing from production. Only 20140909 is with sparse faceting

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.

5 Responses to Small scale sparse faceting

  1. Dmitry says:

    Toke,

    What are the facet.limit and facet.mincount you are using for these tests?

  2. Toke Eskildsen says:

    15 for the two high-cardinality ones and in the same ballpark for the rest. We did some testing and as long as we stayed in the lower hundreds, performance was fine. But this is with non-distributed faceting!

    Andy Jackson encountered the secondary phase performance problem for high limits: https://twitter.com/anjacks0n/status/509284768035262464

    I’ll see if I can find the time to do a test with variable limits on our web-archive to see what the penalty is with distributed sparse faceting.

  3. Dmitry says:

    Thanks!

    I’m quite curious about the limits in hundreds and distributed case.

    What also bugs me is the facet pagination. I sense it needs to recompute the entire set and then offset into it on every page, like in the case of paginating with QueryComponent before Hoss introduced fast pagination.

    Did you have a chance / curiosity to experiment with multithreaded faceting? https://issues.apache.org/jira/browse/SOLR-2548

  4. Toke Eskildsen says:

    I have done a few tests up to facet.limit=100 at https://sbdevel.wordpress.com/2014/09/11/even-sparse-faceting-is-limited/ and I plan to update later today with measurements for 1K, 10K & 100K: I had to increase the runtime for each test to 1 hour to get proper sampling for Solr fc faceting.

    I have not experimented with pagination, but I am fairly sure that it works the simple way by issuing progressively larger facet.limits from the shards. It seems that the principle of Hoss’ fast pagination can also be applied to faceting, but that also seems like quite a lot of work. This is independent of sparse faceting and could be a fun future project. Not a very high-priority one though, as we have no current use for the functionality.

    Multithreaded faceting should be simple to test for our web-archive and simple to enable for our core index (the one described above). Currently we have 14 cores & 16 CPUs on the web-machine and 1 core & 4 CPUs on the core index machines. Both settings should benefit from multithreaded faceting when they are not being hammered by concurrent searches. I’ll see if I can find the time to test it.

  5. Pingback: What is high cardinality anyway? | Software Development at Statsbiblioteket
