For most searches in our Net Archive, we have acceptable response times, due to the use of sparse faceting with Solr. Unfortunately, as well as expectedly, some of the searches are slow: minutes-slow in the worst case. It is tied to the number of hits: Getting the top-25 most popular links from pages about hedgehogs takes a few hundred milliseconds. Getting the top-25 links from all pages from 2010 takes minutes. Visualised, the response times look like this:
Everything beyond 1M hits is slow, everything beyond 10M hits is coffee time. Okay for batch analysis, but we’re aiming for interactive use.
Get the probably correct top-X terms by sampling
Getting the top-X terms for a given facet can be achieved by sampling: Instead of processing all hits in the result set, some of them are skipped. The result set iterator conveniently provides an efficient advance-method, making this very easy. As we will only use sampling with larger result sets, there should be enough data to be quite sure that the top-25 terms are the correct ones, although their counts are somewhat off.
This of course all depends on how high X is in top-X, the concrete corpus, etc. The biggest danger is clusters of content in the corpus, which might be skipped entirely. Maybe the skipping could be made in small steps? Process 100 documents, skip 500, process the next 100…? Tests will have to be made.
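To make the "process 100, skip 500" idea concrete, here is a minimal Python sketch. It is not Solr code: the hits are modelled as a plain iterable where each hit is a list of facet terms, the function name `sampled_top_terms` and its parameters are made up for illustration, and consuming items from the iterator stands in for the advance-method of the real result set iterator.

```python
from collections import Counter
from itertools import islice


def sampled_top_terms(hits, top_x=25, process=100, skip=500):
    """Estimate the top-X terms by counting only some of the hits.

    hits: iterable of hits, each hit being a list of facet terms
          (an assumption; a real result set iterator yields document IDs).
    """
    counts = Counter()
    it = iter(hits)
    while True:
        # Process a small run of consecutive hits
        chunk = list(islice(it, process))
        if not chunk:
            break
        for terms in chunk:
            counts.update(terms)
        # Skip the next `skip` hits without looking at their terms
        next(islice(it, skip, skip), None)
    # The counts are only sampled, so return just the term names
    return [term for term, _ in counts.most_common(top_x)]
```

With these defaults only one sixth of the hits are visited. Whether that is enough to get the right terms depends on the corpus, as noted above.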
Fine count the top-X terms
With the correct terms isolated, precisely those terms can be fine-counted. This is nearly the same as vanilla distributed faceting, with the exception that all shards must fine-count all the top-X terms, instead of only the terms they had not already processed earlier.
Of course the fine counting could be skipped altogether, which would be faster and potentially very usable for interactive exploratory use, where the exact counts do not really matter.
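A rough sketch of that fine-counting step, again in plain Python rather than Solr: each shard is assumed to be an iterable of hits, each hit a list of terms, and `fine_count` is a hypothetical name. The point is only that every shard counts every candidate term over all of its hits, with no skipping.

```python
from collections import Counter


def fine_count(shards, candidates):
    """Exact counts for the candidate terms across all shards.

    shards: iterable of shards, each an iterable of hits (lists of terms).
    candidates: the term names found by the sampling phase.
    """
    wanted = set(candidates)
    # Start every candidate at zero so absent terms still show up
    totals = Counter({term: 0 for term in candidates})
    for shard in shards:
        for terms in shard:
            for term in terms:
                if term in wanted:
                    totals[term] += 1
    return totals
```

For example, `fine_count([[["a", "b"], ["a"]], [["a", "c"], ["b"]]], ["a", "b"])` counts "a" three times and "b" twice, and ignores "c" because it was not among the candidates.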
But there’s no guarantee?
No. Do remember that vanilla Solr distributed faceting is also best-effort, with the same guarantee as above: The terms are not guaranteed to be the correct ones, but their counts are.
Seems simple enough
Ticket #38 for sparse faceting has been opened and we could really use this in the Danish Net Archive Search. No promises though.
Knut Anton Bøckman mentioned on Twitter that Primo has a faceting mechanism that looks similar to my proposal. It seems that Primo uses the top-200 hits to select the facets (or rather terms?), then does a fine count on those.
It might work well to base the term selection on the top hits, rather than sampling randomly through all the hits, but I am afraid that 200 is so small a sample that some of the terms will differ from the right ones. I understand the need for a small number though: Getting the top-million hits or just top-hundred-thousand is costly.