Ten times slower

I jumped the gun on our current web index status with our whale hunting safari, but it turns out that there are other fish to kill (sorry, whales are not fish and I’ll stop with the nautical metaphors immediately). This time we’re talking faceting on billions of documents.

At Statsbiblioteket we are building a SolrCloud web index of our harvested material. Nine months ago that was 372 TB or 10 billion entities. It has probably passed 400 TB by now. Our insane plan was to index it, put the indexes on a single machine and do faceted searches on it. Because it made sense, and maybe a little because it is pure awesome to handle that amount of information from a single box. Read about our plans in the post Webscale in Danish.

It has been about three months since we last checked how things were with our searchable web archive at Statsbiblioteket. Back then it contained 4 shards for a total of 3.6 TB or 1.2 billion documents, on our single 16-core machine with 256 GB of RAM. Simple searches were fast and faceting on URL (nearly unique for every document) equally so, even when we lowered the amount of RAM to 80 GB, which meant that only 1% of the total index data could be cached in RAM. The two graphs below illustrate the state back then.

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

Sparse faceting is markedly better than stock Solr under our test; performance is very satisfactory with most response times well below 1 second. It is important to note that the distributed faceting calls are executed with a single thread in each shard for this setup. This means that only 4 CPUs were fully utilized at a time during a single faceted search.

Yesterday we reached 12 shards for a total of 10.7 TB of index data in 3.6 billion documents. Turning off the artificial RAM limitation left us with 140 GB of RAM for disk cache, or 1.3% of the full index size. So more RAM for caching than we had with 4 shards, and still plenty of CPU power, as the machine has 16 cores (*2 if you count HyperThreading). Of course the merging gets a little heavier, but not too much. In an ideal world this would mean that the performance would be unchanged, right?

No, we did not think so either. So how bad is it?


256GB RAM, 1 thread, 12 shards, 10TB, random words, Solr & sparse faceting on URL

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL

Ouch. 2½ second median? That’s bad! But wait a minute… Doesn’t that percentile plot look like a whale? And how come this is so much slower than our 4 shard setup, and how come faceting is faster on searches with 100,000 hits than on searches with 10,000? Time to check the logs.

Distributed Solr faceting is two-phase. The first phase is standard faceting: find the top-X facet terms for a given search. The merger then collects the results from the shards, sums them and extracts the collective top-X terms. The second phase asks the shards for the counts for those terms, to get correct totals, as a given term might not have been returned by all shards in the first phase. The merger is smarter than that, but the principle holds.
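To make the principle concrete, here is a toy sketch of the two phases in plain Java. This is not Solr’s actual merge code, just an illustration of the flow: local top-X from each shard, merge, then fetch the missing exact counts.

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy illustration of two-phase distributed faceting -- not Solr's merge code,
// just the principle described above.
public class TwoPhaseFacetSketch {

  // Phase 1: each shard has returned its local top-X terms with local counts.
  // The merger sums them and keeps the collective top-X terms.
  static List<String> collectiveTopX(List<Map<String, Long>> phase1Results, int topX) {
    Map<String, Long> merged = new HashMap<>();
    phase1Results.forEach(shard -> shard.forEach((term, count) -> merged.merge(term, count, Long::sum)));
    return merged.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(topX)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  // Phase 2: for every collective top term a shard did not report in phase 1,
  // ask that shard for the exact count, so the final numbers are correct.
  static Map<String, Long> refine(List<String> topTerms,
                                  List<Map<String, Long>> phase1Results,
                                  List<Function<String, Long>> shardCountLookup) {
    Map<String, Long> finalCounts = new HashMap<>();
    for (int shard = 0; shard < phase1Results.size(); shard++) {
      for (String term : topTerms) {
        Long count = phase1Results.get(shard).get(term);   // already known from phase 1
        if (count == null) {
          count = shardCountLookup.get(shard).apply(term); // the expensive phase 2 request
        }
        finalCounts.merge(term, count, Long::sum);
      }
    }
    return finalCounts;
  }
}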

It seems logical that the second phase should be faster than the first: It just has to calculate counts for a limited number of specified terms instead of performing full facet processing. Let’s go back to the logs and plot the response times for the first and second phase separately. Note that these numbers are from the 12 Solr logs at shard-request level and not at the full distributed call level.

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

There goes that logical conclusion. The second phase takes more than 10 times as long as the first phase! What is happening? We need to dig deeper and look at the surprisingly simple code for the second phase:

private NamedList getListedTermCounts(String field, String termList, DocSet base) throws IOException {
  FieldType ft = searcher.getSchema().getFieldType(field);
  // The merger sends the terms it wants exact counts for as a comma-separated list
  List<String> terms = StrUtils.splitSmart(termList, ",", true);
  NamedList<Integer> res = new NamedList<Integer>();
  for (String term : terms) {
    String internal = ft.toInternal(term);
    // Count = size of the intersection between the query result (base)
    // and the documents matching field:term
    int count = searcher.numDocs(new TermQuery(new Term(field, internal)), base);
    res.add(term, count);
  }
  return res;
}

We have the result of the query in the bitset base; with shards of 300 million documents, that is a rather large bitset. For each term in the second phase facet request, a simple search is performed for facet_field:specific_term. This results in another bitset. The number of intersecting bits in these two sets is the count for the term.
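In essence, each of those counts boils down to something like the sketch below. It is a simplified stand-in for what searcher.numDocs does (Solr uses its own DocSet/bitset implementations, not raw long arrays), but it shows why the operation has to stream through the whole bitmap every time.

// Simplified stand-in for the counting inside searcher.numDocs(query, base):
// AND the two bitmaps word by word and count the surviving bits. With ~300
// million documents per shard, each long[] is roughly 37 MB, and every call
// touches all of it.
static long intersectionCount(long[] baseBits, long[] termBits) {
  long count = 0;
  for (int i = 0; i < baseBits.length; i++) {
    count += Long.bitCount(baseBits[i] & termBits[i]);
  }
  return count;
}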

The problem here is that we are doing intersections of very large bitsets. Potentially they can be represented by compact hashmaps or other structures, but the log tells us that this still takes quite a lot of time for a corpus of this size. Time that grows as the number of shards grows.

Guessing time: If at least one of the bitsets is a bitmap with 1 bit per document in the index, it takes up about 40 MB of heap, which is accessed when doing the intersection. If there are 20 terms in the request (quite likely, as we ask for top-20 on URL), this is done 20 times. So at least 800 MB of memory is accessed. With 12 shards doing faceting in parallel, this is just shy of 10 GB of memory. It is hard to measure memory wait time, but a likely culprit is that we are simply waiting for main memory.
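The same back-of-the-envelope guess as code, using the rounded numbers from above:

// Back-of-the-envelope for the memory traffic guessed at above.
public class FacetMemoryGuess {
  public static void main(String[] args) {
    long bitmapMB = 40;                     // ~300M docs / 8 bits per byte ≈ 37.5 MB, call it 40 MB
    long perShardRequestMB = bitmapMB * 20; // top-20 terms requested -> ~800 MB touched per shard
    double totalGB = perShardRequestMB * 12 / 1024.0; // 12 shards in parallel -> just shy of 10 GB
    System.out.printf("%d MB per bitmap, %d MB per shard request, %.1f GB in total%n",
        bitmapMB, perShardRequestMB, totalGB);
  }
}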

The shape of the phase 2 plot is easily explainable: With 10 to 10,000 hits, all shards must provide counts for nearly the full set of terms. When we get above 100,000 hits, the chances of any given shard having already delivered the count for part of the top-X terms rise; when a count has already been delivered, it will not be requested again in the second phase, so the workload gets smaller.

So what to do about that? The easiest thing is to skip the second phase completely. That would give us the great response times instantly, at the cost of precision: The counts for the individual terms in the facet might be a bit off. But maybe there is a way to get the fast speed and the precise counts? The full facet counting out in the shards in the first phase was quite a lot faster, so if we do that again (or cache the old result), we have counts for all terms in the facet. For each specified term, we could get its ordinal (for example by binary search, meaning 36 lookups in our concrete ordinal-term map) and with that, we could get the count directly. Ordinal-term lookups are somewhat heavy, as the index data on storage needs to be accessed, so it remains to be seen if this way of handling the second phase is faster than the standard one. Time to code again.
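A rough sketch of that idea, with hypothetical names (this is not existing Solr code): assume we still have, or re-build and cache, the full per-term counts from the first phase for the current query, plus access to the field’s sorted ordinal-term map. The second phase then becomes one ordinal lookup and one array read per term instead of a bitset intersection.

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.common.util.NamedList;

// Sketch of an ordinal-based second phase (hypothetical method and parameter
// names). fullFacetCounts holds the per-ordinal counts for the current query,
// as produced by a full first-phase facet run (or taken from a cache).
private NamedList<Integer> getListedTermCountsByOrdinal(
    SortedSetDocValues ordinalTermMap, int[] fullFacetCounts, List<String> terms)
    throws IOException {
  NamedList<Integer> res = new NamedList<>();
  for (String term : terms) {
    // Binary search in the ordinal-term map: ~36 lookups for our URL field
    long ord = ordinalTermMap.lookupTerm(new BytesRef(term));
    res.add(term, ord < 0 ? 0 : fullFacetCounts[(int) ord]);
  }
  return res;
}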

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.