There we were, minding other assignments and keeping a quarter of an eye on our web archive indexer and searcher. The indexer finished its 900GB Solr shard number 22 and the searcher was updated to a total of 19TB / 6 billion documents. With a bit more than 100GB free for disk cache (or about 1/2 percent of total index size), things were relatively unchanged, compared to ~120GB free a few shards ago. We expected no problems. As part of the index update, an empty Solr was created as entry-point, with a maximum of 3 concurrent connections, to guard against excessive memory use.
But something was off. User-issued searches seemed slower, some of them quite a lot slower. Time for a routine performance test and comparison with old measurements.
As the graph shows very clearly, response times rose sharply with the number of hits in a search in our 19TB index. At first glance that seems natural, but as the article Ten times faster explains, this should be a bell curve, not an ever-rising hill. The bell curve can be seen for the old 12TB index. Besides, those new response times were horrible.
Investigating the logs showed that most of the time was spent resolving facet terms for fine-counting. There are hundreds of those for the larger searches and the log said it took 70ms for each, neatly explaining total response times of 10 or 20 seconds. Again, this would not have been surprising if we were not used to much better numbers. See Even sparse faceting is limited for details.
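The arithmetic behind that explanation can be sketched in a few lines. The 70ms per term comes from the logs; the term counts of 150 and 300 are only illustrative guesses at what "hundreds" means for the larger searches.

```python
def fine_count_seconds(terms: int, ms_per_term: float) -> float:
    """Total time for resolving facet terms one after another, in seconds."""
    return terms * ms_per_term / 1000.0


# 70 ms/term is the logged figure; 150 and 300 terms are assumed examples.
for terms in (150, 300):
    total = fine_count_seconds(terms, 70)
    print(f"{terms} terms at 70 ms each: {total:.1f} s")
```

With lookups that slow, the term resolving alone accounts for the response times we saw, before any other part of the search is counted.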
A Systems guy turned off swap, then cleared the disk cache, as disk cache clearing has helped us before in similar puzzling situations. That did not help this time: Even non-faceted searches had outliers above 10 seconds, which is 10 times worse than with the 12TB index.
Due to unrelated circumstances, we then raised the number of concurrent connections for the entry-point-Solr from 3 to 50 and restarted all Solr instances.
Welcome back, great performance! You were sorely missed. Both the spread and the average for the 19TB index are larger than for its 12TB counterpart, but not drastically so.
So what went wrong?
- Did the limiting of concurrent searches at the entry-Solr introduce a half-deadlock? That seems unlikely, as the low-level logs showed the unusually high 70ms/term lookup time, and that lookup is done without contact to other Solrs.
- Did the Solr-restart clean up OS-memory somehow, leading to better overall memory performance and/or disk caching?
- Were the Solrs somehow locked in a state with bad performance? Maybe a lot of garbage collection? Their Xmx is 8GB, which has been fine since the beginning: As each shard runs in a dedicated Tomcat, the new shards should not influence the memory requirements of the Solrs handling the old ones.
We don’t know what went wrong and which action fixed it. If performance starts slipping again, we’ll focus on trying one thing at a time.
Why did we think clearing the disk cache might help?
It is normally advisable to use Huge Pages when running a large Solr server. Whenever a program requests memory from the operating system, this is done as pages. If the page size is small and the system has a lot of memory, there will be a lot of bookkeeping. It makes sense to use larger pages and have less bookkeeping.
Our indexing machine has 256GB of RAM, a single 32GB Solr instance and constantly starts new Tika processes. Each Tika process takes up to 1GB of RAM and runs for an average of 3 minutes. 40 of these are running at all times, so at least 10GB of fresh memory is requested from the operating system each minute.
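A rough sketch of that estimate: with 40 processes running at any time and an average lifetime of 3 minutes, about 40/3 new processes start each minute. The 1GB figure is the stated upper bound per process; the 0.75GB average is an assumption, picked only because it reproduces the 10GB per minute mentioned above.

```python
def churn_gb_per_minute(processes: int, gb_per_process: float,
                        lifetime_minutes: float) -> float:
    """Fresh memory requested per minute when `processes` short-lived
    workers are always running and each lives `lifetime_minutes` on average."""
    starts_per_minute = processes / lifetime_minutes
    return starts_per_minute * gb_per_process


# 40 Tika processes, 3 minute average lifetime (figures from the post).
print(f"upper bound: {churn_gb_per_minute(40, 1.0, 3):.1f} GB/min")
# 0.75 GB per process is an assumed average, not a measured one.
print(f"assumed avg: {churn_gb_per_minute(40, 0.75, 3):.1f} GB/min")
```

Either way, the machine hands out and reclaims memory on the order of its Solr heap size every few minutes, which is the allocation pattern that matters for the rest of the story.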
We observed that the indexing speed of the machine fell markedly after some time, down to 1/4 of the initial speed. We also observed that most of the processing time was spent in kernel space (the %sy in a Linux top). Systems theorized that we had a case of OS memory fragmentation due to the huge pages. They tried flushing the disk cache (echo 3 >/proc/sys/vm/drop_caches) to reset part of the memory, and performance was restored.
A temporary fix of clearing the disk cache worked fine for the indexer, but the lasting solution for us was to disable the use of huge pages on that server.
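For readers who want to check their own machines, here is a minimal sketch that reports the current huge page mode. It assumes the huge pages in question are Linux transparent huge pages, which this post does not actually specify, and it reads the kernel's control file, where the active mode is shown in square brackets (for example `always madvise [never]`).

```python
import re
from pathlib import Path

# Control file for transparent huge pages on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from e.g. 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


if __name__ == "__main__":
    try:
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    except OSError:
        # Not Linux, or the kernel was built without THP support.
        print("no readable transparent huge page control file here")
```

This only inspects the setting. Changing it requires root and is best left to whoever runs the server.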
The searcher got the same no-huge-pages treatment, which might have been a mistake. Contrary to the indexer, the searcher rarely allocates new memory and as such looks like an obvious candidate for using huge pages. Maybe our performance problems stemmed from too much bookkeeping of pages? Not fragmentation as such, but simply the size of the structures? But why would it help to free most of the memory and re-allocate it? Do the size and complexity of the page-tracking structures increase with use, rather than being constant? Seems like we need to level up in Linux memory management.
Note: I strongly advise against using repeated disk cache flushing as a solution. It treats the symptom rather than the cause and introduces erratic search performance. But it can be very useful as a poking stick when troubleshooting.
On the subject of performance…
The astute reader will have noticed that the performance-difference is strange at the 10³ mark. This is because the top of the bell curve moves to the right as the number of shards increases. See Even sparse faceting is limited for details.
In order to make the performance comparison apples-to-apples, the no_cache numbers were used. Between the 12TB and the 19TB mark, sparse facet caching was added, providing a slight speed-up to distributed faceting. Let’s add that to the chart:
Although the index size was increased by 50%, sparse facet caching kept performance at the same level or better. It seems that our initial half-dismissal of the effectiveness of sparse facet caching was not fair. Now we just need to come up with similar software improvements each month and we will never need to buy new hardware.
Do try this at home
If you want to try this on your own index, simply use sparse solr.war from GitHub.