While our Net Archive search performs satisfactory, we would like to determine how well-balanced the machine is. To recap, it has 16 cores (32 with Hyper Threading), 256GB RAM and 25 Solr shards @ 900GB. When running it uses about 150GB for Solr itself, leaving 100GB memory for disk cache.
Test searches are for 1-3 random words from a Danish dictionary, with faceting on 1 very large (billions of unique values, billions of references), 2 large fields (millions of unique values, billions of references) and 3 smaller fields (thousands of unique values, billions of references). Unless otherwise noted, searches were issued one request at a time.
Under Linux it is quite easy to control which cores a process utilizes, by using the command taskset. We tried scaling that by doing the following with different cores:
- Shut down all Solr instances
- Clear the disk cache
- Start up all Solr instances, limited to specific cores
- Run the standard performance test
In the chart below, ht means that there is the stated number of cores + their Hyper Threaded counterpart. In other words, 8ht means 8 physical cores but 16 virtual ones.
- Hyper Threading does provide a substantial speed boost.
- The differences between 8ht cores and 16 or 16ht cores are not very big.
Conclusion: For standard single searches, which is the design scenario, 16 cores seems to be overkill. More complex queries would likely raise the need for CPU though.
Changing the number of shards on the SolrCloud setup was simulated by restricting queries to run on specific shards, using the argument shards. This was not the best test as it measured the combined effect of the shard-limitation and the percentage of the index held in disk cache; e.g. limiting the query to shard 1 & 2 meant that about 50GB of memory would be used for disk cache per shard, while limiting to shard 1, 2, 3, & 4 meant only 25GB of disk cache per shard.
Note: These tests were done on performance degraded drives, so the actual response times are too high. The relative difference should be representative enough.
- Performance for 1-8 shards is remarkably similar.
- Going from 8 to 16 shards is 100% more data at half performance.
- Going from 16 to 24 shards is only 50% more data, but also halves performance.
Conclusion: Raising the number of shards further on an otherwise unchanged machine would likely degrade performance fast. A new machine seems like the best way to increase capacity, the less guaranteed alternative being more RAM.
Scaling disk cache
A Java program was used to reserve part of the free memory, by allocating a given amount of memory as long and randomly changing the content. This effectively controlled the amount of memory available for disk cache for the Solr instances. The Solr instances were restarted and the disk cache cleared between each test.
- A maximum running time of 10 minutes was far too little for this test, leaving very few measuring points for 54GB, 27GB and 7GB disk cache.
- Performance degrades exponentially when the amount of disk cache drops below 100GB.
Conclusion: While 110GB (0.51% of the index size) memory for disk cache delivers performance well within our requirements, it seems that we cannot use much of the free memory for other purposes. It would be interesting to see how much performance would increase with even more free memory, for example by temporarily reducing the number of shards.
Scaling concurrent requests
Due to limited access, we only need acceptable performance for one search at a time. Due to the high cardinality (~6 billion unique values) URL-field, the memory requirements for a facet call is approximately 10GB, severely limiting the maximum number of concurrent requests. Nevertheless, it is interesting to see how much performance changes when the number of concurrent requests rises. To avoid reducing the disk cache, we only tested with 1 and 2 concurrent requests.
Observations (over multiple runs; only one run in shown in the graph):
- For searches with small to medium result sets (aka “normal” searches), performance for 2 concurrent requests was nearly twice as bad as for 1 request.
- For searches with large result sets, performance for 2 concurrent requests were more than twice as bad as for 1 request. This is surprising as a slightly better than linear performance drop were expected.
Conclusion: Further tests seems to be in order due to the surprisingly bad scaling. One possible explanation is that memory speed is the bottleneck. Limiting that number to 1 or 2 and queuing further requests is the best option for maintaining a stable system due to memory overhead.
Admittedly, the whole facet-on-URL-thing might not be essential for the user experience. If we avoid faceting on that field and only facet on the more sane fields, such as host, domain and 3 smaller fields, we can turn up the number of concurrent requests without negative impact on disk cache.
- Mean performance with 1 concurrent request is 10 times better, compared to the full faceting scenario.
- From 1-4 threads, latency increases (bad) and throughput improves (good).
- From 4-32 threads, latency increases (bad) but throughput does not improve (double bad).
Conclusion: As throughput does not improve for more than 4 concurrent threads, using a limit of 4 seems beneficial. However, as we are planning to add faceting on links_domains and links_hosts, as well as grouping on URL, the measured performance is not fully representative of future use of search in the Net Archive.