It does count as web scale if you actually index the web, right? The Danish part of the web at least, which is not at all easy to define. Nevertheless, Netarkivet currently holds 372 TB of harvested data, which translates to roughly 10 billion (10,000,000,000) indexable entities. As a test we have indexed about 1/100th of this in Solr, which gave us an index of about 200 GB; the full corpus should thus take about 20 TB. The question is: what kind of hardware does it take to search it all?
Due to Danish copyright and privacy laws, we cannot provide general access to the archive. This means that usage will be limited, which in turn means that the number of expected concurrent searches can be set to 1. Of course 2 people can search at the same time, but it is okay if that doubles the response times. We thus only need to look at latency – i.e. the time from issuing a search until a result is received.
Faceting and grouping by URL would be very useful when searching the archive, but as the only access currently offered is “give us the resource with this specific URL”, any search at all would be a great step forward.
Our knee-jerk reaction to hardware for search is to order Solid State Drives, but with a limited number of users, 20 TB of SSDs might be overkill. Luckily we do have an Isilon system with some terabytes to spare for testing as well as production. The Isilon is networked storage and very easy to expand, so it would be great if it could deliver the needed storage performance. Blade servers with virtual machines deliver the CPU horsepower, but as we shall see, it really is all about IOPS (In/Out Operations Per Second).
Building one single 20 TB index would be an intriguing endeavour but alas, Lucene/Solr has a hard limit of about 2 billion documents per index. Besides, a single index would be quite devastating to lose in case of index errors. Furthermore, there is a good chance that it would perform worse than multiple indexes, as many of the internal seek operations within a single index are sequential.
Some sort of distribution seems to be the way forward, and SolrCloud is the first candidate (sorry, ElasticSearch, maybe next time?). We need to estimate what kind of iron it takes to handle the 20TB, and we need to estimate a proper shard-count vs. shard-size ratio: is 100 shards of 200GB better than 10 shards of 2TB? Since our content is ever-growing and never-updating, we have the luxury of being able to optimize and freeze our shards as we build them, one at a time. No frustrating “index time vs. search time” prioritization here.
Real Soon Now, promise. First we need to determine the bottleneck for our box: CPU or IO? To do so, we measure cloud sizes vs. number of concurrent searches. By taking the response time and dividing it by (#threads * #shards), we get the average response time for the individual shards. If that number plateaus independently of #threads as well as #shards, we have found the maximum performance for our box.
For query testing we issue an OR-query with 2-4 random Danish words and light faceting. This should be representative of the real-world queries, at least with regard to IO. As we cannot flush the cache on the Isilon box (Operations would kill us if we managed to do so), each test is preceded by a short (~100 queries) warmup run and a hope that the numbers will be consistent. Collected statistics are based on the Solr-reported query times (QTime), collected across all threads. We checked, and those numbers are very close to the full response times.
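A minimal sketch of such a query generator, assuming a plain HTTP Solr endpoint; the word list, core name and facet field below are placeholders, not the actual test setup:

```python
import random
import urllib.parse

# Placeholder word list - the real tests draw 2-4 random Danish words
# from a much larger dictionary.
WORDS = ["hus", "cykel", "vejr", "arkiv", "avis", "skole", "musik"]

def build_query(base_url="http://localhost:8983/solr/netarchive/select"):
    """Build a Solr select URL: OR-query of 2-4 random words plus light faceting."""
    terms = random.sample(WORDS, random.randint(2, 4))
    params = {
        "q": " OR ".join(terms),     # e.g. "hus OR cykel OR avis"
        "facet": "true",
        "facet.field": "domain",     # light faceting on a single example field
        "facet.limit": 10,
        "rows": 10,
    }
    return base_url + "?" + urllib.parse.urlencode(params)

print(build_query())
```

Each generated URL would then be fetched by one of the test threads, with the reported QTime recorded per request.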
Confused? No worries, it gets a lot clearer when we have numbers.
The Isilon system uses spinning drives with a fair amount of combined SSD and RAM caching in front. Remember that the reported times are
response_time/(#threads * #shards) ms.
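As a trivial illustration, that normalization is just:

```python
# Normalize a measured cloud response time to the average time spent
# per shard per thread, as described above.
def per_shard_time(response_time_ms, threads, shards):
    """Average response time per individual shard, in ms."""
    return response_time_ms / (threads * shards)

# Example: 16 shards searched by 4 concurrent threads, 4,000 ms total
# -> 62.5 ms average per shard.
print(per_shard_time(4000, threads=4, shards=16))
```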
Testing with 200GB (75M documents) indexes, Median (50% percentile), remote Isilon
Testing with 200GB (75M documents) single segment indexes, 99% percentile, remote Isilon
Testing with 420GB (174M documents) single segment indexes, Median (50% percentile), remote Isilon
Testing with 420GB (174M documents) single segment indexes, 99% percentile, remote Isilon
The average response time/shard/thread gets markedly better when going from 1 to 2 threads. This makes sense, as a single thread alternates between CPU-wait and IO-wait, thus not utilizing either resource fully. More threads do not improve performance by much, and in many cases 32 threads are slower than 16 threads.
Our goal was to estimate the hardware requirements for the full 20TB of index files. We see from the tables above that the average response times for the individual shards grow with the number of shards and the number of threads, but not critically. The average of the medians for 8 shards of 200GB is 62 ms, while it is 73 ms for 16 shards; a 17% increase for double the amount of data. With a 20TB estimate, we would need 100 shards. Expecting about a 20% increase for each doubling of the shard count, the average shard response time would be approximately 115 ms, for a total of 11,500 ms.
This might be somewhat optimistic, as the cache works significantly better on smaller datasets, and because the average median is lower than the single-thread median, which is more representative of the expected usage pattern.
Looking at the average median for 420GB shards, we have 59 ms for 4-5 shards and 66 ms for 9 shards, a response time increase of only 11% for a doubling of the index size. Extrapolating from this, the full corpus of about 50 shards would have an average shard response time of approximately 85 ms, for a total of 4,250 ms. As this is less than half of the estimated time for the 200GB shards, it seems suspiciously low (or the estimate for the 200GB shards is too high).
Those were the medians. Repeating for the 99 percentile, we get 14,500 ms for 200GB shards and 5,500 ms for 420GB. Again, the 420GB measurements seem “too good” compared to the 200GB ones.
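The doubling-based extrapolation used above can be sketched as follows; the 20% growth factor per doubling is the estimate from the 200GB measurements, and the function itself is only an illustration of the reasoning:

```python
import math

# Project total response time by assuming the per-shard response time
# grows by a fixed factor for every doubling of the shard count.
def project_total_ms(measured_shards, measured_ms, target_shards,
                     growth_per_doubling=0.20):
    doublings = math.log2(target_shards / measured_shards)
    per_shard_ms = measured_ms * (1 + growth_per_doubling) ** doublings
    return per_shard_ms * target_shards

# 200GB shards: 73 ms measured at 16 shards, projected to 100 shards.
# Yields roughly 11,800 ms in total, in the same ballpark as the
# ~11,500 ms estimate above.
print(round(project_total_ms(16, 73, 100)))
```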
We do not have terabytes of unused space on Solid State Drives lying around, but a temporary reorganization of our test machine freed just above 400GB – enough for testing 2 * 200GB. It is important to keep in mind that this storage is both local & SSD, as opposed to the Isilon box, which is remote & spinning drives.
Note that the two rows in each test are for the same number of shards (2). This was done to see whether warming/caching improved response times.
Testing with 200GB (75M documents) indexes, Median (50% percentile), local SSD
| Shards | 1 | 2 | 4 | 8 | 16 | 32 | avg | total GB | avg full |
Testing with 200GB (75M documents) single segment indexes, 99% percentile, local SSD
| Shards | 1 | 2 | 4 | 8 | 16 | 32 | avg | total GB | avg full |
In contrast to the spinning drives, performance continues to improve markedly with threading, up to around 16 threads. Keeping in mind that this is a 16 core machine, this tells us that we do not have enough data to be sure whether the SSDs or the CPUs are the main bottleneck.
Note to self: It would be interesting to try and limit the number of active CPUs by using taskset.
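On Linux, the same effect can be had from inside the benchmark process via `os.sched_setaffinity` – a hedged sketch, with the core ids chosen purely as examples:

```python
import os

# Restrict the current process (pid 0) to a subset of CPU cores -
# the in-process equivalent of launching under `taskset -c`.
# Linux-only: sched_setaffinity is not available on all platforms.
def limit_to_cores(cores):
    os.sched_setaffinity(0, cores)
    return os.sched_getaffinity(0)

# Pin to core 0 only, as an example. The equivalent shell invocation
# would be: taskset -c 0 <benchmark command>
print(limit_to_cores({0}))
```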
One thing we can see is that the SSD-solution seems to be much better performing than the spinning drives one. But how much is much?
Putting it all together
With only enough free SSD for 2*200GB shards, we are forced to project the final load by threading. Our maintenance guys very temporarily freed 400GB of SSD on the Isilon box so we were able to see whether the local vs. remote parameter had a lot of influence on performance.
Keep in mind that this is a very loose projection and especially keep in mind that bigger shards are relatively faster while bigger test data are relatively slower. Still, the conclusion seems very clear: Using SSDs as storage is a whole other level of performance, when only a small part of the index can be cached in RAM.
Sweet sweet hardware
We all know that SSDs are expensive. Except that they are not. Yes, they cost 10 times as much as spinning drives when looking at capacity alone, but when the whole point of SSDs is speed, it is plain stupid to look only at capacity.
As our analysis above shows, scaling to satisfactory response times (median < 1 second in this case) with spinning drives would require 5-10 times the speed we currently get from Isilon. As the Isilon box is already using RAID for speed improvement (and crash recovery), this seems like quite a task. Compensating for spinning drives by increasing the amount of RAM for caching is unrealistic with 20TB of index.
It is simply cheaper and faster to base our search machine on SSDs. We can handle crashes by mounting each drive separately and making each shard the same size as a drive. If a drive fails, the corresponding shard backup (which will probably be located on the Isilon) takes over until the shard has been copied to a new SSD. Our hardware guys suggest:
- 1x 24×2.5″ Intel JBOD2000 2U RACK, Dual PSU, dual backplane
- 24x 1TB Samsung 840 EVO
- 1x Intel RS25GB008 JBOD controller with 8 external ports
- A server to hold the controller
Estimated price: 100-200,000 Danish kroner or 20-40,000 USD, depending on RAM (suggested amount: 256 or 512GB, as we expect to do heavy faceting and grouping). On top of that, we need a dedicated indexing machine, which can also be used for other tasks.
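The shard-per-drive failover idea could be sketched as below; the mount points and naming scheme are hypothetical examples, not the actual layout:

```python
from pathlib import Path

# Each shard lives on its own SSD mount, with a backup copy on the
# Isilon mount. If the local drive is gone, point Solr at the backup
# until the shard has been copied to a replacement SSD.
def shard_data_dir(shard, ssd_root="/mnt/ssd", backup_root="/mnt/isilon/backup"):
    local = Path(ssd_root) / f"ssd{shard}" / f"shard{shard}"
    backup = Path(backup_root) / f"shard{shard}"
    return local if local.is_dir() else backup

# Falls back to the Isilon copy if the SSD mount is missing.
print(shard_data_dir(7))
```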