Webscale in Danish


It does count as web scale if you actually index the web, right? The Danish part of the web at least, which is not at all easy to define. Nevertheless, Netarkivet currently holds 372 TB of harvested data, which translates to roughly 10 billion (10,000,000,000) indexable entities. As a test we have indexed about 1/100th of this in Solr, which gave us an index of about 200 GB; the full corpus should thus take about 20 TB. The question is: what kind of hardware does it take to search it all?

Soft requirements

Due to Danish copyright and privacy laws, we cannot provide general access to the archive. This means that usage will be limited, which in turn means that the number of expected concurrent searches can be set to 1. Of course two people can search at the same time, but it is okay if that doubles the response times. We thus only need to look at latency, i.e. the time from issuing a search until a result is received.

Faceting and grouping by URL would be very useful when searching the archive, but as the only current access option is "give us the resource with this specific URL", any search at all would be a great step forward.

Hardware possibilities

Our knee-jerk reaction to hardware for search is to order Solid State Drives, but with a limited number of users, 20 TB of SSDs might be overkill. Luckily we have an Isilon system with some terabytes to spare for testing as well as production. The Isilon is networked storage and very easy to expand, so it would be great if it could deliver the needed storage performance. Blade servers with virtual machines deliver the CPU horsepower, but as we shall see, it really is all about IOPS (Input/Output Operations Per Second).

SolrCloud

Building one single 20 TB index would be an intriguing endeavour, but alas, Lucene/Solr only supports about 2 billion documents per index. Besides, a single index would be quite devastating in case of index errors. Furthermore, there is a high chance that it would deliver lower performance than multiple indexes, as many of the internal seek operations within a single index are sequential.

Some sort of distribution seems to be the way forward, and SolrCloud is the first candidate (sorry, ElasticSearch, maybe next time?). We need to estimate what kind of iron it takes to handle the 20 TB, and we need to estimate a proper shard-count vs. shard-size ratio: are 100 shards of 200 GB better than 10 shards of 2 TB? Since our content is ever-growing and never-updating, we have the luxury of being able to optimize and freeze our shards as we build them, one at a time. No frustrating "index time vs. search time" prioritization here.
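As an aside, the "optimize and freeze" step is just Solr's ordinary optimize (force merge) run once a shard is full. A minimal sketch, with an invented host and core name:

```python
# Minimal sketch of the "build, optimize, freeze" workflow: once a shard has
# been filled, merge it down to a single segment and never write to it again.
# The host and core name below are made up for illustration.
import urllib.request

def freeze_shard(core_url: str) -> None:
    """Force-merge a finished shard to one segment via Solr's optimize command."""
    urllib.request.urlopen(
        f"{core_url}/update?optimize=true&maxSegments=1&waitSearcher=true")

# Example (hypothetical core name):
# freeze_shard("http://localhost:8983/solr/netarchive_shard07")
```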

Numbers, please

Real Soon Now, promise. First we need to determine the bottleneck for our box: CPU or IO? To do so, we measure different cloud sizes vs. different numbers of concurrent searches. By taking the response time and dividing it by (#threads * #shards), we get the average response time for the individual shards. If that number plateaus independently of #threads as well as #shards, we have found the maximum performance for our box.
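In code, the per-shard number is nothing more than this (a trivial sketch; the example values are hypothetical, not measurements):

```python
def per_shard_ms(response_ms: float, threads: int, shards: int) -> float:
    """Average response time attributed to a single shard for a single thread."""
    return response_ms / (threads * shards)

# Hypothetical example: a full response of 1000 ms measured with 2 threads
# against an 8-shard cloud corresponds to 1000 / (2 * 8) = 62.5 ms per shard.
print(per_shard_ms(1000, threads=2, shards=8))
```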

For query testing we issue OR-queries with 2-4 random Danish words and light faceting. This should be representative of real-world queries, at least with regard to IO. As we cannot flush the cache on the Isilon box (Operations would kill us if we managed to do so), each test is preceded by a short warmup run (~100 queries) and a hope that the numbers will be consistent. Collected statistics are based on the Solr-reported query times (QTime), collected for all threads. We checked, and those numbers are very close to the full response times.
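A rough sketch of what such a test query could look like against Solr's select handler is shown below. This is not our actual test harness: the word list, the Solr URL and the facet field name ("domain") are assumptions.

```python
import json
import random
import urllib.parse
import urllib.request

# A tiny stand-in for a real Danish word list.
DANISH_WORDS = ["hest", "cykel", "bibliotek", "sommer", "avis", "kage"]

def random_query() -> str:
    """2-4 random Danish words OR'ed together."""
    return " OR ".join(random.sample(DANISH_WORDS, random.randint(2, 4)))

def timed_search(core_url: str) -> int:
    """Issue one query with light faceting and return the Solr-reported QTime (ms)."""
    params = urllib.parse.urlencode({
        "q": random_query(),
        "rows": 10,
        "facet": "true",
        "facet.field": "domain",  # hypothetical facet field
        "facet.limit": 10,
        "wt": "json",
    })
    with urllib.request.urlopen(f"{core_url}/select?{params}") as response:
        return json.loads(response.read().decode("utf-8"))["responseHeader"]["QTime"]

# Warmup (~100 queries) before measuring, since the Isilon cache cannot be flushed:
# for _ in range(100):
#     timed_search("http://localhost:8983/solr/netarchive")
```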

Confused? No worries, it gets a lot clearer when we have numbers.

Spinning numbers!

The Isilon system used spinning drives with a fair amount of combined SSD and RAM caching in front. Remember that reported times are response_time/(#threads * #shards) ms.

Testing with 200GB (75M documents) indexes, Median (50th percentile), remote Isilon

Shards \ Threads:   1    2    4    8   16   32  |  avg  (times in ms)
1 28 10 4 2 1 1 7
2 99 47 35 34 34 34 47
3 77 42 36 38 41 45 47
4 67 47 47 54 52 56 54
5 67 57 54 54 53 56 57
6 64 56 53 58 59 58 58
7 63 56 53 50 55 55 55
8 72 60 58 60 62 61 62
9 71 60 55 57 58 56 60
10 71 58 57 57 56 54 59
11 74 65 61 60 58 62 63
12 70 61 58 61 60 58 61
12 73 63 65 67 70 69 68
13 83 66 67 74 70 65 71
14 78 71 68 71 72 80 73
15 85 75 70 70 70 78 75
16 74 68 68 67 66 75 70
17 78 68 70 70 70 79 72
16 77 71 69 70 70 80 73

Testing with 200GB (75M documents) single segment indexes, 99th percentile, remote Isilon

Shards \ Threads:   1    2    4    8   16   32  |  avg  (times in ms)
1 157 55 33 25 14 39 54
2 230 104 87 85 69 68 107
3 208 98 93 82 77 84 107
4 166 111 109 104 88 95 112
5 152 129 110 93 90 94 112
6 128 130 110 100 96 98 110
7 125 124 98 85 91 92 103
8 128 113 99 104 106 95 107
9 151 107 97 94 91 99 106
10 121 106 97 100 88 90 100
11 136 109 107 102 94 109 109
12 117 109 100 100 94 95 102
12 139 108 139 107 143 174 135
13 140 111 158 121 116 112 126
14 134 119 108 122 131 119 122
15 154 152 123 120 128 116 132
16 121 109 107 100 112 104 109
17 124 115 111 112 122 117 117
16 126 119 112 104 122 118 117

Testing with 420GB (174M documents) single segment indexes, Median (50th percentile), remote Isilon

Shards \ Threads:   1    2    4    8   16   32  |  avg  (times in ms)
1 17 9 6 3 2 1 6
2 81 50 42 42 41 44 50
3 91 52 44 45 44 51 54
4 83 57 53 54 52 56 59
5 77 59 54 54 55 56 59
6 79 59 55 53 54 53 59
7 76 61 57 57 55 55 60
8 77 61 56 58 59 60 62
9 75 64 64 64 63 63 66

Testing with 420GB (174M documents) single segment indexes, 99th percentile, remote Isilon

Shards \ Threads:   1    2    4    8   16   32  |  avg  (times in ms)
1 154 85 70 38 18 15 63
2 255 126 105 97 78 72 122
3 207 165 103 86 79 79 120
4 181 118 102 98 92 82 112
5 164 114 100 91 88 83 107
6 168 123 95 90 80 78 106
7 146 117 93 80 81 85 100
8 150 108 98 91 83 92 104
9 134 114 102 103 97 107 109

Isilon observations

The average response time/shard/thread gets markedly better when going from 1 to 2 threads. This makes sense, as a single thread alternates between CPU-wait and IO-wait, thus never utilizing both resources fully. More threads do not improve performance by much, and in quite a few cases 32 threads are slower than 16 threads.

Our goal was to estimate the hardware requirements for the full 20 TB of index files. We see from the tables above that the average response times for the individual shards grow with the number of shards and the number of threads, but not critically. The average of the medians for 8 shards of 200 GB is 62 ms, while it is 73 ms for 16 shards; that is a 17% increase for double the amount of data. With a 20 TB estimate, we would need 100 shards. Expecting about a 20% increase for each doubling of the shard count, the average shard response time would be approximately 115 ms, for a total of 11,500 ms.
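The extrapolation is a simple doubling model; a back-of-the-envelope sketch using the numbers above (73 ms at 16 shards, roughly 20% growth per doubling):

```python
import math

def projected_total_ms(base_shards: int, base_ms: float,
                       growth_per_doubling: float, target_shards: int) -> float:
    """Project the total response time, assuming a fixed per-shard growth factor
    for every doubling of the shard count."""
    doublings = math.log2(target_shards / base_shards)
    per_shard_ms = base_ms * growth_per_doubling ** doublings
    return per_shard_ms * target_shards

# 200GB shards: 73 ms at 16 shards, ~20% per doubling, 100 shards for 20 TB
print(projected_total_ms(16, 73, 1.20, 100))  # roughly 11,000-12,000 ms
```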

This might be somewhat optimistic, both because the cache works significantly better on smaller datasets and because the average of the medians is lower than the single-thread median, which is more representative of the expected usage pattern.

Looking at the average median for the 420 GB shards, we have 59 ms for 4-5 shards and 66 ms for 9 shards, a response time increase of only 11% for a doubling of the index size. Extrapolating from this, the full corpus of about 50 shards would have an average shard response time of approximately 85 ms, for a total of 4,250 ms. As this is less than half of the estimated time for the 200 GB shards, it seems suspiciously low (or the estimate for the 200 GB shards is too high).

Those were medians. Repeating the exercise for the 99th percentile, we get 14,500 ms for the 200 GB shards and 5,500 ms for the 420 GB shards. Again, the 420 GB measurements seem "too good" compared to the 200 GB ones.

Solid numbers

We do not have terabytes of unused space on Solid State Drives lying around, but a temporary reorganization on our test machine freed just over 400 GB, enough for testing 2 * 200 GB. It is important to keep in mind that this storage is both local and SSD, as opposed to the Isilon box, which is remote and uses spinning drives.

Note that the two rows in each test are for the same number of shards (2). This was done to see whether warming/caching improved response times.

Testing with 200GB (75M documents) indexes, Median (50th percentile), local SSD

Shards \ Threads:   1    2    4    8   16   32  |  avg  |  total GB  |  avg full  (times in ms)
2 20 8 4 3 2 3 7 400 13
2 16 8 4 3 2 2 6 400 12

Testing with 200GB (75M documents) single segment indexes, 99th percentile, local SSD

Shards \ Threads:   1    2    4    8   16   32  |  avg  |  total GB  |  avg full  (times in ms)
2 64 35 31 14 10 8 27 400 54
2 36 30 26 10 8 13 21 400 41

Solid observations

As opposed to spinning drives, performance continues to improve markedly with threading, up to around 16 threads. Keeping in mind that this is a 16-core machine, this tells us that we do not have enough data to be sure whether the SSDs or the CPUs are the main bottleneck.

Note to self: It would be interesting to try and limit the number of active CPUs by using taskset.
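One way such a test could be wired up (a sketch only; the start command and paths are assumptions for a Solr 4.x style setup):

```python
import subprocess

# Start the Solr JVM pinned to 8 of the 16 cores with taskset, then rerun the
# thread-scaling tests. Core range, heap size and paths are placeholders.
CORES = "0-7"
solr_start = ["java", "-Xmx16g", "-jar", "start.jar"]
subprocess.Popen(["taskset", "-c", CORES] + solr_start, cwd="/path/to/solr/example")
```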

One thing we can see is that the SSD solution seems to perform much better than the spinning-drive one. But how much is much?

Putting it all together

With only enough free SSD space for 2*200GB shards, we are forced to project the final load by scaling up the thread count. Our maintenance guys very temporarily freed 400GB of SSD on the Isilon box, so we were also able to see whether the local vs. remote parameter had a large influence on performance.

[Chart: 2*200GB shards]

[Chart: Threading with 2*200GB shards]

Keep in mind that this is a very loose projection, and especially keep in mind that bigger shards are relatively faster while bigger total amounts of data are relatively slower. Still, the conclusion seems very clear: using SSDs as storage is a whole other level of performance when only a small part of the index can be cached in RAM.

Sweet sweet hardware

We all know that SSDs are expensive. Except that they are not. Yes, they cost 10 times as much as spinning drives when looking at capacity alone, but when the whole point of SSDs is speed, it is plain stupid to look only at capacity.

As our analysis above shows, scaling to satisfactory response times (median < 1 second in this case) with spinning drives would require 5-10 times the speed we currently get from the Isilon. As the Isilon box already uses RAID for speed improvement (and crash recovery), this seems like quite a task. Compensating for spinning drives by increasing the amount of RAM used for caching is unrealistic with 20 TB of index.

It is simply cheaper and faster to base our search machine on SSDs. We can handle crashes by mounting each drive separately and keeping each shard the same size as a drive. If a drive fails, the corresponding shard backup (which will probably be located on the Isilon) takes over until the shard has been copied to a new SSD; see the sketch after the hardware list below. Our hardware guys suggest:

  • 1x 24×2.5″ Intel JBOD2000 2U RACK, Dual PSU, dual backplane
  • 24x 1TB Samsung 840 EVO
  • 1x Intel RS25GB008 JBOD controller with 8 external ports
  • A server to hold the controller
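The failover idea mentioned above could look roughly like this. It is only a sketch, assuming each core reaches its data directory through a symlink we control; all names, paths and the host are invented.

```python
import os
import urllib.request

SOLR = "http://localhost:8983/solr"  # assumed host

def failover_to_backup(core: str, ssd_data: str, isilon_backup: str, data_link: str) -> None:
    """If the SSD copy of a shard has vanished (dead drive), point the core's
    data symlink at the Isilon backup and ask Solr to reload the core."""
    if not os.path.isdir(ssd_data):           # drive (and thus the shard) is gone
        if os.path.islink(data_link):
            os.unlink(data_link)
        os.symlink(isilon_backup, data_link)  # serve from the backup copy for now
        urllib.request.urlopen(f"{SOLR}/admin/cores?action=RELOAD&core={core}")

# Hypothetical usage:
# failover_to_backup("shard07", "/mnt/ssd07/shard07/data",
#                    "/mnt/isilon/backup/shard07/data",
#                    "/opt/solr/cores/shard07/data")
```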

Estimated price: 100-200,000 Danish kroner or 20-40,000 USD, depending on RAM (suggested amount: 256 or 512 GB, as we expect to do heavy faceting and grouping). On top of that we need a dedicated indexing machine, which can also be used for other tasks.


2 Responses to "Webscale in Danish"

  1. Toke Eskildsen Says:

Short update: the hardware has been bought: a search box that will also index until we have generated the first full collection of indexes, plus an index-build box which is likely to be used for other jobs later on.

It takes 7-10 days to build a 1 TB shard with the current setup at the full capacity of one of the machines. Simple math says 20 shards * 10 days / 2 machines ≈ 3 months until we are ready. The logistics code is in place and we have performed several test runs, so full-scale index generation should start next week.

  2. Ten times slower | Software Development at Statsbiblioteket Says:

    […] At Statsbiblioteket we are building a SolrCloud web index of our harvested material. 9 months ago that was 372 TB or 10 billion entities. It has probably passed 400 TB by now. Our insane plan was to put it all on a single machine and to do faceted searches. Because it made sense and maybe a little because it is pure awesome to handle that amount of information from a single box. Read about our plans in the post Webscale in Danish. […]
