Samsung 840 EVO degradation

Rumour has it that our 25 lovely 1TB Samsung 840 EVO drives in our Net Archive search machine does not perform well, when data are left untouched for months. Rumour in this case being solid confirmation with a firmware fix from Samsung. In our setup, index shards are written once, then left untouched for months or even years. Exactly the circumstances that trigger the performance degradation.

Measurements, please!

Our 25 shards are build over half a year, giving us an unique opportunity to measure drives in different states of decay. First experiment was very simple: Just read all the data from the drive sequentially by issuing cat index/* > /dev/null and plot the measured time spend with the age of the files on the x axis. That shows the impact on bulk read speed. Second experiment was to issue Solr searches to each shard in isolation, testing search speed one drive at a time. That shows the impact on small random reads.

Samsung 840 EVO performance degradationFor search as well as bulk read, lower numbers are better. The raw numbers for 7 drives follows:

Months 25% search median search 75% search 95% search mean search Bulk read hours
0.9 36 50 69 269 112 1.11
2.0 99 144 196 486 203 3.51
3.0 133 198 269 590 281 6.06
4.0 141 234 324 670 330 8.70
5.1 133 183 244 520 295 5.85
6.0 106 158 211 433 204 5.23
7.0 105 227 333 703 338 10.49

Inspecting the graph, it seems that search performance quickly gets worse until the data are about 4 months old. After that it stabilizes. Bulk reading on the other hand continue to worsen during all 7 months, but that has little relevance for search.

The Net Archive search uses SolrCloud for querying the 25 shards simultaneously and merging the result. We only had 24 shards at the previous measurement 6 weeks ago, but the results should still be comparable. Keep in mind that our goal is to have median response times below 2 seconds for all standard searches; searches matching the full corpus and similar are allowed to take longer.

Performance for full searchThe distinct hill is present both for the old and the new measurements: See Even sparse faceting is limited for details. But the hill has grown for the latest measurements; response times has nearly doubled for the slowest searches. How come it got that much worse during just 6 weeks?

Theory: In a distributed setup, the speed is dictated by the slowest shard. As the data gets older on the un-patched Samsung drives, the chances of having slow reads rises. Although the median response time for search on a shard with 3 month old data is about the same as one with 7 month old data, the chances of very poor performance searches rises. As the whole collection of drives got 6 weeks older, the chances of not having poor performance from at least one of the drives during a SolrCloud search fell.

Note how our overall median response time actually got better with the latest measurement, although the mean (average) got markedly worse. This is due to the random distribution of result set sizes. The chart paint a much clearer picture.

Well, fix it then!

The good news is that there is a fix from Samsung. The bad news is that we cannot upgrade the drives using the controller on the server. Someone has to go through the process of removing them from the machine and perform the necessary steps on a workstation. We plan on doing this in January and besides the hassle and the downtime, we foresee no problems with it.

However, as the drive bug is for old data, a rewrite of all the 25*900GB files should freshen the charges and temporarily bring them back to speed. Mads Villadsen suggested using dd if=somefile of=somefile conv=notrunc, so let’s try that. For science!

*crunching*

It took nearly 11 hours to process drive 1, which had the oldest data. That fits well with the old measurement of bulk speed for that drive, which was 10½ hour for 900GB. After that, bulk speed increased to 1 hour for 900GB. Reviving the 24 other drives was done in parallel with a mean speed of 17MB/s, presumably limited by the controller. Bulk read speeds for the reviewed drives was 1 hour for 900GB, except for drive 3 which took 1 hour and 17 minutes. Let’s file that under pixies.

Repeating the individual shard search performance test from before, we get the following results:Samsung 840 EVO performance reviving
Note that the x-axis is now drive number instead of data age. As can be seen, the drives are remarkably similar in performance. Comparing to the old test, they are at the same speed as the drive with 1 month old data, indicating that the degradation sets in after more than 1 month and not immediately. The raw numbers for the same 7 drives as listed in the first table are:

Months 25% search median search 75% search 95% search mean search Bulk read hours
0.3 34 52 69 329 106 1.00
0.3 41 55 69 256 104 1.00
0.3 50 63 85 330 131 1.00
0.3 39 58 77 301 108 1.00
0.3 40 57 74 314 106 1.01
0.3 37 50 66 254 96 1.01
0.3 24 33 51 344 98 1.00

Running the full distributed search test and plotting the results together with the 1½ month old measurements as well as the latest measurements with the degraded drives gives us the following.

Samsung 840 EVO performance reviving

Performance is back to the same level as 1½ month ago, but how come it is not better than that? A quick inspection of the machine revealed that 2 backup jobs had started and were running during the last test; it is unknown how heavy that impact is on the drives, so the test will be re-run when the backups has finished.

Conclusion

The performance degradation of non-upgraded Samsung 840 EVO drives is very real and the degradation is serious after a couple of months. Should you own such drives, it is highly advisable to apply the fixes from Samsung.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.
This entry was posted in eskildsen, Low-level, Performance. Bookmark the permalink.

One Response to Samsung 840 EVO degradation

  1. Pingback: Finding the bottleneck | Software Development at Statsbiblioteket

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s