Our web archive index passed the 10TB mark a few days ago, so it was time for new performance measurements. To recap: 12 shards of 900GB each, for a total of 10.7TB or 3.6 billion documents, served from a single 256GB machine with a 1TB SSD dedicated to each shard.
We started with simple sequential searches for random words from a Danish dictionary. No faceting or other fancy stuff. The requests were for the top-10 documents with their stored content. We measured the full processing time (i.e. the whole HTTP GET) and got this:
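The test setup described above can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the endpoint URL, the function names and the `q.op=AND` setting are assumptions (the last one because the hit counts discussed here imply that all query words must match), and the `fetch` argument is only there so the loop can be exercised without a live Solr.

```python
import random
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint: the real host and collection name are not given in the post.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def build_query_url(words, rows=10, base=SOLR_SELECT):
    """Build a select URL asking for the top `rows` documents for the given words."""
    params = {
        "q": " ".join(words),
        "rows": rows,
        "wt": "json",
        "q.op": "AND",  # assumption: every word must match
    }
    return base + "?" + urllib.parse.urlencode(params)


def timed_get(url, fetch=None):
    """Return (elapsed_ms, body), timing the complete HTTP GET from the outside."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    start = time.perf_counter()
    body = fetch(url)
    return (time.perf_counter() - start) * 1000.0, body


def run_test(dictionary, n_queries, words_per_query=2, seed=0, fetch=None):
    """Issue the queries one at a time; collect (words, elapsed_ms, response_size)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_queries):
        words = rng.sample(dictionary, words_per_query)
        elapsed_ms, body = timed_get(build_query_url(words), fetch)
        results.append((words, elapsed_ms, len(body)))
    return results
```

Recording the response size next to the elapsed time costs nothing and makes oversized responses easy to spot afterwards.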
We call it the whale, and we have been a bit obsessed with it since we first saw it 3 months ago, back when the index had 4 shards. Response times for 100 to 1 million hits are great, but what is going on with the response times around 10 hits!? Inspection of the Solr logs showed nothing suspicious: Solr’s reported processing time (QTime) for the individual shards was 1 or 2 ms for the queries in question, while QTime for the merging Solr instance was 20-50 ms. Those are fine numbers.
Some of the queries with 10 hits were “blotlægge gøglertrup”, “eksponeringstid attestering” and “fuldkost hofleverandør” (quite funny in Danish actually; remember the words were selected at random). Those searches all took around 500 ms, measured from outside Solr, with reported QTimes below 50 ms. Could it be an HTTP peculiarity, as Mikkel Kamstrup suggested? Diving into the concrete responses enlightened us.
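The gap between the outside measurement and QTime can be made explicit per request. The sketch below assumes the response was requested as JSON (`wt=json`), where Solr reports `QTime` under `responseHeader` and the hit count as `numFound` under `response`; the function name and the sample response are made up for illustration.

```python
import json


def split_response_time(wall_ms, body):
    """Split an outside-measured response time into QTime and everything else."""
    doc = json.loads(body)
    qtime_ms = doc["responseHeader"]["QTime"]
    return {
        "hits": doc["response"]["numFound"],
        "qtime_ms": qtime_ms,
        # Time not covered by QTime: serialization, transfer, client-side reading.
        "outside_ms": wall_ms - qtime_ms,
        "response_bytes": len(body),
    }


# Made-up response body, shaped like a Solr JSON response with the documents left out.
sample = (
    b'{"responseHeader":{"status":0,"QTime":42},'
    b'"response":{"numFound":10,"start":0,"docs":[]}}'
)
print(split_response_time(500.0, sample))
```

A large `outside_ms` combined with a large `response_bytes` points at the response itself rather than at the search.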
Simple queries get very few hits in a large corpus when the query terms rarely occur in the same document. So which documents have a high chance of containing several random words from the dictionary? A dictionary, of course! In a (hopefully vain) attempt at “search engine optimization”, some Danish web pages have embedded a dictionary below the real content (presumably hidden by making the font color the same as the background or something like that). Normally such pages are ranked low due to the magic of Lucene/Solr, but with very few hits, they still become part of the search result.
So, statistically speaking, searches with few results give us huge pages. Internally in Solr they are still processed quite fast (hence the fine QTime numbers), but serializing the result to XML is not a light task when the result is measured in megabytes. Had we just requested a few fields, such as URL and content_type, there would have been no hiccup. But we requested everything stored, including content_text. If we request just 15 limited-length fields for each document and repeat the test, we get this:
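Restricting the returned fields is done with Solr's `fl` parameter. A minimal sketch, assuming field names as they appear in this post (only two of the 15 are listed, and `url` is a guess at the exact field name); the helper name is hypothetical:

```python
import urllib.parse

# Two of the limited-length fields; a real request would list all 15.
LIMITED_FIELDS = ["url", "content_type"]


def select_params(query, fields=None, rows=10):
    """Encode select parameters; `fields` limits the stored fields via `fl`."""
    params = {"q": query, "rows": rows}
    if fields:
        # Without `fl`, this sketch leaves the field selection to Solr's defaults.
        params["fl"] = ",".join(fields)
    return urllib.parse.urlencode(params)


print(select_params("fuldkost hofleverandør", LIMITED_FIELDS))
```

Leaving `content_text` out of the list is what keeps the megabyte-sized dictionary pages from being serialized.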
Now that was strange. We got rid of the humpback, but overall performance suffered? Does it take more time to ask for specific stored fields than for all of them? Still, response times below 100 ms for the majority of searches are quite acceptable. Mystery considered solved!