Vision, data & index building
Providing responsive freetext search and display with faceting and grouping for the full Danish net archive, for a limited number of researchers. The data in the net archive has been harvested from *.dk-domains using the Heritrix harvester since 2005 and currently amounts to 450TB, with approximately 10 billion entities.
Indexing is done with netarchive/netsearch, developed primarily by Thomas Egense and based on UKWA Webarchive Discovery: A central service keeps track of ARC-files, X controllers requests the path for ARC-files and keeps Y workers running. Each worker uses Tika to analyse the given ARC-file and sends the generated Solr documents to a Solr instance, specified by its controller. When the wanted index size is reached (900GB in our instance), the index is optimized down to a single segment and pushed to the search server.
Currently indexing is done on a single 24 core (48 with HT) Linux machine with 256GB RAM and 7TB SSD in RAID 0, running all parts of the indexing workflow. Sweet spot for that machine is 40 workers and 1 Solr, resulting in 90%+ CPU usage, primarily used by the Tika workers. It takes about 8 days to build one 900GB index. As of 2014-06-17, 4 such indexes has been build.
The indexing machine is not very well balanced with way too much RAM: Each worker runs fine with 1GB, Solr takes 32GB in order to handle merging down to a single 900GB index; 80GB would be enough. The SSDs in RAID 0 also have too much free space; 3-4TB would work fine with room for 2 shards. We expect the machine to be used for other jobs when the full indexing has finished and we switch to a much lighter day-to-day index update.
A single rack-mounted Linux server is responsible for handling the full search load. It is an 16 core (32 with HT) machine with 256GB RAM and 2 disk controllers with a total of 24 1TB commodity Samsung 840 SSDs, each mounted individually, each holding a separate index, each handled by a separate Solr instance. Distributed search is done with SolrCloud. The total cost for the search hardware is < €20K.
Search in the web archive is not high-availability – we accept that there can be downtime. Should a SSD fail, search will be unavailable until a new one has been installed and its index restored from backup. We are looking into using the backup files for failed drives directly from the backup storage as a temporary measure until the drives are ready again, but that is only at the investigation stage.
Basic search performance
At the time of testing, the index consists of 4 shards @ 900GB for a total of 3.6TB index data with 1.2 billion documents. Each Solr instance (one per shard) has 8GB of heap. As the machine is build for 20-24 shards, the index data represents just 1/6th of the expected final size. This leaves the machine vastly overpowered in its current state, with a surplus of CPUs and ~220GB of RAM for disk caching.
How overpowered? We tested back when the index was only 3 shards for a total of 2.7TB: User issued queries are handled with edismax on 6 fields and 1 phrase query on the catch-all field, a max of 20 returned documents with 10-15 stored fields of varying size. We tried hitting the server with just a single thread:
Response times way below 100ms when the number of hits are below 1 million, better than linear scaling above that? On an unwarmed index? Time to up the ante! What about 20 threads, this time on 4 shards for a total of 3.6TB?
It looks a bit like a whale and with 20K points, it is very hard to make sense of. Time to introduce another way of visualizing the same data:
This is a Box and Whisker plot, showing the quartiles as well as the min and max response times. The measurements are bucketed with 1-9 hits in the first bucket, 10-99 hits in the next and so forth. Again the magic point seems to be around 1M hits before performance begins to drop. The throughput was 66 searches/second. Repeating the search with 40 threads resulted in about the same throughput and about a doubling of the response times, which indicates that the 16 CPUs is the bottleneck.
Now, the Danish web archive is not Google. Due to legislation, the number of concurrent users will normally be 1 and searches will involve statistics and drill-downs, primarily meaning facets. While very impressive, the measurements above are really not representative of the expected use scenario. Time to tweak the knobs again.
Faceting on high-cardinality fields
For the end-scenario, we plan on faceting on 6 fields. One of them is the URL of the harvested resource, with nearly 1 unique value per resource. That means around 300 million unique values per shard, with 1.2 billion in the current 4 shard index and an estimated 7 billion in the final 24 shard index.
Normally it would seem rather unrealistic to facet on 300M+ documents with nearly as many unique values with 8GB of heap (the allocation for each Solr instance), but there are several things that helps us here:
- The URL-field is single value, meaning a smaller and faster internal faceting structure
- Each index is single segment, so no need to maintain a global-ords-mapping structure, fully skipping this normally costly memory overhead
- DocValues works really well with high-cardinality fields, meaning low memory overhead
For this experiment we switched back to single threaded requests, but added faceting on the URL field. To make this slightly more representative of the expected final setup we also lowered the amount of RAM to 80GB. This left 40GB- for disk caching of the 3.6TB index data, or about 1%.
500-1200ms for a search on 3.6TB with very high-cardinality faceting. Nice. But, how come the response time never gets below 500ms? This is due to a technicality in Solr faceting where it iterates counters for all potential facet terms (1.2 billion in this case), not just the ones that are actually updated. A more thorough explanation as well as a solution can be found in the blog post on Sparse Faceting. Let’s see a graph with both Solr standard and sparse faceting:
Or viewed as a Box and Whiskers plot, for sparse faceting only:
A quite peculiar looking development of response times. Still, looking at the whale graph at the beginning of this post, they do seem somewhat familiar. This is definitely an experiment that could benefit from a re-run with more sample points. Anyway, notice how even searches with 10M hits are faceted in less than 800ms.
So far our setup for search in the Danish web archive looks very promising. We have showed that searching with faceting on very high-cardinality fields can be achieved with acceptable (< 1 second in our case) response times on relatively cheap hardware. We will continue testing as the index grows and adjust our hardware, should it prove inadequate for the full corpus.
We have filled 19 out of the 25 SSDs for a total of 17TB or 5.6 billion documents. We have also isolated and partly solved the performance drop at 1000-100,000 hits that is visible in the graphs above; see ten times faster for details. Overall performance remains well within expectations.
Our biggest miscalculation seems to have been for the total size of the the full web archive index: 25 SSDs will not be enough to hold it all. Adding another row of 25 SSDs to the single machine seems unrealistic as the current bottleneck is CPU & RAM-speed, not storage I/O. The current plan is to scale out by buying 1 or 2 extra machines with the same specs as the current.