Net Archive Search

The Danish Net Archive has harvested more than 500TB of primarily Danish web resources since 2005. During 2014, a large part of this was made searchable through Solr. Indexing is an ongoing project and is expected to catch up with the present by mid-2015.

Hardware

[Figure: Index and search hardware for the Danish Net Archive, late 2014. Yes, that is all.]

There are currently two machines running, with a large amount of local storage and without virtualization. Past experience has led to a policy of running non-trivial Solr installations on dedicated hardware and with local solid state drives for index storage.

Belinda is responsible for building index shards. It is a 24-core Intel Xeon machine, or 48-core if one counts Hyper-Threading as core-doubling. It has 256GB of RAM, which is about twice as much as needed. Local storage is 5*1.6TB “read-optimized” enterprise SSDs in RAID 0, which is overkill for an indexer. The bottleneck for the machine is without a doubt CPU power.

Rosalind is responsible for searching. It is a 16-core, 256GB RAM Intel Xeon machine. Local storage is 25*932GB Samsung 840 EVO consumer SSDs, mounted as individual drives. The machine is well-balanced, with no single bottleneck.

The most expensive components in the machines are the solid state drives. The Fall 2014 price of a single 932GB Samsung 840 EVO is $500 / €400. The 1.6TB enterprise drives in belinda are more expensive, but not overly so, as they are “read-optimized”: code-speak for not being suitable for databases or similar workloads with heavy random writes.

Software

The shard builder belinda uses netarchivesuite/netsearch. It keeps track of harvested web resources in the form of ARC files from the Net Archive, extracts data from them using Tika and feeds the result into a single Solr instance running under Tomcat. Each Tika process uses a maximum of 1GB of memory, while the Solr indexer has a maximum of 32GB.
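
To make the flow concrete, here is a minimal sketch of the Tika-to-Solr step. This is not the actual netsearch code: the Solr URL and the field names are assumptions, and netsearch iterates the individual records inside each ARC file rather than parsing plain files.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToSolr {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and field names; the real schema differs
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/netarchive");

        // Parse one resource; netsearch iterates ARC records instead of plain files
        try (InputStream in = new FileInputStream(args[0])) {
            BodyContentHandler text = new BodyContentHandler(-1); // -1: no size limit
            Metadata meta = new Metadata();
            new AutoDetectParser().parse(in, text, meta, new ParseContext());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", args[0]);
            doc.addField("content_type", meta.get(Metadata.CONTENT_TYPE));
            doc.addField("content_text", text.toString());
            solr.add(doc);
        }
        solr.commit();
        solr.shutdown();
    }
}
```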

Shards are self-contained indexes that can be searched in parallel and used as a single super-index. Shards are built to match the SSDs on rosalind and take up 892-902GB each. It takes belinda 8 days to build a single shard, with nearly all processing power being used for the Tika processes. The final step in shard building is a full optimize, down to a single Solr index segment.
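
The full optimize can be triggered through SolrJ; a tiny sketch, with the collection URL as an assumption:

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class OptimizeShard {
    public static void main(String[] args) throws Exception {
        // Assumed URL of the finished shard's Solr
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/netarchive");
        // waitFlush=true, waitSearcher=true, maxSegments=1:
        // merge the whole shard down to a single segment
        solr.optimize(true, true, 1);
        solr.shutdown();
    }
}
```

A single segment avoids per-segment overhead at search time, and since the shards never change after deployment, the usual downside of optimizing (expensive re-merging on updates) does not apply.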

The searcher rosalind runs 25 Solrs with sparse faceting, one searcher per shard per SSD. Each Solr has a maximum heap of 8GB and runs as part of a SolrCloud. Currently 24 out of the 25 Solrs are active, with the final shard expected at the end of November 2014. A 26th Solr is used as the entry point for searches and has an empty index. This leaves room for the memory overhead of merging search results and makes it possible to limit the number of concurrent searches in the SolrCloud.
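
A faceted search against this setup simply goes to the empty entry-point Solr, which fans the request out to the shards and merges the results. A sketch, where the entry-point URL and the facet fields are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetedSearch {
    public static void main(String[] args) throws Exception {
        // Assumed URL of the 26th, empty-index entry-point Solr
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/aggregator");

        SolrQuery query = new SolrQuery(args[0]);
        query.setRows(10);
        query.setFacet(true);
        query.setFacetLimit(25);
        query.addFacetField("domain", "content_type", "crawl_year"); // assumed fields

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults().getNumFound() + " hits in "
                + response.getQTime() + " ms");
        solr.shutdown();
    }
}
```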

Workload

The index is still in its early stages. Current workload shifts between extensive performance tests and light manual experiments by developers and researchers to probe for possibilities. This pattern is expected to shift to heavy batched statistics extraction jobs as well as explorative searches by researchers. Hopefully it will be possible to open up some aggregation functionality to the general public, such as word trends over time or visualization of links between sites.

Due to Danish legislation, the chances of public access to direct searches in the full index are very small. General access to smaller parts of the corpus is more realistic, but will probably be handled by a dedicated index.

Performance

The loosely defined performance goal is a median response time below 2 seconds for standard terms-based searches with faceting on selected fields. Such searches are the basis of our performance tests. Note that all tests are executed with unique queries.
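
The measurement itself can be illustrated with a small sketch: issue a stream of unique faceted queries and report the median response time. The entry-point URL and facet fields are assumptions, and for brevity the unique terms are taken straight from the command line.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class MedianLatency {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/aggregator");
        List<Long> ms = new ArrayList<Long>();

        for (String term : args) { // each term is used only once: unique queries
            SolrQuery query = new SolrQuery(term);
            query.setFacet(true);
            query.setFacetLimit(25);
            query.addFacetField("domain", "content_type"); // assumed fields
            long start = System.nanoTime();
            solr.query(query);
            ms.add((System.nanoTime() - start) / 1000000); // wall-clock milliseconds
        }

        Collections.sort(ms);
        System.out.println("Median response time: " + ms.get(ms.size() / 2) + " ms");
        solr.shutdown();
    }
}
```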

[Figure: 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 20TB / 7 billion documents]

  • Faceted search performance is acceptable with 24 shards and we expect it to stay that way when the final shard is added.
  • “Worst-case” searches, where practically the whole corpus is matched and full faceting is performed, take about 3 minutes.
  • Simple searches without faceting, used for things like brute-force checking for national identification numbers, are processed at about 50 searches/second (a sketch of such a check follows this list).
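
The brute-force check from the last bullet boils down to a plain term query with rows=0, where only the hit count is inspected. A sketch, with the entry-point URL as an assumption:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ExistenceCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/aggregator");
        // Quote the term, request no documents and no faceting: only the hit count matters
        SolrQuery query = new SolrQuery("\"" + args[0] + "\"");
        query.setRows(0);
        long hits = solr.query(query).getResults().getNumFound();
        System.out.println(args[0] + ": " + hits + " hit(s)");
        solr.shutdown();
    }
}
```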

Consumer drives without RAID

The choice of consumer Samsung 840 EVO drives was a budget-conscious decision. At purchase time, enterprise SSDs were still fairly expensive relative to consumer drives and did not provide any technical benefits for our setup. As of late 2014, that is still the case.

Mounting the 25 drives separately instead of, for example, RAID 5-ing them in 5 logical groups of 4*900GB capacity each, was chosen due to the limited usage. Currently a failed drive means about 1 day of downtime (or limiting the search to 24 shards), which is acceptable. On the drawing board is an instant switch to the backup copy on spinning drives if a drive fails.

Future

In order to estimate the impact of shard count on performance, we have tested with different numbers of active shards, from 1 (900GB) to 24 (20TB). The amount of free memory for disk cache was kept constant during these tests.

[Figure: 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 1, 2, 4, 8, 16 & 24 shards]

As can be seen, performance does go down as the number of shards rises. We would expect that scaling further up, by doubling both the storage capacity and the memory of the server, would result in half the total performance. Based on this, the State and University Library, Denmark, will instead be purchasing 2 new machines with the same specifications as the searcher rosalind in early 2015.

We expect to temporarily use the CPU power of one of the new machines to help build shards, so that a new shard will be generated every 4 days instead of 8. The 25 SSDs of the first extra machine should thus be filled in 3 months.

