Terabyte index, search and faceting with Solr

Vision, data & index building

Providing responsive freetext search and display with faceting and grouping for the full Danish net archive, for a limited number of researchers. The data in the net archive has been harvested from *.dk-domains using the Heritrix harvester since 2005 and currently amounts to 450TB, with approximately 10 billion entities.

Indexing is done with netarchive/netsearch, developed primarily by Thomas Egense and based on UKWA Webarchive Discovery: A central service keeps track of ARC-files, X controllers request the paths for ARC-files and each keeps Y workers running. Each worker uses Tika to analyse the given ARC-file and sends the generated Solr documents to a Solr instance specified by its controller. When the wanted index size is reached (900GB in our instance), the index is optimized down to a single segment and pushed to the search server.
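A minimal sketch of that workflow, using threads inside one process instead of separate services. The names `ArcTracker`, `analyse` and `run_controller` are made up for illustration and are not taken from netarchive/netsearch; the Tika analysis and the Solr update are replaced by stubs.

```python
import queue
import threading


class ArcTracker:
    """Hypothetical stand-in for the central service that tracks ARC-files."""

    def __init__(self, paths):
        self._pending = queue.Queue()
        for path in paths:
            self._pending.put(path)

    def next_path(self):
        """Return the next unprocessed ARC path, or None when none are left."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


def analyse(arc_path):
    """Placeholder for the Tika step: produces one fake document per file."""
    return [{"id": arc_path, "content": "..."}]


def run_controller(tracker, num_workers, send_to_solr):
    """Keep num_workers workers running until the tracker has no more paths."""

    def worker():
        while True:
            path = tracker.next_path()
            if path is None:
                return
            send_to_solr(analyse(path))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


# Stub for the Solr instance chosen by the controller: just collect documents.
indexed = []
indexed_lock = threading.Lock()


def collect(docs):
    with indexed_lock:
        indexed.extend(docs)


arc_paths = [f"arc-{i}.arc" for i in range(10)]
run_controller(ArcTracker(arc_paths), num_workers=4, send_to_solr=collect)
print(len(indexed), "documents sent")
```

In the real setup the workers are separate processes and the stubs are network calls, but the division of responsibility is the same: the tracker knows what is left, the controller decides how many workers run and where their documents go.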

Currently indexing is done on a single 24 core (48 with HT) Linux machine with 256GB RAM and 7TB SSD in RAID 0, running all parts of the indexing workflow. The sweet spot for that machine is 40 workers and 1 Solr instance, resulting in 90%+ CPU usage, primarily used by the Tika workers. It takes about 8 days to build one 900GB index. As of 2014-06-17, 4 such indexes have been built.

The indexing machine is not very well balanced, with way too much RAM: Each worker runs fine with 1GB and Solr takes 32GB in order to handle merging down to a single 900GB index, so 80GB in total would be enough. The SSDs in RAID 0 also have too much free space; 3-4TB would work fine with room for 2 shards. We expect the machine to be used for other jobs when the full indexing has finished and we switch to a much lighter day-to-day index update.
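The RAM estimate can be checked with a few lines of arithmetic. The figures are the ones given in the text; only the variable names are made up.

```python
# Figures from the text: 40 workers at 1GB each plus 32GB for Solr.
WORKERS = 40
GB_PER_WORKER = 1
SOLR_GB = 32
INSTALLED_GB = 256

needed_gb = WORKERS * GB_PER_WORKER + SOLR_GB
print(f"Needed by workers and Solr: {needed_gb}GB of {INSTALLED_GB}GB installed")
print(f"Left over on an 80GB machine: {80 - needed_gb}GB")
```

The few GB left over on an 80GB machine would have to cover the operating system and the remaining parts of the workflow.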

Search architecture

A single rack-mounted Linux server is responsible for handling the full search load. It is a 16 core (32 with HT) machine with 256GB RAM and 2 disk controllers with a total of 24 1TB commodity Samsung 840 SSDs, each mounted individually, each holding a separate index, each handled by a separate Solr instance. Distributed search is done with SolrCloud. The total cost for the search hardware is < €20K.

Search in the web archive is not high-availability: we accept that there can be downtime. Should an SSD fail, search will be unavailable until a new one has been installed and its index restored from backup. We are looking into using the backup files for failed drives directly from the backup storage as a temporary measure until the drives are ready again, but that is only at the investigation stage.

Basic search performance

At the time of testing, the index consists of 4 shards @ 900GB for a total of 3.6TB index data with 1.2 billion documents. Each Solr instance (one per shard) has 8GB of heap. As the machine is built for 20-24 shards, the index data represents just 1/6th of the expected final size. This leaves the machine vastly overpowered in its current state, with a surplus of CPUs and ~220GB of RAM for disk caching.

How overpowered? We tested back when the index was only 3 shards for a total of 2.7TB: User issued queries are handled with edismax on 6 fields and 1 phrase query on the catch-all field, a max of 20 returned documents with 10-15 stored fields of varying size. We tried hitting the server with just a single thread:

256GB RAM, 1 thread, 3 shards, 2.7TB, random words, no faceting
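For reference, the query setup behind this measurement could be expressed roughly as the Solr request parameters below. The parameter names (`defType`, `qf`, `pf`, `rows`, `fl`) are standard Solr ones, but every field name is a made-up placeholder, since the actual schema is not listed here.

```python
from urllib.parse import urlencode


def build_query_params(user_query):
    """Request parameters for an edismax query; all field names are hypothetical."""
    return {
        "q": user_query,
        "defType": "edismax",
        # 6 searchable fields (placeholder names, not the real schema)
        "qf": "title url host domain content_type text",
        # 1 phrase query on the catch-all field (placeholder name)
        "pf": "text",
        # a max of 20 returned documents
        "rows": 20,
        # the stored fields to return (placeholder names, shortened)
        "fl": "id,url,title,content_type",
        "wt": "json",
    }


params = build_query_params("random words")
print("/select?" + urlencode(params))
```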

Response times way below 100ms when the number of hits is below 1 million, better than linear scaling above that? On an unwarmed index? Time to up the ante! What about 20 threads, this time on 4 shards for a total of 3.6TB?

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting

It looks a bit like a whale and with 20K points, it is very hard to make sense of. Time to introduce another way of visualizing the same data:

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting, percentile plot

This is a Box and Whisker plot, showing the quartiles as well as the min and max response times. The measurements are bucketed with 1-9 hits in the first bucket, 10-99 hits in the next and so forth. Again the magic point seems to be around 1M hits before performance begins to drop. The throughput was 66 searches/second. Repeating the search with 40 threads resulted in about the same throughput and about a doubling of the response times, which indicates that the 16 CPUs are the bottleneck.
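The bucketing and the five numbers drawn per box can be computed along these lines. This is a sketch using Python's `statistics` module, not the script that produced the plot, and the sample data is invented.

```python
import statistics
from collections import defaultdict


def bucket_for(hits):
    """Order-of-magnitude bucket: 1-9 hits -> 0, 10-99 hits -> 1, and so forth.

    Queries with 0 hits also land in bucket 0 in this sketch.
    """
    return len(str(hits)) - 1


def summarize(samples):
    """Map each bucket to (min, q1, median, q3, max) of its response times.

    samples is an iterable of (hit_count, response_time_ms) pairs.
    """
    buckets = defaultdict(list)
    for hits, response_ms in samples:
        buckets[bucket_for(hits)].append(response_ms)

    summary = {}
    for bucket, times in sorted(buckets.items()):
        if len(times) < 2:
            continue  # statistics.quantiles needs at least two data points
        q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
        summary[bucket] = (min(times), q1, median, q3, max(times))
    return summary


# Invented measurements: (hits, response time in ms)
samples = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50),
           (12, 15), (40, 25), (99, 35)]
for bucket, stats in summarize(samples).items():
    print(f"{10 ** bucket}+ hits: {stats}")
```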

Now, the Danish web archive is not Google. Due to legislation, the number of concurrent users will normally be 1 and searches will involve statistics and drill-downs, primarily meaning facets. While very impressive, the measurements above are really not representative of the expected use scenario. Time to tweak the knobs again.

Faceting on high-cardinality fields

For the end-scenario, we plan on faceting on 6 fields. One of them is the URL of the harvested resource, with nearly 1 unique value per resource. That means around 300 million unique values per shard, with 1.2 billion in the current 4 shard index and an estimated 7 billion in the final 24 shard index.

Normally it would seem rather unrealistic to facet on 300M+ documents with nearly as many unique values with 8GB of heap (the allocation for each Solr instance), but there are several things that help us here:

  • The URL-field is single value, meaning a smaller and faster internal faceting structure
  • Each index is single segment, so no need to maintain a global-ords-mapping structure, fully skipping this normally costly memory overhead
  • DocValues work really well with high-cardinality fields, meaning low memory overhead

For this experiment we switched back to single threaded requests, but added faceting on the URL field. To make this slightly more representative of the expected final setup we also lowered the amount of RAM to 80GB. This left roughly 40GB for disk caching of the 3.6TB index data, or about 1%.
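The cache-to-index ratios for the two setups work out as follows. The 224GB figure is simply 256GB minus four 8GB Solr heaps, which the text rounds to ~220GB; the 40GB figure is taken directly from the paragraph above.

```python
def cache_fraction(cache_gb, index_tb):
    """Fraction of the index data that fits in the disk cache."""
    return cache_gb / (index_tb * 1000)


INDEX_TB = 3.6
SOLR_HEAPS_GB = 4 * 8  # one 8GB heap per shard

full_ram_cache_gb = 256 - SOLR_HEAPS_GB  # cache upper bound with all RAM enabled
print(f"256GB RAM: {cache_fraction(full_ram_cache_gb, INDEX_TB):.1%} of the index cached")
print(f"80GB RAM:  {cache_fraction(40, INDEX_TB):.1%} of the index cached")
```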

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr faceting on URL

500-1200ms for a search on 3.6TB with very high-cardinality faceting. Nice. But, how come the response time never gets below 500ms? This is due to a technicality in Solr faceting where it iterates counters for all potential facet terms (1.2 billion in this case), not just the ones that are actually updated. A more thorough explanation as well as a solution can be found in the blog post on Sparse Faceting. Let’s see a graph with both Solr standard and sparse faceting:

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

Or viewed as a Box and Whiskers plot, for sparse faceting only:

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

A quite peculiar looking development of response times. Still, looking at the whale graph at the beginning of this post, they do seem somewhat familiar. This is definitely an experiment that could benefit from a re-run with more sample points. Anyway, notice how even searches with 10M hits are faceted in less than 800ms.
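The difference between the two counting strategies discussed earlier in this section can be illustrated with a toy example. This is a simplified sketch of the idea, not the actual Solr or sparse faceting code: the standard variant visits every counter when extracting the result, while the sparse variant also records which counters were touched and visits only those.

```python
def count_standard(num_terms, doc_ordinals):
    """Count term ordinals, then scan ALL counters to extract the result."""
    counts = [0] * num_terms
    for ordinal in doc_ordinals:
        counts[ordinal] += 1

    visited = 0
    result = {}
    for ordinal, count in enumerate(counts):
        visited += 1
        if count:
            result[ordinal] = count
    return result, visited


def count_sparse(num_terms, doc_ordinals):
    """Count term ordinals, tracking touched counters so only those are scanned."""
    counts = [0] * num_terms
    touched = []
    for ordinal in doc_ordinals:
        if counts[ordinal] == 0:
            touched.append(ordinal)  # first update of this counter
        counts[ordinal] += 1

    result = {ordinal: counts[ordinal] for ordinal in touched}
    return result, len(touched)


# A search matching 4 documents in a field with 1000 unique terms.
hits = [7, 42, 42, 900]
standard, standard_visited = count_standard(1000, hits)
sparse, sparse_visited = count_sparse(1000, hits)
print(standard == sparse, standard_visited, sparse_visited)
```

With 1.2 billion potential terms instead of 1000, the full scan is what keeps the standard response time from dropping below 500ms even for searches with very few hits.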


So far our setup for search in the Danish web archive looks very promising. We have shown that searching with faceting on very high-cardinality fields can be achieved with acceptable (< 1 second in our case) response times on relatively cheap hardware. We will continue testing as the index grows and adjust our hardware, should it prove inadequate for the full corpus.

Update 2014-10-11

We have filled 19 out of the 25 SSDs for a total of 17TB or 5.6 billion documents. We have also isolated and partly solved the performance drop at 1000-100,000 hits that is visible in the graphs above; see ten times faster for details. Overall performance remains well within expectations.

Our biggest miscalculation seems to have been for the total size of the full web archive index: 25 SSDs will not be enough to hold it all. Adding another row of 25 SSDs to the single machine seems unrealistic as the current bottleneck is CPU & RAM-speed, not storage I/O. The current plan is to scale out by buying 1 or 2 extra machines with the same specs as the current one.


About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.

8 Responses to Terabyte index, search and faceting with Solr

  1. Johng403 says:

    I think this is among the most significant info for me. And I am glad reading your article. But I want to remark on some general things: the website style is wonderful and the articles are really great. Good job, cheers

  2. Dmitry says:

    Are the single segment indices read-only?

  3. Toke Eskildsen says:

    Yes, read-only. We could have made them live with nightly batch updates or so as the optimization to single segments takes less than 8 hours, but as the data does not change once harvested (the same resource harvested at a later date is considered a new entry, not an update of the old one), there is no need for this functionality. Besides, that would collide with the idea of building segments up to just below the storage capacity of a single SSD.

    When the full index has been built and new web resources are being harvested, we plan on having a live (multi-segment, batch-updated) shard as well as all the static ones.

    Startup-time for the current 3.6TB index is about 35 seconds, but update-to-search-time for a live shard is a lot less (we have only tested ad-hoc so I have no numbers for this). We are fairly confident about the mixed static-live setup, but if it should prove to be problematic, we can live with nightly updates only.

  4. Dmitry says:

    Single read-only segments should be quite OS cache friendly. In our case we do live updates, but I’d say there is a growing number of more static-like shards that could be optimized to a single segment, because we allow very frequent updates to them. In fact the update requests are allowed, but commit requests are kept rare.

    What’s on the edge of live update is always a challenge to keep performant search speed-wise.

    The startup time is slightly puzzling, do you have any explanation as to why it takes 35 seconds to startup?

  5. Toke Eskildsen says:

    I have no ready answer to the 35 second startup time. It included one facet call on the URL field, so maybe it is explained by DocValues initialization logic? Come to think of it, we only specify Xmx, not Xms, so there might also be a delay for expanding the heap? In truth, I have not thought much about that part as restarts/reloads are so rare for the main part of the index.

    Disk cache flush, RAM limitation and Solr restart are handled by Operations and we’re busy preparing for summer vacation, so I can’t really ask for more time on this project right now. I plan on re-testing in August (probably the end of August as things go) and will make a note of paying special attention to start-up time. I will try to let the first query be a simple non-faceting one to get the base startup time separated from faceting initialization.

  6. Dmitry says:

    The reason I paid attention to the startup time was because you said the index was in a single segment. Of course, there can be many reasons why startup time is prolonged, for example transaction log replay, cache warmup queries etc. But just recently we have observed that, all other conditions being equal, index loading time is proportional to the number of segments. We have tried loading a 4,5G index with 240 segments (which was very slow and didn’t start even in an hour) and with 10 segments, which loaded almost instantly.

  7. Nat Taylor says:

    I’m wondering about the differences between single shard + 4 segments versus 4 shards + single segment, in terms of parallelization?

    In my particular use case, I care only about the time spent searching the index to retrieve `numFound` …not the later steps like scoring (where perhaps there are other performance considerations like merging)

  8. Toke Eskildsen says:

    Nat: The 4 segments in the single shard will be handled sequentially, while the 4 shards will be handled in parallel + merging, as you mention. With a simple numFound and rows=0, there is near-zero merging overhead, so I would expect the multi-shard to perform better.
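    A count-only request of that kind could be put together as sketched below. `rows`, `wt` and `shards` are standard Solr request parameters and `response.numFound` is where Solr's JSON response carries the hit count; the shard addresses and the sample response are made up.

```python
def count_params(query, shards=None):
    """Parameters for a request that only needs the hit count (rows=0)."""
    params = {"q": query, "rows": 0, "wt": "json"}
    if shards:
        # Distributed search: comma-separated list of shard addresses
        params["shards"] = ",".join(shards)
    return params


def num_found(response):
    """Extract the hit count from a parsed Solr JSON response."""
    return response["response"]["numFound"]


# Hypothetical shard addresses and a hand-written sample response
params = count_params("foo", shards=["localhost:8983/solr/shard1",
                                     "localhost:8984/solr/shard2"])
sample_response = {"responseHeader": {"status": 0},
                   "response": {"numFound": 1234, "start": 0, "docs": []}}
print(params["shards"], num_found(sample_response))
```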
