We have 25 shards of 900GB / 250M documents. It took us 25 * 8 days = half a year to build them. Three fields did not have DocValues enabled when we build the shards:
- crawl_date (TrieDateField): Unknown number of unique values, 256M values.
- links_domains (multi value Strings): 3M unique values, 675M references.
- links_hosts (multi value Strings): 6M unique values, 841M references.
We need DocValues on those fields for faceting. Not just because of speed and memory, but because Solr is technically unable to do faceting without it, at least on the links_domains & links_hosts fields: The internal structures for field cache faceting does not allow for the number of references we have in our index.
The attempted solution
Faced with the daunting task of re-indexing all shards, Hoss at Stump the Chump got the challenge of avoiding doing so. He suggested building a custom Lucene FilterReader with on-the-fly conversion, then using that to perform a full index conversion. Heureka, DVEnabler was born.
DVEnabler takes an index and a list of which fields to adjust, then writes a corrected index. It is still very much Here Be Dragons and requires the user to be explicit about how the conversion should be performed. Sadly the Lucene index format does not contain the required information for a more automatic conversion (see SOLR-6005 for a status on that). Nevertheless it seems to have reached first usable incarnation.
We tried converting one of our shards with DVEnabler. The good news is that it seemed to work: Our fields were converted to DocValues, we could perform efficient faceting and casual inspection indicated they had the right values. Proper test pending. The bad news is that the conversion took 2 days! For comparison, a non-converting plain optimize took just 8 hours.
Our initial shard building is extremely CPU-heavy: 8 days with 24 cores running 40 Tika-processes at 90%+ CPU utilization. The 8 real time days is 192 CPU core days. Solr merge/optimize is single-threaded, so the conversion to DocValues takes 2 CPU core days, or just 1/100 of the CPU resources needed for full indexing.
At the current time it is not realistic to make the conversion multi-threaded, to take advantage of the 24 cores. But it does mean that we can either perform multiple conversions in parallel or use the machine for building new shards, while conversing the old ones. Due to limited local storage, we can run 2 conversions in parallel, while moving unconverted & converted indexes to and from the machine. This gives us an effective conversion speed of 1 shard / 1 day.