DocValues jump tables in Lucene/Solr 8

Lucene/Solr 8 is about to be released. Among a lot of other things, it brings LUCENE-8585, written by yours truly with a heap of help from Adrien Grand. LUCENE-8585 introduces jump-tables for DocValues. It is all about performance and brings speed-ups ranging from worse than baseline to 1000x, extremely dependent on index and access pattern.

This is a follow-up post to Faster DocValues in Lucene/Solr 7+. The previous post contains an in-depth technical explanation of the DocValues mechanisms, while this post focuses on the final implementation.

DocValues?

Whenever the content of a field is to be used for grouping, faceting, sorting, stats or streaming in Solr (or Elasticsearch or Lucene, where applicable), it is advisable to store it using DocValues. Depending on the setup, DocValues are also used for document retrieval.
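
As a minimal Lucene sketch of what this means at indexing time (field names are examples, not the full netarchive schema), a field is typically added both as an indexed field for searching and as a DocValues field for the column-oriented use cases. In Solr, this corresponds to setting docValues="true" on the field definition.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

// Sketch: the same content both indexed for searching and stored
// column-wise as DocValues for grouping/faceting/sorting/stats.
static Document exampleDocument() {
  Document doc = new Document();
  doc.add(new StringField("host", "example.org", Field.Store.NO));        // searchable
  doc.add(new SortedDocValuesField("host", new BytesRef("example.org"))); // DocValues
  doc.add(new NumericDocValuesField("content_length", 12345L));           // numeric DocValues
  return doc;
}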

DocValues in Lucene 7: Linked lists

Lucene 7 shifted the API for DocValues from random access to sequential. This meant a smaller storage footprint and cleaner code, but also caused the worst-case single-value lookup to scale linearly with document count: Getting the value for a DocValued field from the last document in a segment required visiting all the value blocks before it.
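
A small sketch of the sequential API (illustrative, not Lucene-internal code), assuming a numeric DocValues field: the iterator only moves forward, so a single lookup near the end of a segment has to advance past every value block before it.

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

// Sketch: single-value lookup with the sequential Lucene 7+ API. The fresh
// iterator starts before the first document, so reaching a high docID means
// skipping through all the value blocks in between.
static long lookup(LeafReader reader, String field, int docID) throws IOException {
  NumericDocValues dv = DocValues.getNumeric(reader, field);
  return dv.advanceExact(docID) ? dv.longValue() : -1; // -1: no value for docID
}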

The linear access time was not a problem for small indexes, nor for requests for values from a lot of documents, where most blocks need to be visited anyway. Thus the downside of the change was largely unnoticeable, or at least unnoticed. For some setups with larger indexes it was very noticeable, and for some of them it was also noticed. For our netarchive search setup, where each segment has 300M documents, there was a severe performance regression: 5-10x for common interactive use.

Textbook optimization: Jump-tables

The Lucene 7 DocValues structure behaves as a linked list of data-nodes, with the specializations that it is built sequentially and never updated after the build has finished. This makes it possible to collect the node offsets in the underlying index data during build and to store an array of these offsets along with the index data.

With the node offsets quickly accessible, worst-case access time for a DocValue entry becomes independent of document count. Of course, there is a lot more to this: See the previously mentioned Faster DocValues in Lucene/Solr 7+ for details.
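
As a toy model of the principle (deliberately simplified, not Lucene’s actual data structures): with fixed-size blocks, a single array lookup replaces the walk through all preceding blocks.

// Toy model of a jump-table: block start offsets are collected while the
// DocValues data is written sequentially, then stored with the index data
// so a reader can seek directly to the block holding a given docID.
final class JumpTable {
  private final long[] blockOffsets; // offset of block i in the index data
  private final int blockSize;       // documents per block (65536 in Lucene's case)

  JumpTable(long[] blockOffsets, int blockSize) {
    this.blockOffsets = blockOffsets;
    this.blockSize = blockSize;
  }

  /** O(1) lookup of the block containing docID, replacing the linked-list walk. */
  long offsetFor(int docID) {
    return blockOffsets[docID / blockSize];
  }
}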

One interesting detail for jump-tables is that they can be built both as a cache on first access (see LUCENE-8374) and baked into the index-data (see LUCENE-8585). I much preferred having both options available in Lucene, to get an instant speed-up with existing indexes as well as a technically superior implementation for future indexes. Alas, only LUCENE-8585 was deemed acceptable.

Best case test case

Our netarchive search contains 89 Solr collections, each holding 300M documents in 900GB of index data. Each collection is 1 shard, merged down to 1 segment and never updated. Most fields are DocValues and they are heavily used for faceting, grouping, statistics, streaming exports and document retrieval. The impact of LUCENE-8585 should be significant.

In netarchive search, all collections are searched together using an alias. For the tests below only a single collection was used for practical reasons. There are three contenders:

  1. Unmodified Solr 7 collection, using Solr 8.0.0 RC1. Codename Solr 7.
    In this setup, jump-tables are not active: Solr 8.0.0 RC1 includes LUCENE-8585, but that patch only supports index-time jump-tables and the non-upgraded index does not contain them. This is the same as Solr 7 behaviour.
  2. Solr 7 collection upgraded to Solr 8, using Solr 8.0.0 RC1. Codename Solr 8r1.
    In this setup, jump-tables are active and baked into the index data. This is the expected future behaviour when Solr 8 is released.
  3. Solr 7 collection, using Lucene/Solr at git commit point 05d728f57a28b9ab83208eda9e98c3b6a51830fc. Codename Solr 7 L8374.
    During LUCENE-8374 (search-time jump-tables) development, the implementation was committed to master. This was later reverted, but the checkout allows us to see what the performance would have been if that path had been chosen.

Test hardware is a puny 4-core i5 desktop with 16GB of RAM, a 6TB 7200RPM drive and a 1TB SSD. About 9GB of RAM free for disk cache. Due to time constraints only the streaming export test has been done on the spinning drive, the rest is SSD only.

Streaming exports

  • Premise: We use Solr’s export function to extract selected fields from the collection, typically to deliver a CSV file with URLs, MIME types, file sizes etc. for a corpus defined by a given filter. It requires DocValues to work.
  • DV-Problem: The current implementation of streaming export in Solr does not retrieve the field values in document order, making the access pattern extremely random (see the sketch below). This is the absolute worst case for sequential DocValues. Note that SOLR-13013 will hopefully remedy this at some point.
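
A sketch of why this hurts (illustrative, not Solr’s export code), assuming a numeric field: whenever the wanted docID lies behind the iterator’s current position, the iterator has to be re-created and advanced from the start of the segment. Jump-tables make each of those advances cheap.

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

// Sketch: reading values in non-document order with sequential DocValues.
static void export(LeafReader reader, String field, int[] docIDsInExportOrder)
    throws IOException {
  NumericDocValues dv = DocValues.getNumeric(reader, field);
  for (int docID : docIDsInExportOrder) {
    if (docID < dv.docID()) {                   // iterators only move forward...
      dv = DocValues.getNumeric(reader, field); // ...so start over from the top
    }
    if (dv.advanceExact(docID)) {
      System.out.println(dv.longValue());       // deliver the value
    }
  }
}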

The test performs a streaming export of 4 fields for 52,653 documents in the 300M index. The same export is done 4 times, to measure the impact of caching.

curl '<solr>/export?q=text:hestevogn&sort=id+desc&
fl=content_type_ext,content_type_served,crawl_date,content_length'
                      run 1 (s)   run 2 (s)   run 3 (s)   run 4 (s)
Solr 7 spin                1705        1297        1352        1314
Solr 8r1 spin               834           3           2           1
Solr 7 L8374 spin           935           1           1           1
Solr 7 SSD                 1276        1258        1262        1262
Solr 8r1 SSD                 16           1           2           1
Solr 7 L8374 SSD             15           1           1           1

Observation: Both Solr 8r1 and Solr 7 L8374 vastly outperform Solr 7. On a spinning drive there is a multi-minute penalty for run 1, after which the cache has been warmed. This is a well-known phenomenon.

Faceting

  • Premise: Faceting is used everywhere and using DocValues for the requested fields is strongly recommended.
  • DV-Problem: Filling the counters used when faceting is done in document order (sketched below), which works well with sequential access as long as the jumps aren’t too long: Small result sets are penalized relatively harder than large result sets.

The test performs simple term-based searches with top-20 faceting on 6 fields of varying type and cardinality: domain, crawl_year, public_suffix, content_type_norm, status_code and host.
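
For reference, a sketch of the underlying access pattern (illustrative, not Solr’s faceting code): the counters are filled by visiting the hits in increasing docID order and looking up the term ordinal for each one.

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: filling facet counters in docID order. With a small result set the
// docIDs are far apart, so sequential DocValues do a lot of skipping per hit.
static int[] countOrdinals(LeafReader reader, String field, DocIdSetIterator hits)
    throws IOException {
  SortedDocValues dv = DocValues.getSorted(reader, field);
  int[] counts = new int[dv.getValueCount()];
  for (int docID = hits.nextDoc();
       docID != DocIdSetIterator.NO_MORE_DOCS;
       docID = hits.nextDoc()) {
    if (dv.advanceExact(docID)) {
      counts[dv.ordValue()]++;
    }
  }
  return counts;
}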

Reading the graphs: All charts in this blog post follow the same recipe:

  • X-axis is hit count (aka result set size), y-axis is response time (lower is better)
  • Hit counts are bucketed by order of magnitude and for each magnitude, boxes are shown for the three contenders: Blue boxes are Solr 7, pink are Solr 8r1 and green are Solr 7 L8374
  • The bottom of a box is the 25th percentile, the top is the 75th percentile. The black line in the middle is the median. Minimum response time for the bucket is the bottom spike, while the top spike is the 95th percentile
  • Maximum response times are not shown as they tend to jitter severely due to garbage collection

Observation: Modest gains from jump-tables with both Solr 8r1 and Solr 7 L8374.
Surprisingly, the gains scale with hit count; this should be investigated further.

Grouping

  • Premise: Grouping is used in netarchive search to collapse multiple harvests of the same URL. As with faceting, using DocValues for grouping fields is highly recommended.
  • DV-Problem: As with faceting, group values are retrieved in document order and follow the same performance/scale logic.

The test performs simple term-based searches with grouping on the high-cardinality (~250M unique values) field url_norm.

Observations: Modest gains from jump-tables, similar to faceting.

Sorting

  • Premise: Sorting is a basic functionality.
  • DV-Problem: As with faceting and grouping, the values used for sorting are retrieved in document order and follow the same performance/scale logic.

This test performs simple term-based searches with sorting on the high-cardinality field content_length.
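
For reference, a minimal Lucene sketch of such a search (names mirror the test, not the actual test code): the sort values are read from DocValues in docID order while the hits are collected.

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sketch: top-20 hits sorted on the numeric DocValues field content_length.
static TopDocs sortedSearch(IndexSearcher searcher, Query query) throws IOException {
  Sort sort = new Sort(new SortField("content_length", SortField.Type.LONG));
  return searcher.search(query, 20, sort);
}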

Observations: Modest gains from jump-tables.
Contrary to faceting and grouping, performance for high hit counts is the same for all 3 setups, which fits the theoretical model.
It is positively surprising that the theoretical overhead of the jump-tables does not show for higher hit counts.

Document retrieval

  • Premise: Content intended for later retrieval can either be stored explicitly or as DocValues. Doing both means extra storage, but also means that everything is retrieved from the same stored (and compressed) blocks, minimizing random access to the data. For the netarchive search at the Royal Danish Library we don’t double-store field data, and nearly all of the 70 retrievable fields are DocValues.
  • DV-Problem: Getting a search result is a multi-step process. Early on, the top-X documents matching the query are calculated and their IDs are collected. After that, the IDs are used for retrieving document representations. If this is done from DocValues, it means a number of random accesses linear in the number of documents requested (see the sketch below).

The test performs simple term-based relevance-ranked searches for the top-20 matching documents with 9 core fields: id, source_file_s, url_norm, host, domain, content_type_served, content_length, crawl_date and content_language.
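
A sketch of the retrieval step for a single field (illustrative, not Solr’s code): the top-X docIDs are visited in increasing order, but they are scattered across the whole segment, so each advance is effectively a random access.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;

// Sketch: fetching one DocValues field for the top-X documents.
static List<String> retrieveField(LeafReader reader, String field, int[] topDocIDs)
    throws IOException {
  Arrays.sort(topDocIDs); // sequential iterators require increasing docIDs
  SortedDocValues dv = DocValues.getSorted(reader, field);
  List<String> values = new ArrayList<>();
  for (int docID : topDocIDs) {
    values.add(dv.advanceExact(docID) ? dv.binaryValue().utf8ToString() : null);
  }
  return values;
}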

Observations: Solid performance improvement with jump-tables.

Production request

  • Premise: The different functionalities are usually requested in combination. At netarchive search, a typical request uses grouping, faceting, cardinality counting and top-20 document retrieval (see the sketch below).
  • DV-Problem: Combining functionality often means that separate parts of the index data are accessed. This can cause cache thrashing if there is not enough free memory for disk cache. With sequential DocValues, all intermediate blocks need to be visited, increasing the need for disk cache. Jump-tables lower the number of storage requests and are thus less reliant on cache size.

The test performs simple term-based relevance-ranked searches for the top-20 matching documents, doing grouping, faceting, cardinality and document retrieval as described in the tests above.
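
A SolrJ sketch of such a combined request (the URL and collection name are placeholders, and the parameters mirror the tests above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: grouping + faceting + cardinality + top-20 retrieval in one request.
static QueryResponse productionRequest() throws Exception {
  try (HttpSolrClient solr =
           new HttpSolrClient.Builder("http://localhost:8983/solr/netarchive").build()) {
    SolrQuery q = new SolrQuery("text:hestevogn");
    q.setRows(20);                                 // top-20 document retrieval
    q.setFacet(true);
    q.addFacetField("domain", "crawl_year", "public_suffix");
    q.set("group", true);                          // collapse harvests of the same URL
    q.set("group.field", "url_norm");
    q.set("stats", true);                          // cardinality counting
    q.set("stats.field", "{!cardinality=true}url_norm");
    return solr.query(q);
  }
}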

Observations: Solid performance improvement with jump tables.
As with the previous analysis of search-time jump-tables, using multiple DocValues-based functionalities together has a cocktail effect where the combined impact is larger than the sum of the parts. This might be due to disk cache thrashing.

Overall observations & conclusions

  • The effect of jump-tables, both with Solr 8.0.0 RC1 and LUCENE-8374, is fairly limited, except for export and document retrieval, where the gains are solid.
  • The two different implementations of jump-tables perform very similarly. Do remember that these tests do not involve index updates at all: As LUCENE-8374 is search-time, it has a startup penalty when indexes are updated.

For the large single-segment index tested above, the positive impact of jump-tables is clear. Furthermore, there is no significant slowdown for higher hit counts with faceting/grouping/statistics, where the jump-tables have no positive impact.

Before running these tests, it was my suspicion that the search-time jump-tables in LUCENE-8374 would perform better than the baked-in version. This turned out not to be the case. As such, my idea of combining the approaches by creating in-memory copies of some of the on-storage jump-tables has been shelved.

Missing

Performance testing is never complete, it just stops. Some interesting things to explore could be:

  • Spinning drives
  • Concurrent requests
  • Raw search speed with rows=0
  • Smaller corpus
  • Variations of rows, facet.limit and group.limit
  • Kibana and similar data-visualization tools

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.