Speeding up core search

Issue a query, get back the top-X results. It does not get more basic than that with Solr. So it would be a great win if we could improve on that, right? Truth be told, the answer is still “maybe”, but read on for some thoughts, test code and experimental results.

Getting the top-X results

  • The X in top-X is the maximum result set size we want back from a request.
  • A result set is the documents actually returned – this can be the same size as X or smaller.
  • hits is the number of documents that matched the query – this can be much higher than the size of the result set.

When a top-X search is performed, Solr uses a Priority Queue to keep track of the result set. Such a queue is often implemented as a heap and so it is with Solr: The relevant class in the source is HitQueue.

A ScoreDoc represents a single result document, in the form of a document ID and a score (and sometimes a shard index, but we ignore that for now). The HitQueue holds ScoreDocs and is a simple beast: Offer it a new ScoreDoc and it checks whether the score is higher than that of the lowest-scoring ScoreDoc already in the queue. If so, the new ScoreDoc is added and, if the queue is full, the ScoreDoc with the lowest score is removed.
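To make that behaviour concrete, here is a minimal sketch of such a bounded top-X queue, built on java.util.PriorityQueue rather than Lucene’s own heap code (class and method names are illustrative, not the actual HitQueue API):

import java.util.PriorityQueue;

// Illustrative sketch of a bounded top-X queue of (docID, score) pairs.
// The head of the queue is always the weakest hit, so overflow checks are cheap.
final class SimpleHitQueue {
    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int maxSize;
    // Lowest score first; on score ties the higher docID counts as the weaker hit.
    private final PriorityQueue<ScoreDoc> queue = new PriorityQueue<>(
            (a, b) -> a.score != b.score
                    ? Float.compare(a.score, b.score)
                    : Integer.compare(b.doc, a.doc));

    SimpleHitQueue(int maxSize) { this.maxSize = maxSize; }

    /** Offer a hit; when full, keep it only if it beats the weakest hit in the queue. */
    void insert(int doc, float score) {
        if (queue.size() < maxSize) {
            queue.add(new ScoreDoc(doc, score));
        } else if (queue.peek().score < score) {
            queue.poll();                          // drop the weakest hit
            queue.add(new ScoreDoc(doc, score));
        }
    }

    int size() { return queue.size(); }
}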

Potential problems with the current implementation

  • A simple binary heap has poor memory locality: This means that an insert into the heap often results in multiple memory accesses scattered in the heap structure. This is not a problem for a tiny heap as everything will be in the CPU L2 cache, but for larger heaps that means a lot of uncached memory accesses.
  • The HitQueue in Solr uses ScoreDoc objects as elements in the heap. When the HitQueue is created, it is filled with dummy-ScoreDocs, called sentinels. There are multiple problems with that.
    • Each ScoreDoc object takes up 28 bytes on the heap. If 30 concurrent requests ask for top-2500000, that takes up 2GB of RAM, no matter what the actual result size is (see the back-of-the-envelope calculation after this list). I mention those specific numbers as that was the case for a person on the Solr IRC channel.
    • Each ScoreDoc object is temporary, which means a lot of allocations and a lot of work for the garbage collector to clean up. In the previous case of the top-2500000, the JVM was doing stop-the-world garbage collections half the time.
    • The use of sentinel objects means that the heap is pre-filled with elements that will always be less than any real elements. Adding an element means sifting it from the top of the heap to the bottom. Not a problem for small heaps, but with larger ones it means unnecessary memory thrashing until the heap is full of new elements.
  • If the queue is not filled to its max size (the X in top-X), continuously maintaining a heap structure is not the most efficient solution. In that case it would be better to simply collect all elements and merge-sort them (or similar) when they are to be delivered in order.
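As a back-of-the-envelope check of those numbers: 30 requests × 2,500,000 ScoreDocs × 28 bytes ≈ 2.1 GB of heap, allocated up front regardless of how many documents actually match. With the 8 bytes/element packed representation described later in this post, the same scenario would take roughly 30 × 2,500,000 × 8 bytes ≈ 0.6 GB.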

To sum it up: Solr’s queue in theory works well when requesting a small number of results, but poorly for large numbers. And indeed, that is part of the shared experience with Solr.

Lately CursorMark has been added to Lucene & Solr, which allows for efficient paging of results. One could say that this renders the optimization of requests for large result sets a moot point, but hey, it’s a challenge.

Experimental improvements

Switching to a better heap algorithm would be a worthy experiment, but that is hard so we will skip that for now. Instead we will do away with all those pesky ScoreDoc objects.

The two relevant parts of a ScoreDoc are docID (an integer) and score (a float). The sorting order of two ScoreDocs is determined primarily by the score and secondarily by the docID. There are a lot of such comparisons when using a heap.

In Java, the bits of a float can be extracted by Float.floatToRawIntBits, which produces an integer. This is a very fast operation. Interestingly enough, the sort order of two positive floats is preserved in the integer representations. This means that a ScoreDoc can be converted into a single long with ((long) Float.floatToRawIntBits(score)) << 32 | docID (the cast to long before shifting matters, as shifting an int by 32 is a no-op) and that two such longs are directly comparable.
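A minimal sketch of that packing and its inverse follows; the class and method names are mine, not from Lucene or from the experiment:

// Illustrative helpers for packing a score and a docID into a single long.
// Lucene scores are non-negative, so the raw int bits of the float sort the
// same way as the float itself; the packed longs therefore sort primarily
// by score and secondarily by docID.
final class PackedScore {
    static long pack(float score, int docID) {
        // Cast to long *before* shifting: shifting an int by 32 is a no-op in Java.
        return ((long) Float.floatToRawIntBits(score)) << 32 | (docID & 0xFFFFFFFFL);
    }

    static float unpackScore(long packed) {
        return Float.intBitsToFloat((int) (packed >>> 32));
    }

    static int unpackDocID(long packed) {
        return (int) packed; // low 32 bits
    }
}

Two packed longs can then be compared with a plain < or >, with no object dereferences involved.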

With each element as a long, the object-guzzling ScoreDoc[] in the heap turns into a long[]. Besides taking only 8 bytes/element instead of 28, there is a much higher chance of CPU L2 cache hits as the longs in a long[] are packed tightly together on the heap.

Handling the case of a not-fully-filled heap is simple: Just append new elements at the next free slot in the array. If the array runs full, heapify it and continue using it as a heap. If the queue never runs full, skip the heap maintenance entirely and use the faster merge-sort to deliver the results in order.
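A rough sketch of that strategy, building on the PackedScore helper above (again illustrative code, not the implementation measured below):

import java.util.Arrays;

// Illustrative packed top-X queue: append while there is room, switch to a
// min-heap of longs on overflow, sort once when the results are requested.
final class PackedHitQueue {
    private final long[] elements;
    private int size = 0;
    private boolean isHeap = false; // true once the array has been heap-ordered

    PackedHitQueue(int topX) { elements = new long[topX]; }

    void insert(long packed) {
        if (size < elements.length) {    // queue not yet full: just append
            elements[size++] = packed;
            return;
        }
        if (!isHeap) {                   // first overflow: establish heap order once
            for (int i = size / 2 - 1; i >= 0; i--) siftDown(i);
            isHeap = true;
        }
        if (packed > elements[0]) {      // better than the current weakest hit?
            elements[0] = packed;
            siftDown(0);
        }
    }

    /** The collected hits, best first. */
    long[] results() {
        long[] result = Arrays.copyOf(elements, size);
        Arrays.sort(result);             // ascending sort of primitive longs
        for (int i = 0, j = result.length - 1; i < j; i++, j--) { // reverse to best-first
            long tmp = result[i]; result[i] = result[j]; result[j] = tmp;
        }
        return result;
    }

    private void siftDown(int i) {       // standard min-heap sift-down over long[]
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < size && elements[left] < elements[smallest]) smallest = left;
            if (right < size && elements[right] < elements[smallest]) smallest = right;
            if (smallest == i) return;
            long tmp = elements[i]; elements[i] = elements[smallest]; elements[smallest] = tmp;
            i = smallest;
        }
    }
}

(java.util.Arrays sorts primitive longs with a dual-pivot quicksort rather than a merge-sort, but the point stands: sorting once at delivery time is cheaper than maintaining heap order on every insert when the queue never runs full.)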

Explorative experimentation

Which are just fancy words for poor methodology. Unfortunately the HitQueue used in Solr is not easily replaceable (it extends PriorityQueue, which has a lot of non-overridable methods). So to get a sense of which ideas are worth pursuing, we turn to micro benchmarks. Shalin Shekhar Mangar suggested using JMH. Sound advice, which is postponed for now in favour of “just run it a lot of times, alternating between implementations”.

The test bench is simple: Start a bunch of threads, each running a test in parallel. Each test instantiates a queue implementation, fills it with random docID/score pairs, then empties it. To guard against noisy neighbours, all threads test the same implementation and finish fully before switching to the next implementation.
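Something along these lines (an illustrative reconstruction, not the actual testPQPerformanceReport code):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Crude timing harness: all threads exercise one implementation and finish
// before the next implementation is measured.
final class QueueBench {
    static long time(String name, Supplier<PackedHitQueue> factory,
                     int threads, int hits, int runs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int run = 0; run < runs; run++) {
            CountDownLatch done = new CountDownLatch(threads);
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    ThreadLocalRandom random = ThreadLocalRandom.current();
                    PackedHitQueue queue = factory.get();                // instantiate
                    for (int i = 0; i < hits; i++) {                     // fill with random hits
                        queue.insert(PackedScore.pack(random.nextFloat(),
                                                      random.nextInt(Integer.MAX_VALUE)));
                    }
                    queue.results();                                     // empty it in order
                    done.countDown();
                });
            }
            done.await(); // every thread finishes before the next round starts
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(name + ": " + ms + " ms");
        pool.shutdown();
        return ms;
    }
}

In the real comparison the different queue implementations would be hidden behind a common interface so the same loop can drive Sentinel, No_Sentinel and Packed alike; that plumbing is omitted here.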

For completeness, a Solr HitQueue sans sentinels is also tested. Spoiler: Turning off the sentinel looks like an extremely easy win for large top-X requests.

Experimental results – the original hypothesis

The idea was to improve on large top-X, so let’s look at top-1M. The test was run with 1, 4 & 16 threads on a quad-core i7 laptop, with varying numbers of hits. The raw output of testPQPerformanceReport1M follows:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1     1000000        10    15.06 100%      0.50   3%■     0.97   6% 
      1     1000000       100    13.56 100%      0.47   3%■     0.97   7% 
      1     1000000      1000    13.28 100%      0.56   4%■     1.20   9% 
      1     1000000     10000    21.21 100%      1.28   6%■     1.58   7% 
      1     1000000    100000    86.04 100%     18.43  21%      8.34  10%■
      1     1000000   1000000   954.23 100%    447.40  47%     87.39   9%■
      1     1000000  10000000  3245.70 100%   3544.06 109%    895.89  28%■

      4     1000000        10    27.31 100%      1.47   5%■     3.35  12% 
      4     1000000       100    25.68 100%      1.58   6%■     2.97  12% 
      4     1000000      1000    25.79 100%      1.45   6%■     3.04  12% 
      4     1000000     10000    33.42 100%      2.27   7%■     2.95   9% 
      4     1000000    100000   119.99 100%     19.50  16%     11.52  10%■
      4     1000000   1000000  1456.82 100%    576.17  40%    134.46   9%■
      4     1000000  10000000  5934.11 100%   4278.37  72%   1385.38  23%■

     16     1000000        10   131.92 100%      3.26   2%■     9.79   7% 
     16     1000000       100   120.81 100%      4.08   3%■     8.76   7% 
     16     1000000      1000   124.63 100%      3.01   2%■    10.30   8% 
     16     1000000     10000   162.60 100%      4.68   3%■    10.49   6% 
     16     1000000    100000   485.46 100%     84.27  17%     32.81   7%■
     16     1000000   1000000  4702.79 100%   1787.32  38%    368.57   8%■
     16     1000000  10000000 16563.52 100%  10964.12  66%   4197.17  25%■

Below each implementation (Sentinel, No_Sentinel and Packed) are two columns: how many milliseconds it took to initialize, fill & empty the queue, followed by how long that took relative to the vanilla Solr Sentinel implementation (which is of course 100% for the Sentinel implementation itself). The fastest implementation for any given test is marked with a black box ■.

  • Sentinel starts out relatively slow for small amounts of Hits and gets dog-slow, when the amount of Hits gets to 1M+.
  • No_Sentinel is a lot better for the smaller hit counts (it does not have to do all the initialization) and markedly better up to 1M Hits.
  • Packed is slower than No_Sentinel for the smaller hit counts, but as the number of hits rises, it pulls markedly ahead. There really is no contest beyond 100K hits, where Packed is 3-10x as fast as vanilla Solr Sentinel.
    Notice how there is a relative performance hit when the amount of Hits exceeds the queue size for Packed. This is where it switches from “just fill the array sequentially, then merge-sort at the end” to maintaining a heap.

Experimental results – small top-X requests

A superior implementation for large top-X requests is nice and it is certainly possible to choose implementation based on the value of top-X. But let’s check to see how it behaves for small top-X requests. Top-10 for instance:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1          10        10     0.01 100%      0.01  49%      0.00  45%■
      1          10       100     0.01 100%      0.01  85%      0.01  81%■
      1          10      1000     0.01 100%■     0.02 204%      0.02 219% 
      1          10     10000     0.05 100%■     0.05 101%      0.08 152% 
      1          10    100000     0.59 100%      0.44  74%      0.37  62%■
      1          10   1000000     4.41 100%      4.53 103%      3.64  83%■
      1          10  10000000    42.15 100%     45.37 108%     36.33  86%■

      4          10        10     0.00 100%■     0.00 154%      0.01 329% 
      4          10       100     0.00 100%      0.01 348%      0.00  85%■
      4          10      1000     0.01 100%      0.01 120%      0.01  79%■
      4          10     10000     0.05 100%■     0.07 138%      0.08 154% 
      4          10    100000     0.47 100%      0.46  99%■     0.55 117% 
      4          10   1000000     4.71 100%      6.29 134%      4.45  95%■
      4          10  10000000    72.23 100%     60.09  83%     56.29  78%■

     16          10        10     0.00 100%      0.00  42%      0.00  39%■
     16          10       100     0.00 100%      0.00  72%      0.00  60%■
     16          10      1000     0.01 100%■     0.01 109%      0.01 128% 
     16          10     10000     0.08 100%      0.09 112%      0.07  80%■
     16          10    100000     1.48 100%      1.32  89%      1.12  76%■
     16          10   1000000    17.63 100%     18.74 106%     16.97  96%■
     16          10  10000000   207.65 100%    212.93 103%    192.81  93%■

Measurements for small numbers of Hits are somewhat erratic. That is to be expected as those tests complete very fast (< 1 ms), so tiny variations in machine load or garbage collection have a large impact.

  • Sentinel is as expected very fast, compared to the top-1M test. No surprise there as asking for top-10 (or top-20) is the core request for Solr. The relative speed stays reasonably constant as the number of Hits grows.
  • No_Sentinel is a bit slower than Sentinel. That can be explained by a slightly different code path for insertion of elements – this should be investigated as there is a possibility of an easy optimization.
  • Packed is very interesting: Although it is not consistently better for the lower numbers of Hits (do remember we are still talking sub-millisecond times here), it is consistently a bit faster for the larger ones. There might not be a need for choosing between implementations.

Conclusion

Replacing the ScoreDoc-object based HitQueue in Solr with a bit-packing equivalent is a winning strategy in this test: The speed-up is substantial in some scenarios and the memory usage is roughly a third of vanilla Solr’s.

The road ahead

This should of course be independently verified, preferably on other machine architectures. It should also be investigated how to handle the case where shardIndex is relevant and, more importantly, discussed how to adjust the Lucene/Solr search code to use an alternative HitQueue implementation.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.