The ones that got away

November 13, 2015

Two and a half ideas for improving Lucene/Solr performance that did not work out.

Track the result set bits

At the heart of Lucene (and consequently also Solr and ElasticSearch), there is a great amount of doc ID set handling. Abstractly speaking, a doc ID set is a collection of internal document IDs. If there are 10 documents in the index and a query matches the first and the fourth, the set would be [0, 3] as we count from 0. Every time there is filtering or faceting, bit sets are used.

SortedIntDocSet in Solr is one example of a document ID set. The IDs are stored sequentially in an int[]. This takes up 32 bits per document in the set and is used for small result sets only.

FixedBitSet in Lucene is another example. It reserves 1 bit for each document in the full corpus. If the bit for a document is set, it means that the document is part of the set. All the bits are stored in a long[], which makes most operations quite fast: To calculate the union, the relevant code is

int pos = Math.min(numWords, otherNumWords);
while (--pos >= 0) {
  thisArr[pos] |= otherArr[pos];
}
FixedBitSet is very widely used. It is referenced in the Lucene/Solr code 451 times! As opposed to SortedIntDocSet et al, it can contain result sets of any size and reading or changing a specific bit is O(1). It does have a weakness though: Sparse sets.

Nearly all operations on FixedBitSet take time linear to the total number of documents in the index, not to the number of documents in the set. Calculating the union of [5, 87] and [1, 5, 1056978], when there are 20 million documents in the index, requires stepping through all ~312,500 longs (20M / 64) in each array to check for set bits.

Idea: Tracking changes to the structure works really well for sparse faceting, so why not for FixedBitSets? Add a new tracking bitset where each bit corresponds to a single long in the underlying FixedBitSet. If the tracking bit is set, the FixedBitSet long contains at least one set bit. To locate set bits, iterate the tracking bitset and look for longs != 0. This reduces the number of unnecessary lookups by a factor of 64. Of course a third bitset can be added on top of the two, reducing unnecessary lookups by a total factor of 4096.
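A minimal sketch of the idea, assuming two sets of equal size (the actual implementation linked below differs and also handles iteration, cardinality etc.):

public class TrackedDocBitSet {
  private final long[] bits;   // 1 bit per document, as in FixedBitSet
  private final long[] track;  // 1 bit per long in bits

  public TrackedDocBitSet(int numBits) {
    bits = new long[(numBits + 63) >>> 6];
    track = new long[(bits.length + 63) >>> 6];
  }

  public void set(int docID) {
    int word = docID >>> 6;
    bits[word] |= 1L << docID;        // Java masks the shift to the low 6 bits
    track[word >>> 6] |= 1L << word;  // flag the word as containing set bits
  }

  // Union: only the words flagged in the other set's tracker are visited,
  // reducing unnecessary lookups by up to a factor of 64 for sparse sets.
  public void or(TrackedDocBitSet other) {
    for (int tw = 0; tw < other.track.length; tw++) {
      long flags = other.track[tw];
      while (flags != 0) {
        int word = (tw << 6) + Long.numberOfTrailingZeros(flags);
        bits[word] |= other.bits[word];
        track[tw] |= 1L << word;
        flags &= flags - 1;           // clear the lowest flag bit
      }
    }
  }
}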

An implementation is available at GitHub with a performance test. Instructions for running are in the TRACKED_README.txt file. Some reduced sample runs, where open is the vanilla FixedBitSet and tracked is the tracked version:

2K / 20M bits set (0.01% filled)
test         impl        ms
union        open      0.42
union        tracked   0.16
cardinality  open      0.34
cardinality  tracked   0.09
icount       open      0.59
icount       tracked   0.57
nextSetBit   open      0.31
nextSetBit   tracked   0.12
iterator     open      0.32
iterator     tracked   0.14

20K / 20M bits set (0.1% filled)
test         impl        ms
union        open      0.81
union        tracked   1.08
cardinality  open      0.34
cardinality  tracked   0.49
icount       open      0.58
icount       tracked   2.18
nextSetBit   open      0.58
nextSetBit   tracked   0.79
iterator     open      0.68
iterator     tracked   0.80

200K / 20M bits set (1.0% filled)
test         impl        ms
union        open      3.51
union        tracked   4.93
cardinality  open      0.43
cardinality  tracked   1.48
icount       open      0.59
icount       tracked  10.26
nextSetBit   open      3.21
nextSetBit   tracked   4.24
iterator     open      3.98
iterator     tracked   2.54

Technically the implementation delivers: Sparse result sets are processed faster. Unfortunately, most operations require the sets to be less than 0.1% filled and, worse, the overhead for non-sparse bitsets is prohibitive. There might be some special places where the tracked bitset can be used, but it is not viable as a general replacement for FixedBitSet.

Better scaling of top-X searches

This idea got pretty far, with the JIRA issue LUCENE-6828 and an analysis of the speed-up.

Requesting the top-X relevance ranked results from Lucene/Solr is pretty much the core use case. The underlying implementation uses a binary heap based priority queue with Java Objects. Due to the cache-adverse nature of binary heaps and the large overhead of tiny Objects, this does not scale well: Do not ask for top-1M hits, and really do not ask for top-100M.

Using an array based bit packed heap implementation yields superior space- and speed-performance for top-100K+ requests (3x+ for both memory and speed, plus a lot less GC activity; see the previously linked analysis for details), with the advantages growing with the number of concurrent requests. No win for small top-X requests, but no big loss either.

So success? Sorta, kinda. Turns out that there are generally two scenarios:

  1. Request a relatively modest top-X, such as top-10 or top-1000.
  2. Export the full result set, which can be millions or billions.

As the array-based heap has no advantage over the vanilla Solr heap for small result sets and as full exports are much better handled by Solr streaming export, there is near-zero need for the middle ground that the packed heap provides. It is a solution looking for a problem.

Cache-friendly priority queue

Since the vanilla Solr binary heap is cache-adverse, why not try and fix that problem? Yes, totally ignoring the “there is no problem as nobody requests large top-X results”-part here.

The idea is fairly simple: Binary heaps work well when the heap can be fully contained in the CPU cache, so making a heap of heaps means that most operations will be done on data that are already cached. There is a very nice write-up on this at Playful Programming: Cache optimizing a priority queue.

This was implemented in Java at the hitdocs-trunk branch. The measurements below are from testPQPerformanceAtomicHeapsLimited, where sentinel is vanilla Solr, packed is the array-based heap described above and bheap4 is the best performing heaps-within-heaps implementation. Lower percentages are better and the winning implementation for each test is marked with ■.

Threads       Top-X      Hits      Sentinel         Packed         BHeap4 
      1     1000000    100000   118.49 100%     11.06   9%■    20.47  17% 
      1     1000000   1000000  1356.56 100%    133.51  10%■   270.82  20% 
      1     1000000   2000000  2216.50 100%    566.67  26%    472.43  21%■
      1     1000000   5000000  3406.35 100%    827.23  24%    771.33  23%■
      1     1000000  10000000  4183.84 100%   1054.94  25%   1012.78  24%■
      4     1000000    100000   246.00 100%     22.38   9%■    41.73  17% 
      4     1000000   1000000  2154.64 100%    232.00  11%■   504.80  23% 
      4     1000000   2000000  3356.96 100%   1057.01  31%    772.90  23%■
      4     1000000   5000000  5022.50 100%   1484.44  30%   1227.93  24%■
      4     1000000  10000000  6503.33 100%   1910.10  29%   1891.28  29%■
     16     1000000    100000  1310.95 100%     68.29   5%■   119.87   9% 
     16     1000000   1000000  6731.49 100%    605.90   9%■  1405.25  21% 
     16     1000000   2000000 11882.59 100%   3713.82  31%   2744.36  23%■
     16     1000000   5000000 19589.75 100%   5389.25  28%   4447.21  23%■
     16     1000000  10000000 25522.07 100%   6921.96  27%   4843.61  19%■

Note: The cases where hits <= top-X (100K and 1M) are special as packed sorts result sets with hits <= top-X using the more cache-friendly merge sort. This explains why packed is twice as fast as bheap4 for these cases. A similar optimization could be implemented for bheap4.

Looking through rose-tinted glasses, bheap4 is a bit better than packed (and they are both vastly superior to sentinel). But the implementation of bheap4 is bordering on voodoo, with the multi-level offset calculations and all the bit-tricks needed to avoid doing expensive multiplications and modulo calculations.

Hopefully the speed of bheap can be improved further – the Playful Programming author Björn Fahller certainly got fine results with his C++ implementation. But until that happens, its minuscule gains are not worth the added complexity.

Speeding up core search

October 5, 2015

Issue a query, get back the top-X results. It does not get more basic than that with Solr. So it would be a great win if we could improve on that, right? Truth be told, the answer is still “maybe”, but read on for some thoughts, test code and experimental results.

Getting the top-X results

  • The X in top-X is the maximum result set size we want back from a request.
  • A result set is the documents actually returned – this can be the same as or smaller than X.
  • hits is the number of documents that matched the query – this can be much higher than the size of the result set.

When a top-X search is performed, Solr uses a Priority Queue to keep track of the result set. Such a queue is often implemented as a heap and so it is with Solr: The relevant class in the source is HitQueue.

A ScoreDoc represents a single result document, in the form of a document ID and a score (and sometimes a shard index, but we ignore that for now). The HitQueue holds ScoreDocs and is a simple beast: Add a new ScoreDoc and it will check to see if it has a higher score than any of the existing ScoreDocs. If so, the new ScoreDoc is added. If the queue is full, the ScoreDoc with the lowest score is removed.

Potential problems with the current implementation

  • A simple binary heap has poor memory locality: This means that an insert into the heap often results in multiple memory accesses scattered in the heap structure. This is not a problem for a tiny heap as everything will be in the CPU L2 cache, but for larger heaps that means a lot of uncached memory accesses.
  • The HitQueue in Solr uses ScoreDoc objects as elements in the heap. When the HitQueue is created, it is filled with dummy-ScoreDocs, called sentinels. There are multiple problems with that.
    • Each ScoreDoc object takes up 28 bytes on the heap. If 30 concurrent requests ask for top-2500000, that takes up 2GB of RAM, no matter what the actual result size is. I mention those specific numbers as that was the case for a person on the Solr IRC channel.
    • Each ScoreDoc object is temporary, which means a lot of allocations and a lot of work for the garbage collector to clean up. In the previous case of the top-2500000, the JVM was doing stop-the-world garbage collections half the time.
    • The use of sentinel objects means that the heap is pre-filled with elements that will always be less than any real elements. Adding an element means sifting it from the top of the heap to the bottom. Not a problem for small heaps, but with larger ones it means unnecessary memory thrashing until the heap is full of new elements.
  • If the queue is not filled to its max size (the X in top-X), the ongoing maintenance of a heap structure is not the most efficient solution. In that case it would be better to simply collect all elements and do a merge-sort or similar when they are to be delivered in order.

To sum it up: Solr’s queue in theory works well when requesting a small number of results, but poorly for large numbers. And indeed, that is part of the shared experience with Solr.

Lately CursorMark has been added to Lucene & Solr, which allows for efficient paging of results. One could say that this renders the optimization of requests for large result sets a moot point, but hey, it’s a challenge.

Experimental improvements

Switching to a better heap algorithm would be a worthy experiment, but that is hard so we will skip that for now. Instead we will do away with all those pesky ScoreDoc objects.

The two relevant parts of a ScoreDoc are docID (an integer) and score (a float). The sorting order of two ScoreDocs is determined primarily by the score and secondarily by the docID. There are a lot of such comparisons when using a heap.

In Java, the bits of a float can be extracted by Float.floatToRawIntBits, which produces an integer. This is a very fast operation. Interestingly enough, the sort order of two positive floats is preserved in the integer representations. This means that a ScoreDoc can be converted into an atomic long with ((long) Float.floatToRawIntBits(score)) << 32 | docID and that two such longs are directly comparable.
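A sketch of the conversion, with hypothetical helper names. Note the cast to long: in Java, shifting an int by 32 positions is a no-op, so the score bits must be widened before the shift. For non-negative scores, comparing two packed longs then orders primarily by score and secondarily by docID.

static long pack(float score, int docID) {
  // widen to long before shifting; an int shifted by 32 would be unchanged
  return ((long) Float.floatToRawIntBits(score)) << 32 | (docID & 0xFFFFFFFFL);
}

static float unpackScore(long element) {
  return Float.intBitsToFloat((int) (element >>> 32));
}

static int unpackDocID(long element) {
  return (int) element; // the low 32 bits
}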

With each element as a long, the object-guzzling ScoreDoc[] in the heap turns into a long[]. Besides taking only 8 bytes/element instead of 28, there is a much higher chance of CPU L2 cache hits as the longs in a long[] are packed tightly together on the heap.

Handling the case of a not-fully-filled heap is simple: Just add the elements to the array at the next free space. If the array gets full, heap-sort it and continue using it as a heap. If the queue does not run full, use the faster merge-sort to deliver the results in order.
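A compact sketch of such a queue (a hypothetical class, not the actual LUCENE-6828 patch): elements are the packed longs from above, and larger values are better.

import java.util.Arrays;

public class PackedTopX {
  private final long[] elements;
  private int size = 0;

  public PackedTopX(int topX) {
    elements = new long[topX];
  }

  public void insert(long packed) {
    if (size < elements.length) {
      elements[size++] = packed;        // plain append while not full
      if (size == elements.length) {    // just ran full: switch to heap mode
        for (int i = (size >>> 1) - 1; i >= 0; i--) {
          siftDown(i);                  // bottom-up heapify, O(n)
        }
      }
    } else if (packed > elements[0]) {  // beats the current minimum
      elements[0] = packed;
      siftDown(0);
    }
  }

  // Delivers the collected elements in descending order.
  public long[] sortedResult() {
    long[] result = Arrays.copyOf(elements, size);
    Arrays.sort(result);                // the post suggests merge-sort; Java's
                                        // primitive sort is used here for brevity
    for (int i = 0, j = result.length - 1; i < j; i++, j--) {
      long tmp = result[i]; result[i] = result[j]; result[j] = tmp;
    }
    return result;
  }

  private void siftDown(int i) {        // standard min-heap sift-down
    long value = elements[i];
    while (true) {
      int child = (i << 1) + 1;
      if (child >= size) {
        break;
      }
      if (child + 1 < size && elements[child + 1] < elements[child]) {
        child++;                        // pick the smaller child
      }
      if (value <= elements[child]) {
        break;
      }
      elements[i] = elements[child];
      i = child;
    }
    elements[i] = value;
  }
}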

Explorative experimentation

Which are just fancy words for poor methodology. Unfortunately the HitQueue used in Solr is not easily replaceable (it extends PriorityQueue, which has a lot of non-overridable methods). So to get a gist of which ideas are worth pursuing, we turn to micro benchmarks. Shalin Shekhar Mangar suggested using JMH. Sound advice, which is postponed for now in favour of “just run it a lot of times, alternating between implementations”.

The test bench is simple: Start a bunch of Threads, each running a test in parallel. Each test instantiates a queue implementation, fills it with random docID/score-pairs, then empties it. To guard against noisy neighbours, all threads test the same implementation and finish fully before switching to the next implementation.

For completeness, a Solr HitQueue sans sentinels is also tested. Spoiler: Turning off the sentinel looks like an extremely easy win for large top-X requests.

Experimental results – the original hypothesis

The idea was to improve on large Top-X, so let’s look at top-1M. The test was with 1, 4 & 16 threads on a quad-core i7 laptop and with varying amounts of hits. The raw output of the testPQPerformanceReport1M follows:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1     1000000        10    15.06 100%      0.50   3%■     0.97   6% 
      1     1000000       100    13.56 100%      0.47   3%■     0.97   7% 
      1     1000000      1000    13.28 100%      0.56   4%■     1.20   9% 
      1     1000000     10000    21.21 100%      1.28   6%■     1.58   7% 
      1     1000000    100000    86.04 100%     18.43  21%      8.34  10%■
      1     1000000   1000000   954.23 100%    447.40  47%     87.39   9%■
      1     1000000  10000000  3245.70 100%   3544.06 109%    895.89  28%■

      4     1000000        10    27.31 100%      1.47   5%■     3.35  12% 
      4     1000000       100    25.68 100%      1.58   6%■     2.97  12% 
      4     1000000      1000    25.79 100%      1.45   6%■     3.04  12% 
      4     1000000     10000    33.42 100%      2.27   7%■     2.95   9% 
      4     1000000    100000   119.99 100%     19.50  16%     11.52  10%■
      4     1000000   1000000  1456.82 100%    576.17  40%    134.46   9%■
      4     1000000  10000000  5934.11 100%   4278.37  72%   1385.38  23%■

     16     1000000        10   131.92 100%      3.26   2%■     9.79   7% 
     16     1000000       100   120.81 100%      4.08   3%■     8.76   7% 
     16     1000000      1000   124.63 100%      3.01   2%■    10.30   8% 
     16     1000000     10000   162.60 100%      4.68   3%■    10.49   6% 
     16     1000000    100000   485.46 100%     84.27  17%     32.81   7%■
     16     1000000   1000000  4702.79 100%   1787.32  38%    368.57   8%■
     16     1000000  10000000 16563.52 100%  10964.12  66%   4197.17  25%■

Below each implementation (Sentinel, No_Sentinel and Packed) are two columns: How many milliseconds it took to initialize, fill & empty the queue, followed by how long it took relative to the vanilla Solr Sentinel implementation (this is of course 100% for the Sentinel implementation itself). The fastest implementation for any given test is marked with a black box ■.

  • Sentinel starts out relatively slow for small amounts of Hits and gets dog-slow when the amount of Hits reaches 1M+.
  • No_Sentinel is a lot better for the smaller hit counts (it does not have to do all the initialization) and markedly better up to 1M Hits.
  • Packed is slower than No_Sentinel for the smaller hit counts, but as the number of hits rises, it pulls markedly ahead. There really is no contest beyond 100K hits, where Packed is 3-10x as fast as vanilla Solr Sentinel.
    Notice how there is a relative performance hit when the amount of Hits exceeds the queue size for Packed. This is where it switches from “just fill the array sequentially, then merge-sort at the end” to maintaining a heap.

Experimental results – small top-X requests

A superior implementation for large top-X requests is nice and it is certainly possible to choose implementation based on the value of top-X. But let’s check to see how it behaves for small top-X requests. Top-10 for instance:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1          10        10     0.01 100%      0.01  49%      0.00  45%■
      1          10       100     0.01 100%      0.01  85%      0.01  81%■
      1          10      1000     0.01 100%■     0.02 204%      0.02 219% 
      1          10     10000     0.05 100%■     0.05 101%      0.08 152% 
      1          10    100000     0.59 100%      0.44  74%      0.37  62%■
      1          10   1000000     4.41 100%      4.53 103%      3.64  83%■
      1          10  10000000    42.15 100%     45.37 108%     36.33  86%■

      4          10        10     0.00 100%■     0.00 154%      0.01 329% 
      4          10       100     0.00 100%      0.01 348%      0.00  85%■
      4          10      1000     0.01 100%      0.01 120%      0.01  79%■
      4          10     10000     0.05 100%■     0.07 138%      0.08 154% 
      4          10    100000     0.47 100%      0.46  99%■     0.55 117% 
      4          10   1000000     4.71 100%      6.29 134%      4.45  95%■
      4          10  10000000    72.23 100%     60.09  83%     56.29  78%■

     16          10        10     0.00 100%      0.00  42%      0.00  39%■
     16          10       100     0.00 100%      0.00  72%      0.00  60%■
     16          10      1000     0.01 100%■     0.01 109%      0.01 128% 
     16          10     10000     0.08 100%      0.09 112%      0.07  80%■
     16          10    100000     1.48 100%      1.32  89%      1.12  76%■
     16          10   1000000    17.63 100%     18.74 106%     16.97  96%■
     16          10  10000000   207.65 100%    212.93 103%    192.81  93%■

Measurements for small numbers of Hits are somewhat erratic. That is to be expected as those tests complete very fast (< 1ms), so tiny variations in machine load or garbage collection have a large impact.

  • Sentinel is as expected very fast, compared to the top-1M test. No surprise there as asking for top-10 (or top-20) is the core request for Solr. The relative speed stays reasonably constant as the number of Hits grows.
  • No_Sentinel is a bit slower than Sentinel. That can be explained by a slightly different code path for insertion of elements – this should be investigated as there is a possibility of an easy optimization.
  • Packed is very interesting: Although it is not consistently better for the lower numbers of Hits (do remember we are still talking sub-millisecond times here), it is consistently a bit faster for the larger ones. There might not be a need for choosing between implementations.


Replacing the ScoreDoc-object based HitQueue in Solr with a bit-packing equivalent is a winning strategy in this test: The speed-up is substantial in some scenarios and the memory usage is about ⅓ of vanilla Solr’s.

The road ahead

This should of course be independently verified, preferably on other machine architectures. It should be investigated how to handle the case where shardIndex is relevant and, more importantly, it should be discussed how to adjust the Lucene/Solr search code to use an alternative HitQueue implementation.

Light validation of Solr configuration

September 23, 2015

This week we were once again visited by the Edismax field alias bug in Solr: Searches with boosts, such as foo^2.5, stopped working. The problem arises when an alias with one or more non-existing fields is defined in solrconfig.xml, and it is tedious to track down as one needs to check all the referenced fields for existence.

We have 10+ different Solr setups and we use aliases in most of them. So a quick script was whipped together, which (…wait for it…) validates Solr configs. Nothing fancy, and it tends to report false positives when things are commented out in the XML files. Still, it does check that

  • all fields in schema.xml reference existing field types
  • all copyFields in schema.xml reference existing fields
  • all fields referenced in solrconfig.xml are defined in schema.xml
  • no alias in solrconfig.xml has the same name as a field in schema.xml

Some of these problems, such as referencing a non-existing field in mlt.fl or pf in solrconfig.xml, are silent and hard to track down: Solr does not complain and searches seem to work. But in the case of misspelled field names, the result is poorer quality searches as the intended functionality is not activated.

Cross-validation of fields used in solrconfig.xml and schema.xml would be nice to have as part of Solr core startup, but until then the script might be of use. Get it at GitHub.

Sampling methods for heuristic faceting

July 30, 2015

Initial experiments with heuristic faceting in Solr were encouraging: Using just a sample of the result set, it was possible to get correct facet results for large result sets, reducing processing time by an order of magnitude. Alas, further experimentation unearthed that the sampling method was vulnerable to clustering. While heuristic faceting worked extremely well for most of the queries, it failed equally hard for a few of them.

The problem

Abstractly, faceting on Strings is a function that turns a collection of documents into a list of top-X terms plus the number of occurrences of these terms. In Solr the collection of documents is represented with a bitmap: One bit per document; if the bit is set, the document is part of the result set. The result set of 13 hits for an index with 64 documents could look like this:

00001100 01010111 00000000 01111110

Normally the faceting code would iterate all the bits, get the terms for the ones that are set and update the counts for those terms. The iteration of the bits is quite fast (1 second for 100M bits), but getting the terms (technically the term ordinals) and updating the counters takes more time (100 seconds for 100M documents).
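As a sketch, the baseline loop could look like this, with java.util.BitSet standing in for the result bitmap and ordinals being a hypothetical docID → term ordinals lookup:

import java.util.BitSet;

// Baseline: visit every hit and update the counter for each of its term
// ordinals. The ordinal lookup + counter update dominates the cost.
static void countAll(BitSet hits, int[][] ordinals, int[] counters) {
  for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
    for (int ord : ordinals[doc]) {
      counters[ord]++;
    }
  }
}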

Initial attempt: Sample the full document bitmap

The initial sampling was done by dividing the result set into chunks and only visiting those chunks. If we wanted to sample 50% of our result set and wanted to use 4 chunks, the parts of the result set to visit could be the ones marked with ^ below:

4 chunks: 00001100 01111110 00000000 01010111
visited:  ^^^^.... ^^^^.... ^^^^.... ^^^^....

As can be counted, the sampling hit 5 documents out of 13. Had we used 2 chunks, the result could be

2 chunks: 00001100 01111110 00000000 01010111
visited:  ^^^^^^^^ ........ ^^^^^^^^ ........

Only 2 hits out of 13 and not very representative. A high chunk count is needed: For 100M documents, 100K chunks worked fairly well. The law of large numbers helps a lot, but in the case of document clusters (a group of very similar documents indexed at the same time) we still need both a lot of chunks and a high sampling percentage to have a high chance of hitting them. This way of sampling is prone to completely missing or over-representing clusters.

Current solution: Sample the hits

Remember that iterating over the result bitmap itself is relatively fast. Instead of processing chunks of the bitmap and skipping between them, we iterate over all the hits and only update counts for some of them.

If the sampling rate is 50%, every second set bit (marked with ^ below) would be used as the sample:

50% sampling: 00001100 01111110 00000000 01010111
sampled:      ....^... .^.^.^.. ........ .^...^.^

If the sampling rate is 33%, every third set bit would be used:

33% sampling: 00001100 01111110 00000000 01010111
sampled:      ....^... ..^..^.. ........ ...^...^

This way of sampling is a bit slower than sampling on the full document bitmap as all bits must be visited, but it means that the distribution of the sampling points is as fine-grained as possible. It turns out that the better distribution gives better results, which means that the size of the sample can be lowered. Lower sample rate = higher speed.
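A sketch of the hit-based sampling, reusing the hypothetical ordinals lookup from the earlier sketch: every bit is still visited, but the expensive ordinal lookup and counter update only happen for a fraction of the hits.

import java.util.BitSet;

// Sampled counting: rate == 2 corresponds to the 50% example above,
// rate == 3 to the 33% example.
static void countSampled(BitSet hits, int rate, int[][] ordinals, int[] counters) {
  int seen = 0;
  for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
    if (seen++ % rate == 0) {
      for (int ord : ordinals[doc]) {
        counters[ord]++;
      }
    }
  }
}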

Testing validity

A single shard from the Net Archive Search was used for testing. The shard was 900GB with 250M documents. Faceting was performed on the field links, which contains all outgoing links from indexed webpages. There are 600M unique values in that field and each document in the index contains an average of 25 links. For a full search on *:* that means 6 billion updates of the counter structure.

For this test, we look for the top-25 links. To get the baseline, a full facet count was issued for the top-50 links for a set of queries. A heuristic facet call was issued for the same queries, also for the top-50. The number of terms until the first discrepancy was counted for all the pairs. The ones with a count beneath 25 were considered faulty. The reason for the over-provisioning was to raise the probability of correct results, which of course comes with a performance penalty.

The sampling size was set to 1/1000 of the number of documents, or roughly 200K hits. Only result set sizes above 1M are relevant for validity, as those below take roughly the same time to calculate with and without sampling.

Heuristic validity for top 25/50

While the result looks messy, the number of faulty results was only 6 out of 116 for result set sizes above 1M. For the other 110 searches, the top-25 terms were correct. Raising the over-provisioning to top-100 imposes a larger performance hit, but reduces the number of faulty results to 0 for this test.

Heuristic validity for top 25/100

Testing performance

The response times for full count faceting and heuristic faceting on the links field with over-provisioning of 50 are as follows:

Heuristic speed for top 25/50

Switching from linear to logarithmic plotting for the y-axis makes the difference clear:

Heuristic speed for top 25/50, logarithmic y-axis

It can be seen that full counting rises linearly with result size, while sampling time is near-constant. This makes sense, as the sampling was done by updating counts for a fixed amount of documents. Other strategies, such as making the sampling rate a fraction of the result size, should be explored further, but as the validity plot shows, the fixed strategy works quite well.

The performance chart for over-provisioning of 100 looks very much like the one for 50, only with slightly higher response times for sampling. As the number of non-valid results is markedly lower for an over-provisioning of 100, this seems like the best speed/validity trade-off for our concrete setup.

Heuristic speed for top 25/100, logarithmic y-axis


Heuristic faceting with sampling on hits gives a high probability of correct results. The speed-up relative to full facet counting rises with result set size, as sampling has near-constant response times. Using over-provisioning allows for fine-grained tweaking between performance and the chance of correct results. Heuristic faceting is expected to be the default for interactive use with the links field. Viability of heuristic faceting for smaller fields is currently being investigated.

As always, there is full source code and a drop-in sparse faceting Solr 4.10 WAR at GitHub.

Dubious guesses, counted correctly

June 19, 2015

We do have a bit of a performance challenge with heavy faceting on large result sets in our Solr based Net Archive Search. The usual query speed is < 2 seconds, but if the user requests aggregations based on large result sets, such as all resources from a whole year, processing time jumps to minutes. To get an idea of how bad it is, here’s a chart of response times when faceting on a field with 600M unique values.

Faceting performance for field links with 600M unique values on a 900GB / 250M document index

Yes, the 80M hits query does take 16 minutes! As outlined in Heuristically correct top-X facets, it seems possible to use sampling to determine the top-X terms of the facet result and then fine-count only those terms. The first version of heuristically correct top-X facets has now been implemented (download the latest sparse faceting WAR to try it out), so it is time for an evaluation.

Three facet fields

For this small scale evaluation we use just a single 900GB shard with 250M documents, generated from harvested web resources. The three fields of interest are

  • domain, with 1 value/document and 1.1M unique values. Of these, 230K are only referenced by a single document. The most popular domains are referenced by 4M documents.
    Intuitively, domain seems fitting for sampling, with relatively few unique values, not too many single instance values and a high amount of popular domains.
  • url, with 1 value/document and 200M unique values. Of these, 185M are only referenced by a single document. The most popular urls are referenced by 65K documents.
    Contrary to domain, url seems more problematic to sample, with relatively many unique values, a great deal of single value instances and not very many popular urls.
  • links, with 10 values/document and 600M unique values. Of these, 420M are only referenced by a single document. The most popular links are referenced by 8M documents.
    In between domain and url is links, with relatively many unique values, but with only 10% of the 6 billion references going to single instance values and a high amount of popular links.


Caveat lector: This test should not be seen as authoritative, but rather as an indicator of trade-offs. It was done on a heavily loaded machine, so real-world performance should be better. However, the relative differences in speed should not be too far off (tested ad hoc at a time where the machine was not under heavy load).

11 very popular terms were extracted from the general text field and used as query terms, to simulate queries that are heavy in terms of the number of hits.

Term         Hits
og            77M
a             54M
10            50M
to            45M
ikke          40M
søg           33M
denne         25M
også          22M
under         18M
telefon       10M
indkøbskurv    7M

The top-25 terms were requested with facet.limit=25 and sampling was performed by using only part of the result set to update the facet counters. The sampling was controlled by 2 options (a combined sample request is shown after the list):

  • fraction (facet.sparse.heuristic.fraction=0.xx): How much of the total number of documents to sample. If fraction is 0.01, this means 1% or 0.01*250M = 2.5M documents. Note that these are all the documents, not only the ones in the result set!
  • chunks (facet.sparse.heuristic.sample.chunks=xxx): How many chunks to split the sampling in. If chunks is 10 and fraction is 0.01, the 2.5M sample documents will be checked by visiting the first 250K, skipping ahead, visiting another 250K etc. 10 times.
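Combined, a sample request could look like the following (the query term and facet field are just examples; the sparse parameters are the two options described above):

q=og&facet=true&facet.field=domain&facet.limit=25&facet.sparse.heuristic.fraction=0.01&facet.sparse.heuristic.sample.chunks=100000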

To get a measure of validity, a full count was performed for each facet with each search term. The results from the sampled runs were then compared to the full count by counting the number of correct terms from the top until the first error. Example: If the fully counted result is

  • a (100)
  • b (80)
  • c (50)
  • d (20)
  • e (20)

and the sample result is

  • a (100)
  • b (80)
  • c (50)
  • e (20)
  • f (18)

then the score would be 3. Note that the counts themselves are guaranteed to be correct. Only the terms are unreliable.
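The scoring is simple enough to express directly (a sketch; both lists are assumed ordered by descending count):

import java.util.List;

// Counts the number of matching terms from the top until the first
// discrepancy between the fully counted and the sampled result.
static int validityScore(List<String> fullTerms, List<String> sampledTerms) {
  int score = 0;
  while (score < fullTerms.size() && score < sampledTerms.size()
      && fullTerms.get(score).equals(sampledTerms.get(score))) {
    score++;
  }
  return score;
}

For the two lists above, the method returns 3.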


Facet field domain (1.1M unique values, 1 value/document)

First we sample using half of all documents (sample fraction 0.5), for varying amounts of chunks: c10 means 10 chunks, c10K means 10000 chunks. As facet.limit=25, the highest possible validity score is 25. Scores below 10 are poor and scores from 10-19 are mediocre.

Term         Hits  c10  c100  c1K  c10K  c100K
og            77M   19     9   25    25     25
a             54M   20     4   25    25     25
10            50M   20     5   25    25     25
to            45M   18    14   25    25     25
ikke          40M   16    15   25    25     25
søg           33M   16    15   23    25     24
denne         25M   17    18   23    24     25
også          22M   17    12   25    25     25
under         18M    4    12   23    23     25
telefon       10M   16     8   23    23     25
indkøbskurv    7M    8     2   16    21     25

Heuristic faceting for field domain with 50% sampling

Looking at this, it seems that c1K (1000 chunks) is good, except for the last term indkøbskurv, and really good for 10000 chunks. Alas, sampling with half the data is nearly the full work.

Looking at a sample fraction of 0.01 (1% of total size) is more interesting:

Term         Hits  c10  c100  c1K  c10K  c100K
og            77M    4     9   24    23     25
a             54M    4     4   23    24     25
10            50M    3     4   23    25     20
to            45M    0     0   24    24     24
ikke          40M    5    13   25    24     25
søg           33M    0     0   20    21     25
denne         25M    0     0   18    22     23
også          22M    6    12   23    25     25
under         18M    3     4   22    23     24
telefon       10M    5     7   12    12     25
indkøbskurv    7M    0     1    4    16     23

Heuristic faceting for field domain with 1% sampling

Here it seems that c10K is good and c100K is really good, using only 1% of the documents for sampling. If we were only interested in the top-10 terms, the over-provisioning call for top-25 would yield valid results for both c10K and c100K. If we want all top-25 terms to be correct, over-provisioning to top-50 or something like that should work.

The results are viable, even with a 1% sample size, provided that the number of chunks is high enough. So how fast is it to perform heuristic faceting, as opposed to full counting?

Faceting performance for field domain with 1% sampling

The blue line represents the standard full counting faceting, no sampling. It grows linearly with result size, with the worst case being 14 seconds. Sample based counting (all the other lines) also grows linearly, but with a worst case of 2 seconds. Furthermore, the speed difference between the numbers of chunks is so small that choosing 100K chunks, and thereby the best chance of getting viable results, is not a problem.

In short: Heuristic faceting on the domain field for large result sets is 4-7 times faster than standard counting, with a high degree of viability.

Facet field url (200M unique values, 1 value/document)

Heuristic faceting for field url with 1% sampling

Faceting performance for field url with 1% sampling

The speed-up is a modest 2-4 times for the url field, but worse, the viability is low, even when using 100,000 chunks. Raising the minimum result set size for heuristic faceting to 20M hits could conceivably work, but the url field still seems a poor fit. Considering that the url field does not have very many recurring values, this is not too surprising.

Facet field links (600M unique values, 10 values/document)

Heuristic faceting for field links with 1% sampling

Faceting performance for field links with 1% sampling

The heuristic viability of the links field is just as good as with the domain field: As long as the number of chunks is above 1000, sampling with 1% yields great results. The performance is 10-30 times that of standard counting. This means that the links field is an exceptionally good fit for heuristic faceting.

Removing the full count from the chart above reveals that the worst case in this setup is 22 seconds. Not bad for a result set of 77M documents, each with 10 references to any of 600M values:

Faceting performance for field links with 1% sampling, no baseline shown


Heuristically correct faceting for large result sets allows us to reduce the runtime of our heaviest queries by an order of magnitude. Viability and relative performance are heavily dictated by the term count distribution for the concrete fields (the url field was a poor fit) and by cardinality. Anyone considering heuristic faceting should test viability on their corpus before enabling it.

Word of caution

Heuristic faceting as part of Solr sparse faceting is very new and not tested in production. It is also somewhat rough around the edges; simple features such as automatic over-provisioning have not been implemented yet.

Net Archive Search building blocks

June 9, 2015

An extremely webarchive-discovery and Statsbiblioteket centric description of some of the technical possibilities with Net Archive Search. This could be considered internal documentation, but we like to share.

There are currently 2 generations of indexes at Statsbiblioteket: v1 (22TB) & v2 (8TB). External access is currently to v1. As the features of v2 are a superset of v1's, v1 will be disabled as soon as v2 catches up in terms of the amount of indexed material. ETA: July or August 2015.

Aggregation fields

The following fields have the DocValues option enabled, meaning that they can be exported efficiently as well as used for sorting, grouping and faceting at a low memory cost.


  • url_norm (v2): The resource URL, lightly normalised (lowercased, www removed, etc.) the same way as links. The 2 fields together can be used to generate graphs of resource interconnections. This field is also recommended for grouping, e.g. to get unique URLs in their earliest versions.
  • links (v2): Outgoing links from web pages, normalised the same way as url_norm. As cardinality is non-trivial (~600M values per TB of index), it is strongly recommended to enable low-memory mode if this field is used for faceting.
  • host (v1, v2): The host part of the URL for the resource.
  • domain (v1, v2): The domain part of the URL for the resource.
  • links_hosts (v2): The host part of all outgoing links from web pages.
  • links_domain (v2): The domain part of all outgoing links from web pages.
  • links_public_suffixes (v2): The suffix of all outgoing links from web pages. Samples: dk, nu

Time and harvest-data

  • last_modified (v2): As reported by the web server. Note that this is not a reliable value, as servers are not reliable.
  • last_modified_year (v2): The year part of last_modified.
  • crawl_date (v2): The full and authoritative timestamp for when the resource was collected. For coarser grained (and faster) statistics, consider using crawl_date_year.
  • crawl_date_year (v2): The year part of the crawl_date.
  • publication_date (v2): The publication date as reported by the web server. Not authoritative.
  • publication_date_year (v2): The year part of the publication_date.
  • arc_orig (v2): Where the ARC file originated from. Possible values are sb & kb. If used for faceting, it is recommended to use enum: facet=true&facet.field=arc_orig&f.arc_orig.facet.method=enum.
  • arc_job (v2): The ID of the harvest job as used by the Danish Net Archive when performing a crawl.


  • url (v1, v2): The resource URL as requested by the harvester. Consider using url_norm instead to reduce the amount of noise.
  • author (v2): As stated in PDFs, Word documents, presentations etc. Unfortunately the content is highly unreliable and contains a high degree of garbage.
  • content_type (v2): The MIME type returned by the web server.
  • content_length (v2): The size of the raw resource, measured in bytes. Consider using this with Solr range faceting.
  • content_encoding (v2): The character set for textual resources.
  • content_language (v2): Auto-detected language of the resource. Unreliable for small text samples, but quite accurate on larger ones.
  • content_type_norm (v1, v2): Highly normalised content type. Possible values are: html, text, pdf, other, image, audio, excel, powerpoint, video & word.
  • content_type_version, content_type_full, content_type_tika, content_type_droid, content_type_served, content_type_ext (v2): Variations of the content type, resolved from servers and third party tools.
  • server (v2): The web server, as self-reported.
  • generator (v2): The web page generator.
  • elements_used (v2): All HTML elements used on web pages.

Search & stored fields

It is not recommended to sort, group or facet on the following fields. If it is relevant to do so, DocValues can be enabled for v3.

  • id (v1, v2): The unique Solr ID of the resource. Used together with highlighting or for graph exploration.
  • source_files_s (v2): The name of the ARC file and the offset of the resource. This can be used as a CDX-lookup replacement by limiting the fields returned.
  • arc_harvest (v2): The harvest ID from the crawler.
  • hash (v2): SHA1 hash of the content. Can be used for finding exact duplicates.
  • ssdeep_hash_bs_3, ssdeep_hash_bs_6, ssdeep_hash_ngram_bs_3, ssdeep_hash_ngram_bs_6 (v2): Fuzzy hashes. Can be used for finding near-duplicates.
  • content (v1, v2): Available as content_text in v1. The full extracted text of the resource. Used for text mining or highlighting.



  1. Get core data for a single page:
    q=url:"" crawl_year:2010 content_type_norm:html&

    gives us

    "docs": [
            "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
            "url": "",
            "crawl_date": "2010-06-18T14:33:29Z"
  2. Request the resource 86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735 from storage, extract all links to images, css etc. The result is a list of URLs like
  3. Make a new request for the URLs from #2, grouped by unique URL, sorted by temporal distance to the originating page:
    q=url:("" OR 
    group.sort=abs(sub(ms(2010-06-18T14:33:29Z), crawl_date)) asc

    gives us

    "groups": [
              "groupValue": "",
              "doclist": {
                "numFound": 331,
                "start": 0,
                "docs": [
                    "source_file_s": "87154-32-20100624134901-00003-sb-prod-har-005.arc@7259371",
                    "url": "",
                    "crawl_date": "2010-06-24T13:51:10Z"
              "groupValue": "",
              "doclist": {
                "numFound": 796,
                "start": 0,
                "docs": [
                    "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
                    "url": "",
                    "crawl_date": "2010-06-18T14:33:29Z"

Et voilà: Reconstruction of a webpage from a given point in time, using only search and access to the (W)ARC-files.

Heuristically correct top-X facets

May 30, 2015

For most searches in our Net Archive, we have acceptable response times, due to the use of sparse faceting with Solr. Unfortunately, as well as expectedly, some of the searches are slow. Response times in minutes slow, if we’re talking worst case. It is tied to the number of hits: Getting the top-25 most popular links from pages about hedgehogs takes a few hundred milliseconds. Getting the top-25 links from all pages from 2010 takes minutes. Visualised, the response times look like this:

Massive speed drop-off for higher result sets

Everything beyond 1M hits is slow, everything beyond 10M hits is coffee time. Okay for batch analysis, but we’re aiming for interactive use.

Get the probably correct top-X terms by sampling

Getting the top-X terms for a given facet can be achieved by sampling: Instead of processing all hits in the result set, some of them are skipped. The result set iterator conveniently provides an efficient advance-method, making this very easy. As we will only use sampling with larger result sets, there should be enough data to be quite sure that the top-25 terms are the correct ones, although their counts are somewhat off.

This of course all depends on how high X is in top-X, the concrete corpus etc. The biggest danger is clusters of content in the corpus, which might be skipped. Maybe the skipping could be done in small steps? Process 100 documents, skip 500, process the next 100…? Tests will have to be made.
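A sketch of such step-wise sampling on top of Lucene's DocIdSetIterator (nextDoc/advance are the real API; CounterUpdater is a hypothetical stand-in for the ordinal lookup and counter update, and the gap is here measured in docID space rather than in hits):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

interface CounterUpdater {
  void update(int docID);
}

// Process 'runSize' hits, then advance past a gap of 'gap' docIDs, repeat.
static void sampleInSteps(DocIdSetIterator hits, int runSize, int gap,
                          CounterUpdater counters) throws IOException {
  int doc = hits.nextDoc();
  while (doc != DocIdSetIterator.NO_MORE_DOCS) {
    for (int i = 0; i < runSize && doc != DocIdSetIterator.NO_MORE_DOCS; i++) {
      counters.update(doc);           // count this hit
      doc = hits.nextDoc();
    }
    if (doc != DocIdSetIterator.NO_MORE_DOCS) {
      doc = hits.advance(doc + gap);  // skip ahead; advance() is efficient
    }
  }
}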

Fine count the top-X terms

With the correct terms being isolated, precisely those terms can be fine counted. This is nearly the same as vanilla distributed faceting, with the exception that all shards must fine count all the top-X terms, instead of only the terms they had not already processed earlier.

Of course the fine counting could be skipped altogether, which would be faster and potentially very usable for interactive exploratory use, where the exact counts do not really matter.

But there’s no guarantee?

No. Do remember that vanilla Solr distributed faceting is also a best-effort, with the same guarantee as above: The terms are not guaranteed to be the correct ones, but their counts are.

Seems simple enough

Ticket #38 for sparse faceting has been opened and we could really use this in the Danish Net Archive Search. No promises though.

Note 2015-05-30

Knut Anton Bøckman mentioned on Twitter that Primo has a faceting mechanism that looks similar to my proposal. It seems that Primo uses the top-200 hits to select the facets (or rather the terms?), then does a fine-count on those.

It might work well to base the term selection on the top hits, rather than sampling randomly through all the hits, but I am afraid that 200 is so small a sample that some of the terms will differ from the right ones. I understand the need for a small number though: Getting the top-million hits or just top-hundred-thousand is costly.

Alternative counter tracking

April 30, 2015

Warning: Bit-fiddling ahead.

The initial driver for implementing Sparse Faceting was to have extraction-time scale with the result set size, instead of with the total number of unique values in the index. From a performance point of view, this works very well in the current implementation: A side car structure keeps track of which counters have been updated. It is a simple array of pointers (integers really) and sampling indicates that a size of about 8% of the total amount of counters works okay.

So 8%, expressed as Java integers? That is 4*0.08*#values bytes.

Less abstractly, we have a use case with 600M values/shard. Rounding each counter up to the nearest power of 2, the theoretical lower bound of memory needed for representing the counters is all the bits that can change during updates. For a sample shard that is about 140MB.

N-plane counters for the same shard take up 144MB + 157MB, where the first 144MB are shared between all counter instances. So they are pretty close to the theoretical lower bound – at least the extra instances are. Note: They can be forced to be practically at the lower bound, but that impacts performance.

Back to the tracker, because n-plane needs one to scale its performance with result set size. Tracker size was 4*0.08*#values bytes, which for the test shard is 192MB. For the extra counter instances, the tracker ends up using more memory than the counters it tracks.

What we need is something better to efficiently keep track of the touched values, where efficient is measured both in time and space. In fancy talk, that is a succinct data structure.

Implicit markers

With n-plane counters, each bit of a given counter is stored on a separate plane (at least conceptually): If a counter has a max of 5, it needs 3 bits to store the count and is thus spread across the lower 3 planes.

Suppose we treated the lowest plane separately from the rest of the planes: Instead of supplying the bit at position 0 of the usual binary representation, it simply states whether 1 should be added to the number or not. Confused? Let’s represent the numbers from 0 to 8 in standard binary as well as our new system:

Decimal  0  1   2    3    4     5     6     7     8
Binary   0  1  10   11  100   101   110   111  1000  
Binary+  0  1  11  101  111  1001  1011  1101  1111

Note that the representation is not quite as compact as true binary; e.g. the tipping point for going from 3 to 4 planes is from 7 to 8 for binary, but from 4 to 5 for binary+. Loosely estimated, it will cost 1 extra bit/value to use the new representation. However, there is zero extra cost for counters with a maximum value of 1. It turns out that most of the outgoing links in our net archive corpus are unique, so about 400M of the 600M values satisfy this. The extra memory cost of the extended binary will thus be 200M / 8 bits/byte = 25MB.
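The representation in the table can be captured in two small helper functions (a sketch operating on whole numbers; the actual n-plane code works on separate bit planes):

// For v >= 1 the pattern is ((v - 1) << 1) | 1: the lowest plane is the
// "not zero" flag and the remaining planes hold v - 1 in plain binary.
static long toBinaryPlus(long value) {
  return value == 0 ? 0 : ((value - 1) << 1) | 1;
}

static long fromBinaryPlus(long representation) {
  return representation == 0 ? 0 : (representation >>> 1) + 1;
}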

Having a not-zero flag solves a performance issue with the current n-plane counters, but it also opens up for…

Bitmap based sparse tracking

The goal here is to quickly track & locate the counters that have a non-zero value. Underneath the hood, a plane in an n-plane counter is just an array of longs. With implicit markers, checking whether there are one or more counters with non-zero values in a group of 64 counters is a simple matter of checking whether a long is equal to 0.

With our 600M counters, that means about 10M checks, reading 75MB data sequentially from heap. Not bad, but still a substantial fixed overhead.

So what if we create a bitmap with a single bit representing each long in the backing structure of the first plane: If the backing long has 0 updated counters, the bitmap entry is 0; for everything else it is 1. That would require #counters/64 bits, or 1.2MB. Locating the non-zero counters then becomes a matter of reading about 150K longs sequentially and checking for 0. Now the fixed overhead seems manageable.

The bitmap would of course have to be maintained during updates, introducing an extra get-set to the act of incrementing a counter. However, the old tracker used a single set (and simpler logic), so the difference is not that great.

New idea vs. tried and true

Memory wise the new tracker would need about 26.2MB for 600M counters, where the old one needs 192MB. An obvious improvement.

For very low hit counts, the new tracker would lose: A fixed overhead of 150K checks is still more work than loading a few values directly from an array.

The complexity of the new tracker is far higher, but it might turn out to be fastest in most cases, as iteration of the tracked counters is guaranteed to be in sequential order, where the existing tracker iterates in random order.

Measuring N-plane time & space

April 30, 2015

This article explores the boundaries of the experimental sparse faceting code, both in terms of processing time and in terms of space requirements. The code base has just been updated and the new features are available as a Solr 4.8 compatible binary download, as well as source.

Faceting overview (you can skip this)

The core performance-critical part of sparse faceting as well as vanilla String fc-faceting for Solr can be divided in three phases:

  1. Count the occurrences of the terms in the facet field
  2. Find the X highest counters
  3. Resolve the Strings for the counters

Sparse faceting started with the idea of improving on #2 by only considering the counters that were actually touched by #1. Later on it improved on #3 when running distributed in SolrCloud. Now #1 is being handled by introducing multi-threaded counting.

Memory-wise there are three main faceting structures:

  1. A map from documents to term ordinals
  2. A counter structure, with 1 counter per unique term
  3. Structures for resolving ordinals to Strings

Memory overhead is very dependent on how the fields are indexed in Solr. The current recommendation for faceting is to index in a StrField (which stores the value verbatim) as well as enabling DocValues, which lowers memory requirements considerably. There is still a base cost of using DocValues (personal rule of thumb: 1 byte/String) with the current default DocValuesFormat, but that is allegedly hard to avoid without compromising performance significantly.

#1: In the usual case, a Solr index consists of multiple segments. In order to do faceting across these segments, a mapping from segment ordinals to global ordinals is created. This scales linearly with the number of terms in all the segments. In the special case of fully optimized shards, there is no need for this mapping and thus the memory overhead of #1 will be zero. The Danish Net Archive Search uses fully optimized shards.

#2: Vanilla Solr uses a simple int[]-array to hold the counters: Great for low- to medium-cardinality faceting, but quite costly (2.4GB per concurrent call) if you want to facet on 600M unique values. With n-plane counters for sparse faceting, the density of the counting structures is twice the theoretical lower limit for a single counter (pedantic note: It really is a bit more than twice the theoretical lower limit, as we round up to nearest power of two – when somebody implements an efficient arithmetic coding counter, give me a shout). For the shards in the Danish Net Archive, this is about 300MB. Extra counter instances, used for concurrent searches, takes up only the theoretical lower limit, or about 150MB.
To get fast calculation of the top-X counters after the counting step, an extra structure is needed, which brings the total overhead up to about 3-4 times the theoretical lower limit. This can be disabled, but with cardinality in the hundreds of millions, this will impose a very heavy performance penalty.

#3: Having paid the initial cost of using DocValues, the resolving of ordinals to Strings is practically free in terms of memory. With non-DocValued fields, some of the Strings are stored in memory to ensure performant lookups.

Multi threaded faceting

Threading aims to lower latency (response time). It only works when there is a surplus of CPU power, so the typical scenario is setups with heavy but infrequent searches. Infrequent just means that multiple concurrent searches are rare. Vanilla Solr supports multi threaded faceting in two ways:

  1. Multiple facet fields can be processed in parallel.
  2. Using facet.method=fcs, multiple segments can be processed in parallel. However, this comes with a non-trivial merge cost that involves resolving of extra Strings.

Recently in-segment threading was added to sparse faceting. Technically this also covers #2 above, with the added benefit of not needing the merge step.

The tricky part of in-segment threaded counting was to get fast thread-safe updates of the packed counter structures, but for once the amount of counters helps: The counter updates are lock-free and opportunistic, using AtomicLongArray’s compareAndSet, which is nearly as fast as plain array updates. In the rare case of collisions, the thread waits the lowest amount of time possible before trying again.
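A sketch of that opportunistic pattern; withIncrementedCounter is a hypothetical placeholder for the bit-fiddling that increments a single packed counter inside the 64-bit word:

import java.util.concurrent.atomic.AtomicLongArray;

static void increment(AtomicLongArray words, int wordIndex) {
  while (true) {
    long expected = words.get(wordIndex);
    long updated = withIncrementedCounter(expected);
    if (words.compareAndSet(wordIndex, expected, updated)) {
      return;  // the common, collision-free case
    }
    // rare collision with another thread: loop and retry with fresh state
  }
}

static long withIncrementedCounter(long word) {
  return word + 1; // placeholder: real code updates a sub-field of the word
}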


The two term occurrence counters relevant to this experiment are:

  1. n-plane, which is small but uses complex logic and requires multiple random memory accesses (an average of 2 reads + 2 writes) when a counter is incremented. Its size, including the sparse structure, is 500MB for this experiment.
  2. packed, which is a variant of Packed64Singleblock using an AtomicLongArray as backing store. It has relatively simple logic and requires only 1 read and 1 write when a counter is incremented. It saves a bit of space compared to the 2.4GB for vanilla Solr int[], but is three times larger (1.5GB) than n-plane.

Test setup

As we have recently started re-indexing everything in the Danish Net Archive (and since yours truly was a bit too quick on the fingers and accidentally deleted a couple of terabytes of perfectly fine shards), we only have a small sample corpus to test on. Hopefully it will be somewhat representative.

Recap: The Net Archive Search at Statsbiblioteket uses fully optimized shards of 900GB / 250M documents. One of the fields is outgoing links from webpages. Each shard has 600M unique values in the links-field, with 6 billion references from documents to values in that field.

Solr needs 2.2GB of heap just to open 1 such shard and perform a non-faceted search. With n-plane faceting, the minimum heap requirement rises to 3.3GB. In production we use a max heap size of 8GB as we have multiple facet fields and want to have wriggle room.

Due to the aforementioned circumstances, we currently only have 4 such shards in our SolrCloud to test on. They are handled by a 16 core Xeon machine with 256GB of RAM, so the hardware power vs. index size is very favourable, compared to the expected full setup with 23 shards.

Testing was done with random words from a Danish dictionary, one request at a time. The amount of threads used for counting was varied between 1, 2 & 4.



4 shards of 900GB / 250M docs with 600M unique values in the facet field, tested with 1, 2 & 4 counting threads

The same graph, with 80 seconds as max for the y-axis:

4 shards of 900GB / 250M docs with 600M unique values in the facet field, tested with 1, 2 & 4 counting threads

In the legends, _t1 means 1 thread, _t2 means 2 threads and so on. Some observations:

  1. There is something strange going on with packed in the very low end (1-1000 hits): Responses should be near-instantaneous in that area. This should be investigated further.
  2. packed is markedly faster relative to n-plane, although the overall speed means that the absolute difference is only visible from 100K hits and up.
  3. Threading gives a significant boost from 10M hits and up.

Note that the graphs only cover hit counts up to 100M, although there are 1000M documents in the index. Searching for random words does not really give that many hits: Besides the catch-all term http, the second most popular term og (Danish for and) occurs in only ⅓ of the corpus.

Getting above 100M hits out of the 1000M documents in the current corpus is certainly possible by issuing or-queries, range queries or similar, but it is not something that is likely to happen very often during interactive use.

Nevertheless, it must be stressed that worst-case performance is poor. Measuring the catch-all *:* query (1,066,091,283 hits), the timings were:

Threads  N-plane (ms)  Packed (ms)
1             587,202     310,249
2             321,569     210,712
4             201,532     123,263

For interactive use, response times measured in minutes are unacceptable. Besides adding copious amounts of hardware, some ideas for handling the problem are

  • Start by doing a fast non-facet search to get the hit count. If that number indicates a long faceting time, show a warning to the user. Potentially this could be a rough countdown.
  • Add sampling functionality, where the sample count only uses the first 1M hits. Note that the hits are not processed in order of relevance, so the sample will be made up of hits from the beginning of the overall index structure (most likely the oldest records). A more advanced sampler could advance through the result set to distribute the sampling points better; a sketch of that idea follows below.
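
A minimal sketch of the distributed-sampling idea; the class and its parameters are hypothetical, not part of Sparse Faceting:

// Hypothetical sketch: only every step'th hit is counted, up to a fixed
// budget, spreading the sampling points across the result set. The counts
// can afterwards be scaled by step to estimate the full distribution.
public class SamplingCounters {
  private final long[] counters; // term ordinal -> sampled count
  private final int step;        // distance between sampling points
  private final long budget;     // maximum number of sampled hits
  private long seen = 0;
  private long sampled = 0;

  public SamplingCounters(int uniqueTerms, int step, long budget) {
    this.counters = new long[uniqueTerms];
    this.step = step;
    this.budget = budget;
  }

  public void collect(int termOrdinal) {
    if (sampled < budget && seen++ % step == 0) {
      counters[termOrdinal]++;
      sampled++;
    }
  }
}

The simple first-1M-hits variant corresponds to step = 1 and budget = 1,000,000.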

The small number of shards means that there is a lot of CPU power to spare on the test machine. As it gets filled, there will be less surplus power and the effect of threading is likely to diminish. Our setup might end up not getting any advantage out of threaded counting.

On the other hand, the low heap overhead of faceting with n-plane means that we will probably enable faceting on the 600M unique values field. It makes sense to pay the 1100MB heap overhead per shard to get fine-grained graph analysis, especially when the cost of additional counter instances is only about 350MB.

Facet filtering

April 10, 2015 by

In generation 2 of our net archive search we plan to experiment with real time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is regular expressions. The main problem with this approach is that the Strings for all the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.*, which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp-requirement are requested, chances are that millions of terms must be resolved to Strings and checked before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher chance of wasp-related links and would thus require far fewer resolutions.
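
A minimal sketch of the filtering principle, assuming candidate ordinals are resolved to Strings one at a time; the class and method names are illustrative, not the actual Sparse Faceting API:

import java.util.List;
import java.util.regex.Pattern;

// Sketch: a term is accepted if it matches all whitelist patterns and none
// of the blacklist patterns. Each check requires the candidate to be
// resolved from ordinal to String first, which is where the cost lies.
public class FacetTermFilter {
  private final List<Pattern> whitelist;
  private final List<Pattern> blacklist;

  public FacetTermFilter(List<Pattern> whitelist, List<Pattern> blacklist) {
    this.whitelist = whitelist;
    this.blacklist = blacklist;
  }

  public boolean accept(String term) {
    for (Pattern pattern : whitelist) {
      if (!pattern.matcher(term).matches()) {
        return false;
      }
    }
    for (Pattern pattern : blacklist) {
      if (pattern.matcher(term).matches()) {
        return false;
      }
    }
    return true;
  }
}

Candidates would then be resolved and checked in descending count order until facet.limit accepted terms have been found.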

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally: blacklist [^/]*example\.com
  3. The most popular images that pages on a given site link to internally: whitelist [^/]*example\.com/.*\.(gif|jpg|jpeg|png)$
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative of the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

[Graph] 900GB, 200M+ docs, 500M+ unique values, 5 billion+ references


  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.


Worst case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As processing time is tightly correlated with result set size, which is fast to calculate, adding a warning such as "This query is likely to take minutes to process" could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead of regular expressions was less than feared. Real-time graphs, or at least fill-the-coffee-cup-time graphs, seem doable, at the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes 2015-04-11

@maxxkrakoa noted: "@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each." He is right and it does help significantly for our 6-field faceting. For single-field faceting, support for real multi-threaded counting would be needed.

The simple way of doing multi-threaded counting is to update multiple copies of the counter structure and merge them at the end, as sketched below. For a 500M+ values field, that is likely to be prohibitive with regard to both memory and speed: The time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the "things to ponder when sleep is not coming" list.
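
For illustration, a bare-bones version of the copy-and-merge approach; with uniqueTerms in the hundreds of millions, the merge loop alone touches threads × uniqueTerms counters:

// Sketch: each thread counts into its own copy and the copies are summed
// at the end. Both the memory for the copies and the final merge become
// prohibitive when uniqueTerms is 500M+.
public class CopyMergeCounters {
  public static long[] merge(long[][] perThreadCounters, int uniqueTerms) {
    long[] merged = new long[uniqueTerms];
    for (long[] copy : perThreadCounters) {       // one copy per thread
      for (int ord = 0; ord < uniqueTerms; ord++) {
        merged[ord] += copy[ord];                 // threads * uniqueTerms additions
      }
    }
    return merged;
  }
}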

Additional notes 2015-04-13

It seems that Java 1.7’s AtomicLongArray performs very well: Quite comparable to plain long[] in the context of multi-millions of counters, where contention is low. This raises the probability of implementing true threaded faceting markedly.

