CDX musings

March 18, 2016 by

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry.

There is currently some debate on CDX, used by the Wayback Engine, Open Wayback and other web archive oriented tools, such as Warcbase. A CDX Server API is being formulated, as is the CDX format itself. Inspired by a post by Kristinn Sigurðsson, here comes another input.

CDX components, diagram by Ulrich Karstoft Have

CDX Server API

There is an ongoing use case-centric discussion of needed features for a CDX API. Compared to that, the CDX Server API – BETA seems a bit random. For example: a feature such as regexp-matching on URLs can be very heavy on the backend and opens up for easy denial of service (intentional as well as unintentional). It should only be in the API if it is of high priority. One way of handling all wishes is to define a core API with the bare necessities and add extra functionality as API extensions. What is needed and what is wanted?

The same weighing problem can be seen for the fields required for the server. Kristinn discusses CDX as a format and boils the core fields down to canonicalized URL, timestamp, original URL, digest, record type, WARC filename and WARC offset. Everything else is optional. This approach matches the core vs. extensions functionality division.

With optional features and optional fields, a CDX server should have some mechanism for stating what is possible.

URL canonicalization and digest

The essential lookup field is the canonicalised URL. Unfortunately that is also the least formalised, which is really bad from an interoperability point of view. When the CDX Server API is formalised, a strict schema for URL canonicalisation is needed.

Likewise, the digest format needs to be fixed, to allow for cross-institution lookups. As the digests do not need to be cryptographically secure, the algorithm chosen (hopefully) does not become obsolete with age.

It would be possible to allow for variations of both canonicalisation and digest, but in that case it would be as extensions rather than core.

CDX (external) format

CDX can be seen as a way of representing a corpus, as discussed at the RESAW workshop in December 2015.

  • From a shared external format perspective, tool-specific requirements such as sorting of entries or order of fields are secondary or even irrelevant.
  • Web archives tend to contain non-trivial amounts of entries, so the size of a CDX matters.

With this in mind, the bare minimum of fields would be something like original URL, timestamp and digest. The rest is a matter of expanding the data from the WARC files. On a more practical level, having the WARC filename and WARC offset is probably a good idea.

The thing not to require is the canonicalized URL: It is redundant, as it can be generated directly from the original URL, and it unnecessarily freezes the canonicalization method to the CDX format.

Allowing for optional extra fields after the core is again pragmatic. JSON is not the most compact format when dealing with tables of data, but it has the flexibility of offering entry-specific fields. CDXJ deals with this, although it does not specify the possible keys inside the JSON blocks, meaning that the full CDX file has to be iterated to get those keys. There is also a problem of simple key:value JSON entries, which can be generically processed, vs. complex nested JSON structures, which require implementation-specific code.

CDX (internal) format

Having a canonicalized and SURTed URL as the primary field and having the full CDX file sorted is an optimization towards specific tools. Kristinn touches lightly on the problem with this by suggesting that the record type might be better positioned as the secondary field (as opposed to timestamp) in the CDX format.

It follows easily that the optimal order or even representation of fields depends on tools as well as use case. But how the tools handle CDX data internally really does not matter, as long as they expose the CDX Server API correctly and allow for export in an external CDX format. The external format should not be dictated by internal use!

CDX (internal vs. external) format

CDX is not a real format yet, but tools do exist and they expect some loosely defined common representation to work together. As such it is worth considering if some of the traits of current CDXes should spill over to an external format. Primarily that would be the loosely defined canonicalized URL as first field and the sorted nature of the files. In practice that would mean a near doubling of file size due to the redundant URL representation.

CDX implementation

The sorted nature of current CDX files has two important implications: Calculating the intersections between two files is easy and lookup on primary key (canonicalised URL) can algorithmically be done in O(log n) time using binary search.
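
To make the O(log n) lookup concrete, here is a minimal sketch of a leftmost binary search over CDX lines held in memory, assuming the lines are sorted by their first (canonicalised URL) field; real implementations work on files or dedicated indexes rather than a String[]:

public class CdxLookup {
  /** Returns the index of the first line whose first field equals canonicalUrl, or -1 if absent. */
  static int findFirst(String[] sortedCdxLines, String canonicalUrl) {
    int lo = 0, hi = sortedCdxLines.length - 1, match = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      String key = sortedCdxLines[mid].split(" ", 2)[0]; // first field = canonicalised URL
      int cmp = key.compareTo(canonicalUrl);
      if (cmp < 0) {
        lo = mid + 1;
      } else {
        if (cmp == 0) {
          match = mid; // remember the hit, but keep searching left for the first entry
        }
        hi = mid - 1;
      }
    }
    return match;
  }
}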

In reality, simple binary search works poorly on large datasets, due to the lack of locality. This is worsened by slow storage types such as spinning drives and/or networked storage. There are a lot of tricks to remedy this, from building indexes to re-ordering the entries. The shared theme is that the current non-formalised CDX files are not directly usable: They require post-processing and extensions by the implementation.

The takeaway is that the value of a binary search optimized external CDX format diminishes as scale goes up and specialized implementations are created. Wholly different lookup technologies, such as general purpose databases or search engines, have zero need for the format-oriented optimizations.

Faster grouping, take 1

January 18, 2016 by

A failed attempt of speeding up grouping in Solr, with an idea for next attempt.

Grouping at a Statsbiblioteket project

We have 100M+ articles from 10M+ pages belonging to 700K editions of 170 newspapers in a single Solr shard. It can be accessed at Mediestream. If you speak Danish, try searching for “strudsehest”. Searches are at the article level, with the results sorted by score and grouped by edition, with a maximum of 10 articles / edition. Something like this:

q=strudsehest&group=true&group.field=edition&group.limit=10

This works well for most searches. But for the heavier ones, response times creep into seconds, sometimes exceeding the 10 second timeout we use. Not good. So what happens in a grouped search that is sorted by document score?

  1. The hits are calculated
  2. A priority queue is used to find the top-X groups with the highest scores
    1. For each hit, calculate its score
    2. If the score is > the lowest score in the queue, resolve the group value and update the priority queue
  3. For each of the top-X groups, a priority queue is created and filled with document IDs
    1. For each hit, calculate its score and resolve its group value (a BytesRef)
    2. If the group value matched one of the top-X groups, update that group’s queue
      1. Updating the queue might involve resolving multiple field values for the document, depending on in-group sorting
  4. Iterate the top-X groups and resolve the full documents

Observation 1: Hits are iterated twice. This is hard to avoid if we need more than 1 entry in each group. An alternative would be to keep track of all groups until all the hits have been iterated, but this would be extremely memory costly with high cardinality fields.

Observation 2: In step 3.1, score and group resolving is performed for all hits. It is possible to use the same logic as step 2.1, where the group is only resolved if the score is competitive.

Attempt 1: Delayed group resolving

The idea in observation 2 has been implemented as a kludge-hacky-proof-of-concept. Code is available at the group_4_10 branch at GitHub for those who like hurt.

When the hits are iterated the second time, all scores are resolved but only the group values for the documents with competitive scores are resolved. So how well does it work?

Lazy group value resolving for Solr

Observation: Optimized (aka lazy group value resolving) grouping is a bit slower than vanilla Solr grouping for some result sets, probably the ones where most of the group values have to be resolved. For other result sets there is a clear win.

It should be possible to optimize a bit more and bring the overhead of the worst-case optimized groupings down to near-zero. However, since there are so few best-case result sets and since the win is just about a third of the response time, I do not see this optimization attempt as being worth the effort.

Idea: A new level of lazy

IMPORTANT: I am a klutz and did not read the Solr code properly. Please skip ahead to the next section.

Going back to the algorithm for grouping we can see that “resolving the value” occurs multiple times. But what does it mean?

With DocValued terms, this is really a two-step process: The DocValue ordinal is requested for a given docID (blazingly fast) and the ordinal is used to retrieve the term (fast) in the form of a BytesRef. You already know where this is going, don’t you?

Millions of “fast” lookups accumulate to slow and we don’t really need the terms as such. At least not before we have to deliver the final result to the user. What we need is a unique identifier for each group value and the ordinal is exactly that.

But wait. Ordinals are not comparable across segments! We need to map the segment ordinals to a global structure. Luckily this is exactly what happens when doing faceting with facet.method=fc, so we can just scrounge the code from there.
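
As a rough sketch of the ordinal-based approach (written against a recent Lucene API rather than the actual Solr 4.10 grouping code, so treat names and signatures as an approximation), the per-document work could look like this:

import java.io.IOException;
import org.apache.lucene.index.OrdinalMap;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.LongValues;

public class GroupOrdinalResolver {
  /** Cheap: docID -> global ordinal, usable as the group identifier while collecting.
      docIDs must be visited in increasing order per segment. */
  static long globalOrd(SortedDocValues groupValues, OrdinalMap ordinalMap,
                        int segmentIndex, int docID) throws IOException {
    if (!groupValues.advanceExact(docID)) {
      return -1; // the document has no value in the group field
    }
    int segmentOrd = groupValues.ordValue();                    // blazingly fast
    LongValues segmentToGlobal = ordinalMap.getGlobalOrds(segmentIndex);
    return segmentToGlobal.get(segmentOrd);                     // fast
  }

  /** Comparatively expensive: only needed for the final top-X groups. */
  static BytesRef term(SortedDocValues groupValues, int segmentOrd) throws IOException {
    return groupValues.lookupOrd(segmentOrd);
  }
}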

With this in mind, the algorithm becomes

  1. The hits are calculated
  2. A priority queue is used to find the top-X groups with the highest scores
    1. For each hit, calculate its score
    2. If the score is > the lowest score in the queue, resolve the group value ordinal and update the priority queue
  3. For each of the top-X groups, a priority queue is created and filled with document IDs
    1. For each hit, resolve its group value segment-ordinal and convert that to global ordinal
    2. If the group value ordinal matches one of the top-X groups, update that group’s queue
      1. Updating the queue might involve resolving the document score or resolving multiple field value ordinals for the document, depending on in-group sorting
  4. Iterate the top-X groups and resolve the Terms from the group value ordinals as well as the full documents

Note how the logic is reversed for step 3.1, prioritizing value ordinal resolving over score calculation. Experience from the facet code suggests that ordinal lookup is faster than score calculation.

This idea has not been implemented yet. Hopefully it will be done Real Soon Now, but no promises.

Update 2016-01-28: Read the code properly

The Solr code does not resolve BytesRefs before it is necessary in step 3. On the contrary, it resolves the segment-specific group ordinals from the BytesRefs delivered by step 2 and uses fast lookup of those ordinals before calculating the score.

So while the first pass (determining the groups) seems optimizable by postponing BytesRef resolving, the second pass does not have any apparent performance failings.

The ones that got away

November 13, 2015 by

Two and a half ideas of improving Lucene/Solr performance that did not work out.

Track the result set bits

At the heart of Lucene (and consequently also Solr and ElasticSearch), there is a great amount of doc ID set handling. Abstractly speaking, a doc ID set is a collection of internal document IDs. If there are 10 documents in the index and a query matches the first and the fourth, the set would be [0, 3] as we count from 0. Every time there is filtering or faceting, bit sets are used.

SortedIntDocSet in Solr is one example of a document ID set. The IDs are stored after each other in an int[]. This takes up 32 bits per document in the set and is used for small result sets only.

FixedBitSet in Lucene is another example. It reserves 1 bit for each document in the full corpus. If the bit for a document is set, it means that the document is part of the set. All the bits are stored in a long[], which makes most operations quite fast: To calculate the union, the relevant code is

int pos = Math.min(numWords, otherNumWords);
while (--pos >= 0) {
  thisArr[pos] |= otherArr[pos];
}

FixedBitSet is very widely used. It is referenced in the Lucene/Solr code 451 times! As opposed to SortedIntDocSet et al, it can contain result sets of any size and reading or changing a specific bit is O(1). It does have a weakness though: Sparse sets.

Nearly all operations on FixedBitSet take time linear in the total number of documents in the index, not in the number of documents in the set. Calculating the union of [5, 87] and [1, 5, 1056978], when there are 20 million documents in the index, requires stepping through all the 300,000 longs (20M / 64) in each array to check for set bits.

Idea: Tracking changes to the structure works really well for sparse faceting, so why not for FixedBitSets? Add a new tracking bitset where each bit corresponds to a single long in the underlying FixedBitSet. If the tracking bit is set, the FixedBitSet long contains at least one set bit. To locate set bits, iterate the tracking bitset and look for longs != 0. This reduces the number of unnecessary lookups by a factor of 64. Of course a third bitset can be added on top of the two, reducing unnecessary lookups by a total factor of 4096.
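
A minimal sketch of the idea, using plain long[] arrays and hypothetical names (the actual implementation referenced below wraps FixedBitSet and covers more operations):

public class TrackedBits {
  final long[] bits;     // the underlying bitset, 64 documents per long
  final long[] tracker;  // 1 bit per long in bits: set if that long contains at least one set bit

  TrackedBits(int numBits) {
    bits = new long[(numBits + 63) >>> 6];
    tracker = new long[(bits.length + 63) >>> 6];
  }

  void set(int index) {
    int word = index >>> 6;
    bits[word] |= 1L << (index & 63);
    tracker[word >>> 6] |= 1L << (word & 63);  // mark the word as non-empty
  }

  /** this |= other, only touching the words that are non-empty in other. */
  void union(TrackedBits other) {
    for (int t = 0; t < other.tracker.length; t++) {
      long trackerBits = other.tracker[t];
      while (trackerBits != 0) {
        int word = (t << 6) + Long.numberOfTrailingZeros(trackerBits);
        bits[word] |= other.bits[word];
        tracker[word >>> 6] |= 1L << (word & 63);
        trackerBits &= trackerBits - 1;        // clear the lowest set tracker bit
      }
    }
  }
}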

An implementation is available at GitHub with a performance test. Instructions for running are in the TRACKED_README.txt file. Some reduced sample runs, where open is the vanilla FixedBitSet and tracked is the tracked version:

2K / 20M bits set (0.01% filled)
test         impl        ms
union        open      0.42
union        tracked   0.16
cardinality  open      0.34
cardinality  tracked   0.09
icount       open      0.59
icount       tracked   0.57
nextSetBit   open      0.31
nextSetBit   tracked   0.12
iterator     open      0.32
iterator     tracked   0.14

20K / 20M bits set (0.1% filled)
test         impl        ms
union        open      0.81
union        tracked   1.08
cardinality  open      0.34
cardinality  tracked   0.49
icount       open      0.58
icount       tracked   2.18
nextSetBit   open      0.58
nextSetBit   tracked   0.79
iterator     open      0.68
iterator     tracked   0.80

200K / 20M bits set (1.0% filled)
test         impl        ms
union        open      3.51
union        tracked   4.93
cardinality  open      0.43
cardinality  tracked   1.48
icount       open      0.59
icount       tracked  10.26
nextSetBit   open      3.21
nextSetBit   tracked   4.24
iterator     open      3.98
iterator     tracked   2.54

Technically the implementation delivers: Sparse result sets are processed faster. Unfortunately most operations require the sets to be less than 0.1% filled and worse, the overhead for non-sparse bitsets is prohibitive. There might be some special places where the tracked bitset can be used, but it is not viable as a general replacement for FixedBitSet.

Better scaling of top-X searches

This idea got pretty far, with JIRA issue LUCENE-6828 and an analysis of the speed-up.

Requesting the top-X relevance ranked results from Lucene/Solr is pretty much the core use case. The underlying implementation uses a binary heap based priority queue with Java Objects. Due to the cache-adverse nature of binary heaps and the large overhead of tiny Objects, this does not scale well: Do not ask for top-1M hits and really do not ask for top-100M.

Using an array based bit packed heap implementation yields superior space- and speed-performance for top-100K+ requests (3x+ for both memory and speed, plus a lot less GC activity; see the previously linked analysis for details), with the advantages growing with the number of concurrent requests. No win for small top-X requests, but no big loss either.

So success? Sorta, kinda. Turns out that there are generally two scenarios:

  1. Request a relatively modest top-X, such as top-10 or top-1000.
  2. Export the full result set, which can be millions or billions.

As the array-based heap has no advantage over the vanilla Solr heap for small result sets and as full exports are much better handled by Solr streaming export, there is near-zero need for the middle ground that the packed heap provides. It is a solution looking for a problem.

Cache-friendly priority queue

Since the vanilla Solr binary heap is cache-adverse, why not try and fix that problem? Yes, totally ignoring the “there is no problem as nobody requests large top-X results”-part here.

The idea is fairly simple: Binary heaps work well when the heap can be fully contained in the CPU cache, so making a heap of heaps means that most operations will be done on data that is already cached. There is a very nice write-up on this at Playful Programming: Cache optimizing a priority queue.

This was implemented in Java at the hitdocs-trunk branch. Performance tests are in org.apache.lucene.search.TestHitQueue. Measurements below are from testPQPerformanceAtomicHeapsLimited, where sentinel is vanilla Solr, packed is the array-based heap described above and bheap4 is the best performing heaps-within-heaps implementation. Lower percentages are best and the winning implementation for each test is marked with ■.

Threads       Top-X      Hits      Sentinel         Packed         BHeap4 
      1     1000000    100000   118.49 100%     11.06   9%■    20.47  17% 
      1     1000000   1000000  1356.56 100%    133.51  10%■   270.82  20% 
      1     1000000   2000000  2216.50 100%    566.67  26%    472.43  21%■
      1     1000000   5000000  3406.35 100%    827.23  24%    771.33  23%■
      1     1000000  10000000  4183.84 100%   1054.94  25%   1012.78  24%■
      4     1000000    100000   246.00 100%     22.38   9%■    41.73  17% 
      4     1000000   1000000  2154.64 100%    232.00  11%■   504.80  23% 
      4     1000000   2000000  3356.96 100%   1057.01  31%    772.90  23%■
      4     1000000   5000000  5022.50 100%   1484.44  30%   1227.93  24%■
      4     1000000  10000000  6503.33 100%   1910.10  29%   1891.28  29%■
     16     1000000    100000  1310.95 100%     68.29   5%■   119.87   9% 
     16     1000000   1000000  6731.49 100%    605.90   9%■  1405.25  21% 
     16     1000000   2000000 11882.59 100%   3713.82  31%   2744.36  23%■
     16     1000000   5000000 19589.75 100%   5389.25  28%   4447.21  23%■
     16     1000000  10000000 25522.07 100%   6921.96  27%   4843.61  19%■

Note: The cases where hits <= top-X (100K and 1M) are special as packed sorts result sets with hits <= top-X using the more cache-friendly merge sort. This explains why packed is twice as fast as bheap4 for these cases. A similar optimization could be implemented for bheap4.

Looking through rose-tinted glasses, bheap4 is a bit better than packed (and they are both vastly superior to sentinel). But the implementation of bheap4 is bordering on voodoo, with the multi-level offset calculations and all the bit-tricks needed to avoid doing expensive multiplications and modulo calculations.

Hopefully the speed of bheap can be improved further – the Playful Programming author Björn Fahller certainly got fine results with his C++ implementation. But until that happens, its minuscule gains are not worth the added complexity.

Speeding up core search

October 5, 2015 by

Issue a query, get back the top-X results. It does not get more basic than that with Solr. So it would be a great win if we could improve on that, right? Truth be told, the answer is still “maybe”, but read on for some thoughts, test code and experimental results.

Getting the top-X results

  • The X in top-X is the maximum result set size we want back from a request.
  • A result set is the documents actually returned – this can be the same or less than X.
  • hits is the number of documents that matched the query – this can be much higher than the size of the result set.

When a top-X search is performed, Solr uses a Priority Queue to keep track of the result set. Such a queue is often implemented as a heap and so it is with Solr: The relevant class in the source is HitQueue.

A ScoreDoc represents a single result document, in the form of a document ID and a score (and sometimes a shard index, but we ignore that for now). The HitQueue holds ScoreDocs and is a simple beast: Add a new ScoreDoc and it will check to see if it has a higher score than any of the existing ScoreDocs. If so, the new ScoreDoc is added. If the queue is full, the ScoreDoc with the lowest score is removed.

Potential problems with the current implementation

  • A simple binary heap has poor memory locality: This means that an insert into the heap often results in multiple memory accesses scattered in the heap structure. This is not a problem for a tiny heap as everything will be in the CPU L2 cache, but for larger heaps that means a lot of uncached memory accesses.
  • The HitQueue in Solr uses ScoreDoc objects as elements in the heap. When the HitQueue is created, it is filled with dummy-ScoreDocs, called sentinels. There are multiple problems with that.
    • Each ScoreDoc object takes up 28 bytes on the heap. If 30 concurrent requests ask for top-2500000, that takes up 2GB of RAM, no matter what the actual result size is. I mention those specific numbers as that was the case for a person on the Solr IRC channel.
    • Each ScoreDoc object is temporary, which means a lot of allocations and a lot of work for the garbage collector to clean up. In the previous case of the top-2500000, the JVM was doing stop-the-world garbage collections half the time.
    • The use of sentinel objects means that the heap is pre-filled with elements that will always be less than any real elements. Adding an element means sifting it from the top of the heap to the bottom. Not a problem for small heaps, but with larger ones it means unnecessary memory thrashing until the heap is full of new elements.
  • If the queue is not filled to its max size (the X in top-X), the ongoing maintenance of a heap structure is not the most efficient solution. For that, it would be better to simply collect all elements and do a merge-sort or similar, when they are to be delivered in order.

To sum it up: Solr’s queue in theory works well when requesting a small number of results, but poorly for large numbers. And indeed, that is part of the shared experience with Solr.

Lately CursorMark has been added to Lucene & Solr, which allows for efficient paging of results. One could say that this renders the optimization of requests for large result sets a moot point, but hey, it’s a challenge.

Experimental improvements

Switching to a better heap algorithm would be a worthy experiment, but that is hard so we will skip that for now. Instead we will do away with all those pesky ScoreDoc objects.

The two relevant parts of a ScoreDoc are docID (an integer) and score (a float). The sorting order of two ScoreDocs is determined primarily by the score and secondarily by the docID. There are a lot of such comparisons when using a heap.

In Java, the bits of a float can be extracted by Float.floatToRawIntBits, which produces an integer. This is a very fast operation. Interestingly enough, the sort order of two positive floats is preserved in the integer representations. This means that a ScoreDoc can be converted into an atomic long with Float.floatToRawIntBits(score) << 32 | docID and that two such longs are directly comparable.
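
A minimal sketch of that packing (hypothetical helper, not the actual Lucene/Solr code); note the cast to long before shifting and the masking of the docID:

public class PackedScoreDoc {
  /** Packs a non-negative score and a docID into a single long that sorts by score, then docID. */
  static long pack(float score, int docID) {
    // For positive floats, the raw int bits preserve the float ordering
    return ((long) Float.floatToRawIntBits(score) << 32) | (docID & 0xFFFFFFFFL);
  }

  static float score(long packed) {
    return Float.intBitsToFloat((int) (packed >>> 32));
  }

  static int docID(long packed) {
    return (int) packed;
  }
}

Depending on which document should win when scores are equal, the docID bits may need to be inverted before packing.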

With each element as a long, the object-guzzling ScoreDoc[] in the heap turns into a long[]. Besides taking only 8 bytes/element instead of 28, there is a much higher chance of CPU L2 cache hits as the longs in a long[] are packed tightly together on the heap.

Handling the case of a not-fully-filled heap is simple: Just add the elements to the array at the next free space. If the array gets full, heap-sort it and continue using it as a heap. If the queue does not run full, use the faster merge-sort to deliver the results in order.

Explorative experimentation

Which are just fancy words for poor methodology. Unfortunately the HitQueue used in Solr is not easily replaceable (it extends PriorityQueue, which has a lot of non-overridable methods). So to get a gist of which ideas are worth pursuing, we turn to micro benchmarks. Shalin Shekhar Mangar suggested using JMH. Sound advice, which is postponed for now in favour of “just run it a lot of times, alternating between implementations”.

The test bench is simple: Start a bunch of Threads, each running a test in parallel. Each test instantiates a queue implementation, fills it with random docID/score-pairs, then empties it. To guard against noisy neighbours, all threads test the same implementation and finish fully before switching to the next implementation.

For completeness, a Solr HitQueue sans sentinels is also tested. Spoiler: Turning off the sentinel looks like an extremely easy win for large top-X requests.

Experimental results – the original hypothesis

The idea was to improve on large Top-X, so let’s look at top-1M. The test was with 1, 4 & 16 threads on a quad-core i7 laptop and with varying amounts of hits. The raw output of the testPQPerformanceReport1M follows:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1     1000000        10    15.06 100%      0.50   3%■     0.97   6% 
      1     1000000       100    13.56 100%      0.47   3%■     0.97   7% 
      1     1000000      1000    13.28 100%      0.56   4%■     1.20   9% 
      1     1000000     10000    21.21 100%      1.28   6%■     1.58   7% 
      1     1000000    100000    86.04 100%     18.43  21%      8.34  10%■
      1     1000000   1000000   954.23 100%    447.40  47%     87.39   9%■
      1     1000000  10000000  3245.70 100%   3544.06 109%    895.89  28%■

      4     1000000        10    27.31 100%      1.47   5%■     3.35  12% 
      4     1000000       100    25.68 100%      1.58   6%■     2.97  12% 
      4     1000000      1000    25.79 100%      1.45   6%■     3.04  12% 
      4     1000000     10000    33.42 100%      2.27   7%■     2.95   9% 
      4     1000000    100000   119.99 100%     19.50  16%     11.52  10%■
      4     1000000   1000000  1456.82 100%    576.17  40%    134.46   9%■
      4     1000000  10000000  5934.11 100%   4278.37  72%   1385.38  23%■

     16     1000000        10   131.92 100%      3.26   2%■     9.79   7% 
     16     1000000       100   120.81 100%      4.08   3%■     8.76   7% 
     16     1000000      1000   124.63 100%      3.01   2%■    10.30   8% 
     16     1000000     10000   162.60 100%      4.68   3%■    10.49   6% 
     16     1000000    100000   485.46 100%     84.27  17%     32.81   7%■
     16     1000000   1000000  4702.79 100%   1787.32  38%    368.57   8%■
     16     1000000  10000000 16563.52 100%  10964.12  66%   4197.17  25%■

Below each implementation (Sentinel, No_Sentinel and Packed) are two columns: How many milliseconds it took to initialize, fill & empty the queue, followed by how long it took relative to the vanilla Solr Sentinel implementation (this is of course 100% for the Sentinel implementation itself). The fastest implementation for any given test is marked with a black box ■.

  • Sentinel starts out relatively slow for small amounts of Hits and gets dog-slow, when the amount of Hits gets to 1M+.
  • No_Sentinel is a lot better for the smaller hit counts (it does not have to do all the initialization) and markedly better up to 1M Hits.
  • Packed is slower than No_Sentinel for the smaller hit counts, but as the number of hits rises, it pulls markedly ahead. There really is no contest beyond 100K hits, where Packed is 3-10x as fast as vanilla Solr Sentinel.
    Notice how there is a relative performance hit when the amount of Hits exceeds the queue size for Packed. This is where it switches from “just fill the array sequentially, then merge-sort at the end” to maintaining a heap.

Experimental results – small top-X requests

A superior implementation for large top-X requests is nice and it is certainly possible to choose implementation based on the value of top-X. But let’s check to see how it behaves for small top-X requests. Top-10 for instance:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1          10        10     0.01 100%      0.01  49%      0.00  45%■
      1          10       100     0.01 100%      0.01  85%      0.01  81%■
      1          10      1000     0.01 100%■     0.02 204%      0.02 219% 
      1          10     10000     0.05 100%■     0.05 101%      0.08 152% 
      1          10    100000     0.59 100%      0.44  74%      0.37  62%■
      1          10   1000000     4.41 100%      4.53 103%      3.64  83%■
      1          10  10000000    42.15 100%     45.37 108%     36.33  86%■

      4          10        10     0.00 100%■     0.00 154%      0.01 329% 
      4          10       100     0.00 100%      0.01 348%      0.00  85%■
      4          10      1000     0.01 100%      0.01 120%      0.01  79%■
      4          10     10000     0.05 100%■     0.07 138%      0.08 154% 
      4          10    100000     0.47 100%      0.46  99%■     0.55 117% 
      4          10   1000000     4.71 100%      6.29 134%      4.45  95%■
      4          10  10000000    72.23 100%     60.09  83%     56.29  78%■

     16          10        10     0.00 100%      0.00  42%      0.00  39%■
     16          10       100     0.00 100%      0.00  72%      0.00  60%■
     16          10      1000     0.01 100%■     0.01 109%      0.01 128% 
     16          10     10000     0.08 100%      0.09 112%      0.07  80%■
     16          10    100000     1.48 100%      1.32  89%      1.12  76%■
     16          10   1000000    17.63 100%     18.74 106%     16.97  96%■
     16          10  10000000   207.65 100%    212.93 103%    192.81  93%■

Measurements for small numbers of Hits are somewhat erratic. That is to be expected as those tests are very fast (< 1ms) to complete, so tiny variations in machine load or garbage collection have a large impact.

  • Sentinel is as expected very fast, compared to the top-1M test. No surprise there as asking for top-10 (or top-20) is the core request for Solr. The relative speed stays reasonably constant as the number of Hits grows.
  • No_Sentinel is a bit slower than Sentinel. That can be explained by a slightly different code path for insertion of elements – this should be investigated as there is a possibility of an easy optimization.
  • Packed is very interesting: Although it is not consistently better for the lower numbers of Hits (do remember we are still talking sub-millisecond times here), it is consistently a bit faster for the larger ones. There might not be a need for choosing between implementations.

Conclusion

Replacing the ScoreDoc-object based HitQueue in Solr with a bit-packing equivalent is a winning strategy in this test: The speed-up is substantial in some scenarios and the memory usage is a third of vanilla Solr’s.

The road ahead

This should of course be independently verified, preferably on other machine architectures. It should be investigated how to handle the case where shardIndex is relevant and more importantly discussed how to adjust the Lucene/Solr search code to use an alternative HitQueue implementation.

Light validation of Solr configuration

September 23, 2015 by

This week we were once again visited by the Edismax field alias bug in Solr: Searches with boosts, such as foo^2.5, stopped working. The problem arises when an alias with one or more non-existing fields is defined in solrconfig.xml and it is tedious to track down as one needs to check for existence of all the fields referenced.

We have 10+ different Solr setups and we use aliases in most of them. So a quick script was whipped together: validate_config.sh, which (…wait for it…) validates Solr configs. Nothing fancy and it tends to report false positives when things are commented out in the XML files. Still, it does check that

  • all fields in schema.xml reference existing field types
  • all copyFields in schema.xml reference existing fields
  • all fields referenced in solrconfig.xml are defined in schema.xml
  • no alias in solrconfig.xml has the same name as a field in schema.xml

Some of these problems, such as referencing a non-existing field in mlt.fl or pf in solrconfig.xml, are silent and hard to track down: Solr does not complain and searches seem to work. But in the case of misspellings of field names, the result is poorer quality searches as the intended functionality is not activated.

Cross-validation of fields used in solrconfig.xml and schema.xml would be nice to have as part of Solr core startup, but until then the validate_config.sh script might be of use. Get it at GitHub.

Sampling methods for heuristic faceting

July 30, 2015 by

Initial experiments with heuristic faceting in Solr were encouraging: Using just a sample of the result set, it was possible to get correct facet results for large result sets, reducing processing time by an order of magnitude. Alas, further experimentation unearthed that the sampling method was vulnerable to clustering. While heuristic faceting worked extremely well for most of the queries, it failed equally hard for a few of the queries.

The problem

Abstractly, faceting on Strings is a function that turns a collection of documents into a list of top-X terms plus the number of occurrences of these terms. In Solr the collection of documents is represented with a bitmap: One bit per document; if the bit is set, the document is part of the result set. The result set of 13 hits for an index with 64 documents could look like this:

00001100 01010111 00000000 01111110

Normally the faceting code would iterate all the bits, get the terms for the ones that are set and update the counts for those terms. The iteration of the bits is quite fast (1 second for 100M bits), but getting the terms (technically the term ordinals) and updating the counters takes more time (100 seconds for 100M documents).

Initial attempt: Sample the full document bitmap

The initial sampling was done by dividing the result set into chunks and only visiting those chunks. If we wanted to sample 50% of our result set using 4 chunks, we would visit the first half of each 8-bit chunk:

4 chunks: 00001100 01111110 00000000 01010111

As can be counted (the first 4 bits of each chunk contain 0 + 3 + 0 + 2 set bits), the sampling hit 5 documents out of 13. Had we used 2 chunks, we would visit the first half of each 16-bit chunk:

2 chunks: 00001100 01111110 00000000 01010111

Only 2 hits out of 13 and not very representative. A high chunk count is needed: For 100M documents, 100K chunks worked fairly well. The law of large numbers helps a lot, but in case of document clusters (a group of very similar documents indexed at the same time) we still need both a lot of chunks and a high sampling percentage to have a high chance of hitting them. This sampling is prone to completely missing or over-representing clusters.

Current solution: Sample the hits

Remember that iterating over the result bitmap itself is relatively fast. Instead of processing chunks of the bitmap and skipping between them, we iterate over all the hits and only update counts for some of them.

If the sampling rate is 50%, every second hit would be used as sample:

50% sampling: 00001100 01111110 00000000 01010111

If the sampling rate is 33%, every third hit would be used as sample:

33% sampling: 00001100 01111110 00000000 01010111

This way of sampling is a bit slower than sampling on the full document bitmap as all bits must be visited, but it means that the distribution of the sampling points is as fine-grained as possible. It turns out that the better distribution gives better results, which means that the size of the sample can be lowered. Lower sample rate = higher speed.
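
A minimal sketch of this hit-sampling, using a plain long[] result bitmap and a hypothetical counter callback (the real code works on Solr's document sets and term ordinals):

import java.util.function.IntConsumer;

public class HitSampler {
  /** Visits every bit in the result bitmap, but only updates counters for one hit in everyNth. */
  static void sampleHits(long[] resultBits, int everyNth, IntConsumer countUpdater) {
    int hit = 0;
    for (int word = 0; word < resultBits.length; word++) {
      long bits = resultBits[word];
      while (bits != 0) {
        int docID = (word << 6) + Long.numberOfTrailingZeros(bits);
        if (hit++ % everyNth == 0) {
          countUpdater.accept(docID); // resolve term ordinals and update counters for this hit
        }
        bits &= bits - 1;             // clear the lowest set bit
      }
    }
  }
}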

Testing validity

A single shard from the Net Archive Search was used for testing. The shard was 900GB with 250M documents. Faceting was performed on the field links, which contains all outgoing links from indexed webpages. There are 600M unique values in that field and each document in the index contains an average of 25 links. For a full search on *:* that means 6 billion updates of the counter structure.

For this test, we look for the top-25 links. To get the baseline, a full facet count was issued for the top-50 links for a set of queries. A heuristic facet call was issued for the same queries, also for the top-50. The number of lines until the first discrepancy was counted for all the pairs. The ones with a count beneath 25 were considered faulty. The reason for the over-provisioning was to raise the probability of correct results, which of course comes with a performance penalty.

The sampling size was set to 1/1000 the number of documents or roughly 200K hits. Only result set sizes above 1M are relevant for validity as those below take roughly the same time to calculate with and without sampling.

Heuristic validity for top 25/50

While the result looks messy, the number of faulty results was only 6 out of 116 for result set sizes above 1M. For the other 110 searches, the top-25 terms were correct. Raising the over-provisioning to top-100 imposes a larger performance hit, but reduces the number of faulty results to 0 for this test.

Heuristic validity for top 25/100

Testing performance

The response times for full count faceting and heuristic faceting on the links field with over-provisioning of 50 are as follows:

Heuristic speed for top 25/50

Switching from linear to logarithmic plotting for the y-axis makes the difference immediately visible:

Heuristic speed for top 25/50, logarithmic y-axis

It can be seen that full counting rises linearly with result size, while sampling time is near-constant. This makes sense as the sampling was done by updating counts for a fixed amount of documents. Other strategies, such as making the sampling rate a fraction of the result size, should be explored further, but as the validity plot shows, the fixed strategy works quite well.

The performance chart for over provisioning of 100 looks very much like the one for 50, only with slightly higher response times for sampling. As the amount of non-valid results is markedly lower for an over provisioning of 100, this seems like the best speed/validity trade off for our concrete setup.

Heuristic speed for top 25/100, logarithmic y-axis

Summary

Heuristic faceting with sampling on hits gives a high probability of correct results. The speed-up relative to full facet counting rises with result set size, as sampling has near-constant response times. Using over-provisioning allows for fine-grained tweaking between performance and chance of correct results. Heuristic faceting is expected to be the default for interactive use with the links field. Viability of heuristic faceting for smaller fields is currently being investigated.

As always, there is full source code and a drop-in sparse faceting Solr 4.10 WAR at GitHub.

Dubious guesses, counted correctly

June 19, 2015 by

We do have a bit of a performance challenge with heavy faceting on large result sets in our Solr based Net Archive Search. The usual query speed is < 2 seconds, but if the user requests aggregations based on large result sets, such as all resources from a whole year, processing time jumps to minutes. To get an idea of how bad it is, here’s a chart for response times when faceting on a field with 640M unique values.

Faceting performance for field links with 600M unique values on a 900GB / 250M document index

Yes, the 80M hits query does take 16 minutes! As outlined in Heuristically correct top-X facets, it seems possible to use sampling to determine the top-X terms of the facet result and then fine count only those terms. The first version of heuristically correct top-X facets has now been implemented (download the latest Sparse faceting WAR to try it out), so time for evaluation.

Three facet fields

For this small scale evaluation we use just a single 900GB shard with 250M documents, generated from harvested web resources. The three fields of interest are

  • domain, with 1 value/document and 1.1M unique values. Of these, 230K are only referenced by a single document. The most popular domains are referenced by 4M documents.
    Intuitively, domain seems fitting for sampling, with relatively few unique values, not too many single instance values and a high amount of popular domains.
  • url, with 1 value/document and 200M unique values. Of these, 185M are only referenced by a single document. The most popular urls are referenced by 65K documents.
    Contrary to domain, url seems more problematic to sample, with relatively many unique values, a great deal of single value instances and not very many popular urls.
  • links, with 10 values/document and 600M unique values. Of these, 420M are only referenced by a single document. The most popular links are referenced by 8M documents.
    In between domain and url is links, with relatively many unique values, but only 10% of the 6 billion references being to single instance values, and with a high amount of popular links.

Methodology

Caveat lector: This test should not be seen as authoritative, but rather as an indicator of trade-offs. It was done on a heavily loaded machine, so real-world performance should be better. However, the relative differences in speed should not be too far off (tested ad hoc at a time where the machine was not under heavy load).

11 very popular terms were extracted from the general text field and used as query terms, to simulate queries that are heavy in terms of the number of hits.

Term         Hits
og            77M
a             54M
10            50M
to            45M
ikke          40M
søg           33M
denne         25M
også          22M
under         18M
telefon       10M
indkøbskurv    7M

The top 25 terms were requested with facet.limit=25 and sampling was performed by using only part of the result set to update the facet counters. The sampling was controlled by 2 options, combined in the illustrative request shown after the list:

  • fraction (facet.sparse.heuristic.fraction=0.xx): How much of the total number of documents to sample. If fraction is 0.01, this means 1% or 0.01*250M = 2.5M documents. Note that these are all the documents, not only the ones in the result set!
  • chunks (facet.sparse.heuristic.sample.chunks=xxx): How many chunks to split the sampling in. If chunks is 10 and fraction is 0.01, the 2.5M sample documents will be checked by visiting the first 250K, skipping ahead, visiting another 250K etc. 10 times.
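
An illustrative request combining the two options (not a copy of the actual test requests) could look like:

.../select?q=og&rows=0&facet=true&facet.field=domain&facet.limit=25&facet.sparse.heuristic.fraction=0.01&facet.sparse.heuristic.sample.chunks=100000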

To get a measure of validity, a full count was performed for each facet with each search term. The results from the sampled runs were then compared to the full count, by counting the number of correct terms from the top to the first error. Example: If the fully counted result is

  • a (100)
  • b (80)
  • c (50)
  • d (20)
  • e (20)

and the sample result is

  • a (100)
  • b (80)
  • c (50)
  • e (20)
  • f (18)

then the score would be 3. Note that the counts themselves are guaranteed to be correct. Only the terms are unreliable.

Measurements

Facet field domain (1.1M unique values, 1 value/document)

First we sample using half of all documents (sample fraction 0.5), for varying amounts of chunks: c10 means 10 chunks, c10K means 10000 chunks. As facet.limit=25, highest possible validity score is 25. Scores below 10 are marked with red, scores from 10-19 are marked purple.

Term         Hits  c10  c100  c1K  c10K  c100K
og            77M   19     9   25    25     25
a             54M   20     4   25    25     25
10            50M   20     5   25    25     25
to            45M   18    14   25    25     25
ikke          40M   16    15   25    25     25
søg           33M   16    15   23    25     24
denne         25M   17    18   23    24     25
også          22M   17    12   25    25     25
under         18M    4    12   23    23     25
telefon       10M   16     8   23    23     25
indkøbskurv    7M    8     2   16    21     25

Heuristic faceting for field domain with 50% sampling

Looking at this, it seems that c1k (1000 chunks) is good, except for the last term indkøbskurv, and really good for 10000 chunks. Alas, sampling with half the data is nearly the full work.

Looking at a sample fraction of 0.01 (1% of total size) is more interesting:

Term         Hits  c10  c100  c1K  c10K  c100K
og            77M    4     9   24    23     25
a             54M    4     4   23    24     25
10            50M    3     4   23    25     20
to            45M    0     0   24    24     24
ikke          40M    5    13   25    24     25
søg           33M    0     0   20    21     25
denne         25M    0     0   18    22     23
også          22M    6    12   23    25     25
under         18M    3     4   22    23     24
telefon       10M    5     7   12    12     25
indkøbskurv    7M    0     1    4    16     23

Heuristic faceting for field domain with 1% sampling

Here it seems that c10K is good and c100K is really good, using only 1% of the documents for sampling. If we were only interested in the top-10 terms, the over-provisioning call for top-25 would yield valid results for both c10k and c100k. If we want all top-25 terms to be correct, over-provisioning to top-50 or something like that should work.

The results are viable, even with a 1% sample size, provided that the number of chunks is high enough. So how fast is it to perform heuristic faceting, as opposed to full count?

Faceting performance for field domain with 1% sampling

The blue line represents the standard full counting faceting, no sampling. It grows linearly with result size, with worst case being 14 seconds. Sample based counting (all the other lines) also grows linearly, but with worst case at 2 seconds. Furthermore the speed difference between the different chunk counts is so small that choosing 100K chunks, and thereby the best chance of getting viable results, is not a problem.

In short: Heuristic faceting on the domain field for large result sets is 4-7 times faster than standard counting, with a high degree of viability.

Facet field url (200M unique values, 1 value/document)

Heuristic faceting for field url with 1% sampling

Faceting performance for field url with 1% sampling

The speed-up is a modest 2-4 times for the url field, but worse, the viability is low, even when using 100000 chunks. Raising the minimum result set size for heuristic faceting to 20M hits could conceivably work, but the url field still seems a poor fit. Considering that the url field does not have very many recurring values, this is not too surprising.

Facet field links (600M unique values, 10 values/document)

Heuristic faceting for field links with 1% sampling

Faceting performance for field links with 1% sampling

The heuristic viability of the links field is just as good as with the domain field: As long as the number of chunks is above 1000, sampling with 1% yields great results. The performance is 10-30 times that of standard counting. This means that the links field is an exceptionally good fit for heuristic faceting.

Removing the full count from the chart above reveals that worst-case in this setup is 22 seconds. Not bad for a result set of 77M documents, each with 10 references to any of 600M values:

Faceting performance for field links with 1% sampling, no baseline shown

Summary

Heuristically correct faceting for large result sets allows us to reduce the runtime of our heaviest queries by an order of magnitude. Viability and relative performance are heavily dictated by the term count distribution for the concrete fields (the url field was a poor fit) and by cardinality. Anyone considering heuristic faceting should test viability on their corpus before enabling it.

Word of caution

Heuristic faceting as part of Solr sparse faceting is very new and not tested in production. It is also somewhat rough around the edges; simple features such as automatic over-provisioning have not been implemented yet.

Net Archive Search building blocks

June 9, 2015 by

An extremely webarchive-discovery and Statsbiblioteket centric description of some of the technical possibilities with Net Archive Search. This could be considered internal documentation, but we like to share.

There are currently 2 generations of indexes at Statsbiblioteket: v1 (22TB) & v2 (8TB). External access is currently to v1. As the features of v2 are a superset of v1, v1 will be disabled as soon as v2 catches up in terms of the amount of indexed material. ETA: July or August 2015.

Aggregation fields

The following fields have the DocValues option enabled, meaning that it is possible to export them efficiently as well as doing sorting, grouping and faceting at a low memory cost.

Network

  • url_norm (v2): The resource URL, lightly normalised (lowercased, www removed, etc.) the same way as links. The 2 fields together can be used to generate graphs of resource interconnections.
    This field is also recommended for grouping. To get unique URLs in their earliest versions, do
    group=true&group.field=url_norm&group.sort=crawl_date+asc
  • links (v2): Outgoing links from web pages, normalised the same way as url_norm. As cardinality is non-trivial (~600M values per TB of index), it is strongly recommended to enable low-memory mode if this field is used for faceting:
    facet=true&facet.field=links&f.links.facet.sparse.counter=nplanez
  • host (v1, v2): The host part of the URL for the resource. Example: vejret.tv2.dk.
  • domain (v1, v2): The domain part of the URL for the resource. Example: tv2.dk.
  • links_hosts (v2): The host part of all outgoing links from web pages.
  • links_domain (v2): The domain part of all outgoing links from web pages.
  • links_public_suffixes (v2): The suffix of all outgoing links from web pages. Samples: dk, org.uk, nu

Time and harvest-data

  • last_modified (v2): As reported by the web server. Note that this is not a reliable value as servers are not reliable.
  • last_modified_year (v2): The year part of last_modified.
  • crawl_date (v2): The full and authoritative timestamp for when the resource was collected. For coarser grained (and faster) statistics, consider using crawl_date_year.
  • crawl_date_year (v2): The year part of the crawl_date.
  • publication_date (v2): The publication date as reported by the web server. Not authoritative.
  • publication_date_year (v2): The year part of the publication_date.
  • arc_orig (v2): Where the ARC file originated from. Possible values are sb & kb. If used for faceting, it is recommended to use enum: facet=true&facet.field=arc_orig&f.arc_orig.facet.method=enum.
  • arc_job (v2): The ID of the harvest job as used by the Danish Net Archive when performing a crawl.

Content

  • url (v1, v2): The resource URL as requested by the harvester. Consider using url_norm instead to reduce the amount of noise.
  • author (v2): As stated in PDFs, Word documents, presentations etc. Unfortunately the content is highly unreliable and with a high degree of garbage.
  • content_type (v2): The MIME type returned by the web server.
  • content_length (v2): The size of the raw resource, measured in bytes. Consider using this with Solr range faceting.
  • content_encoding (v2): The character set for textual resources.
  • content_language (v2): Auto-detected language of the resource. Unreliable for small text samples, but quite accurate on larger ones.
  • content_type_norm (v1, v2): Highly normalised content type. Possible values are: html, text, pdf, other, image, audio, excel, powerpoint, video & word.
  • content_type_version, _full, _tika, _droid, _served, _ext (v2): Variations of content type, resolved from servers and third party tools.
  • server (v2): The web server, as self-reported.
  • generator (v2): The web page generator.
  • elements_used (v2): All HTML-elements used on web pages.

Search & stored fields

It is not recommended to sort, group or facet on the following fields. If it is relevant to do so, DocValues can be enabled for v3.

  • id (v1, v2): The unique Solr-ID of the resource. Used together with highlighting or for graph-exploration.
  • source_files_s (v2): The name of the ARC file and the offset of the resource. Sample: 43084-88-20090304190229-00002-kb-prod-wb-001.kb.dk.arc@54173743.
    This can be used as a CDX-lookup replacement by limiting the fields returned:
    fl=url,source_files_s&rows=500.
  • arc_harvest (v2): The harvest-ID from the crawler.
  • hash (v2): SHA1-hash of the content. Can be used for finding exact duplicates.
  • ssdeep_hash_bs_3, ssdeep_hash_bs_6, ssdeep_hash_ngram_bs_3, ssdeep_hash_ngram_bs_6 (v2): Fuzzy hashes. Can be used for finding near-duplicates.
  • content (v1, v2): Available as content_text in v1. The full extracted text of the resource. Used for text-mining or highlighting:
    hl=true&hl.fl=content.

CDX-alternative

  1. Get core data for a single page:
    .../select?
    q=url:"http://www.dr.dk/" crawl_year:2010 content_type_norm:html&
    rows=1&fl=url,crawl_date,source_file_s&wt=json
    

    gives us

    "docs": [
          {
            "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
            "url": "http://www.dr.dk/",
            "crawl_date": "2010-06-18T14:33:29Z"
          }
        ]
    
  2. Request the resource 86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735 from storage, extract all links to images, css etc. The result is a list of URLs like http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg.
  3. Make a new request for the URLs from #2, grouped by unique URL, sorted by temporal distance to the originating page:
    .../select?
    q=url:("http://www.dr.dk/" OR 
           "http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg")&
    rows=5&fl=url,crawl_date,source_file_s&wt=json&
    group=true&group.field=url_norm&
    group.sort=abs(sub(ms(2010-06-18T14:33:29Z), crawl_date)) asc
    

    gives us

    "groups": [
            {
              "groupValue": "http://dr.dk/design/www/global/img/global/drlogos/dr_logo.jpg",
              "doclist": {
                "numFound": 331,
                "start": 0,
                "docs": [
                  {
                    "source_file_s": "87154-32-20100624134901-00003-sb-prod-har-005.arc@7259371",
                    "url": "http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg",
                    "crawl_date": "2010-06-24T13:51:10Z"
                  }
                ]
              }
            },
            {
              "groupValue": "http://www.dr.dk/",
              "doclist": {
                "numFound": 796,
                "start": 0,
                "docs": [
                  {
                    "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
                    "url": "http://www.dr.dk/",
                    "crawl_date": "2010-06-18T14:33:29Z"
                  }
                ]
              }
            }
          ]
    

Et voilà: Reconstruction of a webpage from a given point in time, using only search and access to the (W)ARC-files.

Heuristically correct top-X facets

May 30, 2015 by

For most searches in our Net Archive, we have acceptable response times, due to the use of sparse faceting with Solr. Unfortunately as well as expectedly, some of the searches are slow. Response times in minutes slow, if we’re talking worst case. It is tied to the number of hits: Getting the top-25 most popular links from pages about hedgehogs will take a few hundred milliseconds. Getting the top-25 links from all pages from 2010 takes minutes. Visualised, the response times look like this:

Massive speed drop-off for higher result sets

Everything beyond 1M hits is slow, everything beyond 10M hits is coffee time. Okay for batch analysis, but we’re aiming for interactive use.

Get the probably correct top-X terms by sampling

Getting the top-X terms for a given facet can be achieved by sampling: Instead of processing all hits in the result set, some of them are skipped. The result set iterator conveniently provides an efficient advance-method, making this very easy. As we will only use sampling with larger result sets, there should be enough data to be quite sure that the top-25 terms are the correct ones, although their counts are somewhat off.
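
A minimal sketch of advance-based sampling, assuming a Lucene DocIdSetIterator over the hits and a hypothetical counter callback. Note that the skip is expressed in document IDs, so the number of hits actually skipped varies with the density of the result set:

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public class SampledFacetCollector {
  /** Hypothetical callback; the real counter update lives in the sparse faceting code. */
  interface FacetCounters {
    void update(int docID) throws IOException;
  }

  static void countSample(DocIdSetIterator hits, FacetCounters counters, int skip)
      throws IOException {
    int doc = hits.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      counters.update(doc);            // resolve terms and increment counts for this hit
      doc = hits.advance(doc + skip);  // skip ahead instead of visiting every hit
    }
  }
}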

This of course all depends on how high X is in top-X, concrete corpus etc. The biggest danger is clusters of content in the corpus, which might be skipped. Maybe the skipping could be made in small steps? Process 100 documents, skip 500, process the next 100…? Tests will have to be made.

Fine count the top-X terms

With the correct terms being isolated, precisely those terms can be fine counted. This is nearly the same as vanilla distributed faceting, with the exception that all shards must fine count all the top-X terms, instead of only the terms they had not already processed earlier.

Of course the fine counting could be skipped altogether, which would be faster and potentially very usable for interactive exploratory use, where the exact counts do not really matter.

But there’s no guarantee?

No. Do remember that vanilla Solr distributed faceting is also a best-effort, with the same guarantee as above: The terms are not guaranteed to be the correct ones, but their counts are.

Seems simple enough

Ticket #38 for sparse faceting has been opened and we could really use this in the Danish Net Archive Search. No promises though.

Note 2015-05-30

Knut Anton Bøckman mentioned on Twitter that Primo has a faceting mechanism that looks similar to my proposal. It seems that Primo uses the top-200 hits to select the facets (or rather terms?), then does a fine-count on those.

It might work well to base the term selection on the top hits, rather than sampling randomly through all the hits, but I am afraid that 200 is so small a sample that some of the terms will differ from the right ones. I understand the need for a small number though: Getting the top-million hits or just top-hundred-thousand is costly.

Alternative counter tracking

April 30, 2015 by

Warning: Bit-fiddling ahead.

The initial driver for implementing Sparse Faceting was to have extraction-time scale with the result set size, instead of with the total number of unique values in the index. From a performance point of view, this works very well in the current implementation: A sidecar structure keeps track of which counters have been updated. It is a simple array of pointers (integers really) and sampling indicates that a size of about 8% of the total amount of counters works okay.

So 8%, expressed as Java integers? That is 4*0.08*#values bytes.

Less abstractly, we have a use case with 600M values/shard. Rounding each counter up to the nearest power of 2, the theoretical lower bound of memory needed for representing the counters is all the bits that can change during updates. For a sample shard that is about 140MB.

N-plane counters for the same shard take up 144MB + 157MB, where the first 144MB are shared between all counter instances. So they are pretty close to the theoretical lower bound – at least the extra instances are. Note: They can be forced to be practically at the lower bound, but that impacts performance.

Back to the tracker, because N-plane needs one to scale its performance with result set size. Tracker size was 4*0.08*#values bytes, which for the test shard is 192MB. For those extra counter instances, the tracker ends up using more memory than the counters it tracks.

What we need is something better to efficiently keep track of the touched values, where efficient is measured both in time and space. In fancy talk, that is a succinct data structure.

Implicit markers

With n-plane counters, each bit of a given counter is stored on a separate plane (at least conceptually): If a counter has a max of 5, it needs 3 bits to store the count and is thus spread across the lower 3 planes.

Suppose we treated the lowest plane separately from the rest of the planes: Instead of supplying the bit at position 0 for the usual binary representation, it simply states if 1 should be added to the number or not. Confused? Let’s represent the numbers from 0 to 8 in standard binary as well as our new system:

Decimal  0  1   2    3    4     5     6     7     8
Binary   0  1  10   11  100   101   110   111  1000  
Binary+  0  1  11  101  111  1001  1011  1101  1111

Note that the representation is not quite as compact as true binary; e.g. the tipping point for going from 3 to 4 planes is from 7 to 8 for binary, but from 4 to 5 for binary+. Loosely estimated, it will cost 1 extra bit/value to use the new representation. However, there is zero extra cost for counters with maximum value 1. It turns out that most of the outgoing links in our net archive corpus are unique, so about 400M of the 600M satisfy this. So the extra memory cost of the extended binary will be 200M / 8 bits/byte = 25MB.
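
A small sketch of converting between a counter value and the binary+ representation, just to make the table above concrete (hypothetical helpers, not part of the actual counter code):

public class BinaryPlus {
  /** The lowest bit is the "not zero" flag; the remaining bits hold value - 1. */
  static long encode(long value) {
    return value == 0 ? 0 : ((value - 1) << 1) | 1;
  }

  static long decode(long encoded) {
    return (encoded & 1) == 0 ? 0 : (encoded >>> 1) + 1;
  }
}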

Having a not-zero flag solves a performance issue with the current n-plane counters, but it also opens up for…

Bitmap based sparse tracking

The goal here is to quickly track & locate the counters that have a non-zero value. Underneath the hood, a plane in an n-plane counter is just an array of longs. With implicit markers, checking whether there are one or more counters with non-zero values in a group of 64 counters is a simple matter of checking whether a long differs from 0.

With our 600M counters, that means about 10M checks, reading 75MB data sequentially from heap. Not bad, but still a substantial fixed overhead.

So what if we create a bitmap with a single bit representing each long in the backing structure of the first plane? If the backing long has 0 updated counters, the bitmap entry is 0, for everything else it is 1. That would require #counters/64 bits or 1.2MB. Locating the non-zero counters then becomes a matter of reading about 150K longs sequentially and checking for 0. Now the fixed overhead seems manageable.

The bitmap would of course have to be maintained during updates, introducing an extra get-set to the act of incrementing a counter. However, the old tracker used a single set (and simpler logic), so the difference is not that great.

New idea vs. tried and true

Memory-wise, the new tracker would need about 26.2MB for 600M counters, where the old one needs 192MB. An obvious improvement.

For very low hit counts, the new tracker would lose: A fixed overhead of 150K checks is still more work than loading a few values directly from an array.

The complexity of the new tracker is far higher, but it might turn out to be fastest in most cases, as iteration of the tracked counters is guaranteed to be in sequential order, where the existing tracker visits them in random order.

