Long tail packed counters for faceting

Our home-brewed Sparse Faceting for Solr is all about counting: When calculating a traditional String facet such as

Author
- H.C.Andersen (120)
- Brothers Grimm (90)
- Arabian Nights (7)
- Carlo Collodi (5)
- Ronald Reagan (2)

the core principle is to have a counter for each potential term (author name in this example) and update that counter by 1 for each document with that author. There are different ways of handling such counters.

Level 0: int[]

Stock Solr uses an int[] to keep track of the term counts, meaning that each unique term takes up 32 bits or 4 bytes of memory for counting. Normally that is not an issue, but with 6 billion terms (divided between 25 shards) in our Net Archive Search, this means a whopping 24GB of memory for each concurrent search request.
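
As a minimal Java sketch of this level 0 approach (names are mine, not Solr internals):

// Level 0 sketch: one 32 bit counter per unique term, 4 bytes/term.
public class IntCounters {
  private final int[] counters;

  public IntCounters(int uniqueTerms) {
    counters = new int[uniqueTerms];
  }

  public void increment(int termOrdinal) {
    counters[termOrdinal]++; // called for each document containing the term
  }

  public int get(int termOrdinal) {
    return counters[termOrdinal];
  }
}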

Level 1: PackedInts

Sparse Faceting tries to be clever about this. An int can count from 0 to 2 billion, but if the maximum number of documents for any term is 3000, there will be a lot of wasted space. 2^12 = 4096, so in the case of maxCount=3000, we only need 12 bits/term to keep track of it all. Currently this is handled by using Lucene’s PackedInts to hold the counters. With the 6 billion terms, this means 9GB of memory. Quite an improvement on the 24GB from before.
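
A rough Java sketch, assuming Lucene's PackedInts utility (illustrative; not the actual Sparse Faceting code):

import org.apache.lucene.util.packed.PackedInts;

// Level 1 sketch: counters sized by the highest possible count.
// bitsRequired(3000) == 12, so each counter takes 12 bits instead of 32.
public class PackedCounters {
  private final PackedInts.Mutable counters;

  public PackedCounters(int uniqueTerms, long maxCount) {
    counters = PackedInts.getMutable(
        uniqueTerms, PackedInts.bitsRequired(maxCount), PackedInts.COMPACT);
  }

  public void increment(int termOrdinal) {
    counters.set(termOrdinal, counters.get(termOrdinal) + 1);
  }

  public long get(int termOrdinal) {
    return counters.get(termOrdinal);
  }
}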

Level 2: Long tail PackedInts with fallback

Packing counters has a problem: The size of all the counters is dictated by the overall maxCount. Just a single highly used term can nullify the gains: If all documents share one common term, the individual counters must be log2(docCount) bits wide. With a few hundred million documents, that puts the size at 27-29 bits/term, very close to the int[] representation’s 32 bits.

Looking at the Author example at the top of this page, it seems clear that the counts for the authors are far from equal: The top-2 authors have counts a lot higher than the bottom-3. This is called a long tail and it is a very common pattern. It means that the overall maxCount for the terms is likely to be a lot higher than the maxCount for the vast majority of the terms.

While I was loudly lamenting all the wasted bits, Mads Villadsen came by and solved the problem: What if we keep track of the terms with a high maxCount in one structure and the ones with a lower maxCount in another structure? Easy enough to do with a lot of processing overhead, but tricky to do efficiently. Fortunately Mads also solved that (my role as primary bit-fiddler is in serious jeopardy). The numbers in the following explanation are just illustrative and should not be seen as final.

The counter structure

We have 200M unique terms in a shard. The terms are long tail-distributed, with the most common ones having maxCount in the thousands and the vast majority with maxCount below 100.

We locate the top-128 terms and see that their maxCounts range from 2921 down to 87. We create an int[128] to keep track of their counts and call it head.

head
Bit           31 30 29 ...  2  1  0
Term h_0       0  0  0 ...  0  0  0
Term h_1       0  0  0 ...  0  0  0
...
Term h_126     0  0  0 ...  0  0  0
Term h_127     0  0  0 ...  0  0  0

The maxCount for the terms below the 128 largest ones is 85. 2^7 = 128, so we need 7 bits to hold each of those. We allocate a PackedInts structure with 200M entries of 7+1 = 8 bits and call it tail.

tail
Bit                 7* 6  5  4  3  2  1  0
Term 0               0  0  0  0  0  0  0  0
Term 1               0  0  0  0  0  0  0  0
...
Term 199,999,998     0  0  0  0  0  0  0  0
Term 199,999,999     0  0  0  0  0  0  0  0

The tail has an entry for all terms, including those in head. For each of the large terms in head, we locate its position in tail. At that tail-counter, we set the value to the term’s index in the head counter structure and set the highest bit to 1.

Let’s say that head entry term h_0 is located at position 1 in tail, h_126 is located at position 199,999,998 and h_127 is located at position 199,999,999. After marking the head entries in the tail structure, it would look like this:

tail with marked heads
Bit                 7* 6  5  4  3  2  1  0
Term 0               0  0  0  0  0  0  0  0
Term 1               1  0  0  0  0  0  0  0
...
Term 199,999,998     1  1  1  1  1  1  1  0
Term 199,999,999     1  1  1  1  1  1  1  1
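
A Java-flavoured sketch of this setup, assuming the ordinals of the top-128 terms are known up front (illustrative, not the actual implementation):

import org.apache.lucene.util.packed.PackedInts;

// Level 2 sketch: allocate head and tail, then mark the top-128 terms in tail
// as pointers into head.
public class LongTailCounters {
  final int[] head = new int[128];   // 32 bit counters for the top-128 terms
  final PackedInts.Mutable tail;     // 7 counter bits + 1 marker bit per term

  public LongTailCounters(int uniqueTerms, int[] headOrdinals) {
    tail = PackedInts.getMutable(uniqueTerms, 8, PackedInts.COMPACT);
    for (int headIndex = 0; headIndex < headOrdinals.length; headIndex++) {
      // Highest bit = "look in head", lower 7 bits = index into head
      tail.set(headOrdinals[headIndex], 128 | headIndex);
    }
  }
}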

Hanging on so far? Good.

Incrementing the counters

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if ((count & 128) == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
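
In Java this could look roughly like the following, continuing the illustrative LongTailCounters sketch from above:

  // Continuing the LongTailCounters sketch: one increment per matching document
  public void increment(int ordinal) {
    long count = tail.get(ordinal);     // 1. read the tail entry
    if ((count & 128) == 128) {         // 2. is the marker bit set?
      head[(int) (count & 127)]++;      // 3. yes: the lower 7 bits index into head
    } else {
      tail.set(ordinal, count + 1);     // 4. no: plain tail counter
    }
  }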

Pros and cons

In this example, the counters take up 6 billion * 8 bits + 25 * 128 * 32 bits ≈ 5.7GB. The performance overhead, compared to the PackedInts version, is tiny: Whenever a head bit is encountered, there will be an extra read to get the old head value before writing the value+1. As head will statistically be heavily used, it is likely to be in Level 2/3 cache.

This is just an example, but it should be quite realistic, as approximate values from the URL field in our Net Archive Index have been used. Nevertheless, it must be stressed that the memory gains from long tail PackedInts are highly dictated by the shape of the long tail curve.

Afterthought

It is possible to avoid the extra bit in tail by treating the large terms as any other term until their tail-counters reach the maximum (127 in the example above). When a counter’s max has been reached, the head-counter can then be located using a lookup mechanism, such as a small HashMap or maybe just a linear scan through a short array with the ordinals and counts for the large terms. This would reduce the memory requirements to approximately 6 billion * 7 bits = 5.3GB. Whether this memory/speed trade-off is better or worse is hard to guess and depends on result set size.
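
A Java sketch of this alternative, assuming a plain HashMap as the lookup mechanism and that the large terms are known up front (illustrative only):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.util.packed.PackedInts;

// No marker bit: tail counters saturate at 127 and further increments for the
// known large terms go to head. The real count for such a term is 127 + its
// head value.
public class SaturatingCounters {
  private static final int TAIL_MAX = 127;   // 7 bit counters, no marker bit
  private final PackedInts.Mutable tail;
  private final int[] head;
  private final Map<Integer, Integer> ord2head = new HashMap<>();

  public SaturatingCounters(int uniqueTerms, int[] headOrdinals) {
    tail = PackedInts.getMutable(uniqueTerms, 7, PackedInts.COMPACT);
    head = new int[headOrdinals.length];
    for (int i = 0; i < headOrdinals.length; i++) {
      ord2head.put(headOrdinals[i], i);
    }
  }

  public void increment(int ordinal) {
    long count = tail.get(ordinal);
    if (count < TAIL_MAX) {
      tail.set(ordinal, count + 1);      // cheap, common case
    } else {
      head[ord2head.get(ordinal)]++;     // rare: the tail counter has saturated
    }
  }
}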

Implementation afterthought

The long tail PackedInts could implement the PackedInts interface itself, making it usable elsewhere. Its constructor could take another PackedInts structure filled with maxCounts, or a histogram of maxbit requirements.

Heureka update 2015-02-27

There is no need to mark the head terms in the tail structure up front. All the entries in tail act as standard counters until the highest bit is set. At that moment the bits used for counting switch to being a pointer into the next available entry in the head counters. The update workflow thus becomes

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if ((count & 128) == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
  5. If the counter reaches bit 7, carry the count over to head and change the counter-bits to be pointer-bits:
    if ((count+1) == 128) { head.set(headpos, count+1); tail.set(ordinal, 128 | headpos++) }
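
As a Java-flavoured sketch (illustrative; it assumes an int[] head, a PackedInts.Mutable tail and an int headpos holding the next free slot in head, all starting out as zeroes):

  // Lazily assigned head pointers: no up-front marking of head terms needed
  public void increment(int ordinal) {
    long count = tail.get(ordinal);
    if ((count & 128) == 128) {              // already a pointer into head
      head[(int) (count & 127)]++;
    } else if (count + 1 == 128) {           // the 7 counter bits overflow
      head[headpos] = 128;                   // carry the count over to head
      tail.set(ordinal, 128 | headpos++);    // turn the tail entry into a pointer
    } else {
      tail.set(ordinal, count + 1);          // plain tail counter
    }
  }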

Pros and cons

The new logic means that initialization and resetting of the structures is simply a matter of filling them with 0. Update performance will be on par with the current PackedInts implementation for all counters whose values are within the cutoff. After that, the penalty of an extra read is paid, but only for the overflowing values.

The memory overhead is unchanged from the long tail PackedInts implementation and still suffers from the extra bit used for signalling count vs. pointer.

Real numbers 2015-02-28

The store-pointers-as-values approach has the limitation that there can only be as many head counters as the maxCount for tail. Running the numbers on the URL field for three of the shards in our net archive index resulted in bad news: The tip of the long tail shape was not very pointy and it is only possible to shave 8% off the counter size. Far less than the estimated 30%. The Packed64 in the table below is the current structure used by sparse faceting.

Shard 1 URL: 228M unique terms, Packed64 size: 371MB
tail BPV   required memory   saved          head size
11         342MB             29MB /  8%     106
12         371MB              0MB /  0%     6

However, we are in the process of experimenting with faceting on links, which has a much higher tip on its long tail curve. From a nearly fully built test shard we have:

8/9 built shard links: 519M unique terms, Packed64 size: 1427MB
tail BPV   required memory   saved          head size
15         1038MB            389MB / 27%    14132
16         1103MB            324MB / 23%    5936
17         1168MB            260MB / 18%    3129
18         1233MB            195MB / 14%    1374
19         1298MB            130MB /  9%    909
20         1362MB             65MB /  5%    369
21         1427MB              0MB /  0%    58

For links, the space saving was 27% or 389MB for the nearly-finished shard. To zoom out a bit: Doing faceting on links for our full corpus with stock Solr would take 50GB. Standard sparse faceting would use 35GB and long tail would need 25GB.

Due to sparse faceting, response time for small result sets is expected to be a few seconds for the links-facet. Larger result sets, not to mention the dreaded *:* query, would take several minutes, with worst-case (qualified guess) around 10 minutes.

Three-level long tail 2015-02-28

Previously:

  • pointer-bit: Letting the values in tail switch to pointers when they reach maximum has the benefit of very little performance overhead, with the downside of taking up an extra bit and limiting the size of head.
  • lookup-signal: Letting the values in tail signal “find the counter in head” when they reach maximum has the downside that a sparse lookup-mechanism, such as a HashMap, is needed for head, making lookups comparatively slow.

New idea: Mix the two techniques. Use the pointer-bit principle until there is no more room in head. head-counters beyond that point all get the same pointer (all value bits set) in tail and their position in head is determined by a sparse lookup-mechanism ord2head.

This means that

  • All low-value counters will be in tail (very fast).
  • The most used high-value counters will be in head and will be referenced directly from tail (fast).
  • The less used high-value counters will be in head and will require a sparse lookup in ord2head (slow).

Extending the pseudo-code from before:

value = tail.get(ordinal)
if (value == 255) { // indirect pointer signal
  head.inc(ord2head.get(ordinal))
} else if ((value & 128) == 128) { // pointer-bit set
  head.inc(value & 127)
} else { // term-count = value
  value++;
  if (value != 128) { // tail-value ok
    tail.set(ordinal, value)
  } else { // tail-value overflow
    head.set(headpos, value)
    if (headpos < 127) { // direct pointer
      tail.set(ordinal, 128 | headpos++)
    } else { // indirect pointer
      tail.set(ordinal, 255)
      ord2head.put(ordinal, headpos++)
    }
  }
}
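
Putting the three levels together in a self-contained Java sketch (field names and the HashMap-based ord2head are my choices, not the actual implementation):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.util.packed.PackedInts;

// Three levels: 8 bit tail entries, direct pointers for the first 127
// overflowing terms, HashMap lookup for the rest.
public class ThreeLevelCounters {
  private final PackedInts.Mutable tail;   // 7 counter bits + 1 pointer bit
  private final int[] head;                // 32 bit counters for overflowing terms
  private final Map<Integer, Integer> ord2head = new HashMap<>();
  private int headpos = 0;                 // next free slot in head

  public ThreeLevelCounters(int uniqueTerms, int maxHeadTerms) {
    tail = PackedInts.getMutable(uniqueTerms, 8, PackedInts.COMPACT);
    head = new int[maxHeadTerms];
  }

  public void increment(int ordinal) {
    long value = tail.get(ordinal);
    if (value == 255) {                    // indirect pointer signal
      head[ord2head.get(ordinal)]++;
    } else if ((value & 128) == 128) {     // direct pointer: lower 7 bits index head
      head[(int) (value & 127)]++;
    } else if (++value < 128) {            // plain tail counter, still fits in 7 bits
      tail.set(ordinal, value);
    } else {                               // tail overflow: move the counter to head
      head[headpos] = (int) value;
      if (headpos < 127) {                 // a direct pointer is still available
        tail.set(ordinal, 128 | headpos++);
      } else {                             // fall back to the sparse lookup
        tail.set(ordinal, 255);
        ord2head.put(ordinal, headpos++);
      }
    }
  }

  public long get(int ordinal) {
    long value = tail.get(ordinal);
    if (value == 255) {
      return head[ord2head.get(ordinal)];
    }
    return (value & 128) == 128 ? head[(int) (value & 127)] : value;
  }
}

In this sketch, resetting between searches is a matter of filling tail and head with 0, clearing ord2head and setting headpos back to 0.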
