Fast faceting with high cardinality and small result set

This is a follow-up to the idea presented more than a year ago at http://sbdevel.wordpress.com/2013/01/23/large-facet-corpus-small-result-set/. It can be read independently of the old post.

The premise is simple: We have a Lucene/Solr index and we want to do some faceting. The facet field has high cardinality, which is a fancy way of saying that it has a lot of unique values. “A lot” is tens or hundreds of millions.

Old style counting

Ignoring all the little detail devils, getting the top 10 tags in a facet (sorting by count) is normally done like this:

// Init
counts = [#unique_tags]
// Search & collect
for each docID in hits {
  for each tagID in docID {
    counts[tagID]++
  }
}
// Extract
topTen = []
for tagID 0 ... #unique_tags {
  if (counts[tagID] > min(topTen)) {
   remove min(topTen) from topTen
   insert (tagID, counts[tagID]) in topTen
  }
}
// Result
create result from topTen
// Cleanup
for tagID 0 ... #unique_tags {
  counts[tagID] = 0
}

The init-part and the cleanup-part differ between implementations. Solr lets the JVM handle it by allocating new structures in the init-phase and dereferencing them in the cleanup-phase, leaving the rest to the garbage collector. Our home-brewed SOLR-2412 caches the counters, which requires a cleanup after each run but has very stable memory and GC impact.

Notice how the extraction-phase and the cleanup-phase iterate over all the tagIDs for the whole corpus, even if the result set itself is tiny? That is not very efficient. If we knew the number of unique tags for the result set in advance, we could select between e.g. the big counter array and a small hash set for keeping track of the tags, but we do not have that luxury.
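
To make the contrast concrete, the old counter boils down to something like the following minimal Java sketch, assuming tagIDs are dense ints from 0 to #unique_tags-1. The class and method names are illustrative, not Solr's actual API:

public class TagCounterDense {
  private final int[] counts; // one counter per unique tag in the corpus

  public TagCounterDense(int uniqueTags) {
    counts = new int[uniqueTags];
  }

  public void collect(int tagID) {
    counts[tagID]++;
  }

  // Clearing (like extraction) must visit every counter,
  // no matter how few documents the search matched.
  public void clear() {
    java.util.Arrays.fill(counts, 0);
  }
}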

Track the counters

With Using Uninitialized Memory for Fun and Profit in mind, we create a new counter that is efficient for small result sets and carries only a small speed penalty for large result sets. It is not too complicated:

TagCounterSparse {
  counts = [#unique_tags]
  updates = [#unique_tags / fraction]
  updatePointer = 0

  collect(tagID) {
    if (counts[tagID]++ == 0 && updatePointer != updates.length) {
      updates[updatePointer++] = tagID
    }
  }

  extract {
    topTen = []
    if (updatePointer != updates.length) {
      for each tagID in updates {
        if (counts[tagID] > min(topTen)) {
          remove min(topTen) from topTen
          insert (tagID, counts[tagID]) in topTen 
        }
      }
    } else {
      for tagID 0 ... #unique_tags {
        if (counts[tagID] > min(topTen)) {
          remove min(topTen) from topTen
          insert (tagID, counts[tagID]) in topTen
        }
      }
    }
  }

  clear {
    if (updatePointer != updates.length) {
      for each tagID in updates {
        counts[tagID] = 0
      }
    } else {
      for tagID 0 ... #unique_tags {
        counts[tagID] = 0
      }
    }
    updatePointer = 0
  }
}

What happens here is that a counter-tracker, updates, is introduced. When the count for a previously unseen tagID is increased, the tracker stores that tagID. If too many unique tagIDs are collected, the tracker stops keeping track.

Extraction of the top ten tags can be done in two ways. If there were too many unique tags in the result set, they are extracted exactly as in the old implementation. If the result set was small enough, only the counts for the collected tagIDs are accessed. For really small result sets, this is blazingly fast.

Clearing is similar to extraction: either it happens the old way, or only the collected tagIDs are touched.
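
To make the pseudo-code concrete, here is a minimal Java sketch of the sparse counter, again assuming dense int tagIDs. The class and method names are mine and the top ten is kept in a min-heap; the actual proof of concept lives in the Summa project:

import java.util.Arrays;
import java.util.PriorityQueue;

public class TagCounterSparse {
  private final int[] counts;  // one counter per unique tag in the corpus
  private final int[] updates; // tracks which counters have been touched
  private int updatePointer = 0;

  public TagCounterSparse(int uniqueTags, int fraction) {
    counts = new int[uniqueTags];
    updates = new int[uniqueTags / fraction];
  }

  public void collect(int tagID) {
    // Remember first-time increments until the tracker runs full
    if (counts[tagID]++ == 0 && updatePointer != updates.length) {
      updates[updatePointer++] = tagID;
    }
  }

  // Returns the (up to) ten most frequent tagIDs, most frequent first
  public int[] extractTopTen() {
    // Min-heap on count: the weakest of the current top ten is cheap to evict
    PriorityQueue<int[]> topTen =
        new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
    if (updatePointer != updates.length) { // sparse: only visit collected tagIDs
      for (int i = 0; i < updatePointer; i++) {
        consider(topTen, updates[i]);
      }
    } else { // tracker overflowed: scan all counters like the old implementation
      for (int tagID = 0; tagID < counts.length; tagID++) {
        consider(topTen, tagID);
      }
    }
    int[] result = new int[topTen.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = topTen.poll()[0]; // polling the min-heap yields ascending counts
    }
    return result;
  }

  private void consider(PriorityQueue<int[]> topTen, int tagID) {
    if (counts[tagID] == 0) {
      return; // only relevant during the full scan
    }
    if (topTen.size() < 10) {
      topTen.add(new int[]{tagID, counts[tagID]});
    } else if (counts[tagID] > topTen.peek()[1]) {
      topTen.poll();
      topTen.add(new int[]{tagID, counts[tagID]});
    }
  }

  public void clear() {
    if (updatePointer != updates.length) { // sparse: reset only the dirty counters
      for (int i = 0; i < updatePointer; i++) {
        counts[updates[i]] = 0;
      }
    } else { // tracker overflowed: reset everything
      Arrays.fill(counts, 0);
    }
    updatePointer = 0;
  }
}

A search then amounts to calling collect(tagID) for every tag of every document in the hits, extractTopTen() once, and clear() before reusing the counter.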

Considerations and gotchas

The sparse tag counter adds another layer of indirection, so if the result set is the whole corpus and updates is the same size as counts, all the phases will be slower than in the old solution. We need to find out how large a part of counts we should keep track of. This is the fraction.

Another consideration is that the old iterate-all-tagIDs approach had the very nice property of accessing memory sequentially. The sparse solution performs a random access for each collected tagID, which is a lot slower per access. This heavily influences what the fraction should be.
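
As a worked example of the memory cost: with 20M unique tags and a fraction of 40 (the numbers arrived at in the test below), counts is an int[20,000,000] of roughly 80 MB, while updates is an int[500,000] of roughly 2 MB. That is an overhead of 1/40, or 2.5%.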

Measure twice

An artificial Solr index was created. It had 20M documents and a single-value field with 20M unique values. Searches were performed with result sets consisting of every 2nd document, every 5th document, every 10th and so on. For each search, the faceting time spent on collecting, extracting and clearing was measured. The first result for each unique search was discarded and all searches were repeated 5 times, with the best results being used. All times are in milliseconds.

Old implementation

Every  Collect  Extract  Clear  Total
    2      527      153     12    692
    5      215       84     12    311
   10      112       56     12    180
   20       62       40     12    114
   30       44       33     12     89
   40       36       31     11     78
   50       29       30     12     71
  100       15       27     12     54
  200        8       25     12     45
  500        4       24     12     40
 1000        2       24     12     38
 5000        1       23     12     36

Notice how the extract time converges towards roughly 23 ms and the clear time is constant at about 12 ms.

Sparse (#updates = #counts)

Every  Collect  Extract  Clear  Total
    2      544      453    160  1157
    5      221      183     63   467
   10      114       92     32   238
   20       63       46     16   125
   30       46       31     10    87
   40       38       23      8    69
   50       30       19      6    55
  100       15       10      3    28
  200        9        5      1    15
  500        3        1      0     4
 1000        1        0      0     1
 5000        1        0      0     1

Notice that the collect time is only a little worse than the old collect time, that the extract time is markedly better when the result set is every 40th document or sparser, and that clear is also markedly faster from every 40th document onwards.

This suggests that the fraction should be 40 for this corpus, which is artificial. Your mileage will most certainly vary, but for this test we set it to 40.

Sparse (#updates = #counts/40)

Every  Collect  Extract  Clear  Total
    2      545      151     12    708
    5      227       86     13    326
   10      118       56     13    187
   20       65       40     12    117
   30       47       35     13     95
   40       40       24      8     72
   50       31       19      6     56
  100       18       10      3     31
  200        9        5      2     16
  500        2        2      0      4
 1000        1        0      0      1
 5000        0        0      0      0

Notice how the extraction and clear times are the same as for the old implementation for result sets with every 2nd, 5th, 10th, 20th and 30th document. After that, the extraction and clear times match the sparse counter with #updates = #counts. It is the best of both worlds.

Conclusion

For this test, using a sparse tag collector taking 1/40 more RAM than a standard collector results in increasingly better relative performance the smaller the result set gets; the speed converges towards instant response. This comes at the price of slightly worse performance for large result sets.

Future

Right now this is just a proof of concept that resides in our local Summa project (look for TestSingleSegmentOptimization.java). It needs to be tested on real-world corpora and with logged queries before any real patches for Solr are to be considered. It will end up on SOLR-2412 though, when I find the time to upgrade it to Lucene/Solr 4.7.

