Sparse facet counting on a web archive

This post is a follow-up to Sparse facet counting on a real index. Where the previous post explored using a sparse counter for faceting on author in Statsbiblioteket's index of library material, this post focuses on faceting on url in our test index of archived web pages.

The index and the url field

The test index is 1TB (or 959 GB to be precise) with 420 million documents. The documents are made from harvested web pages and linked files of all types; Tika was used for metadata extraction. Every document has exactly one URL. Some documents are harvested multiple times, so there are only 315 million unique URLs rather than 420 million. The most popular URL occurs 3303 times. The index is optimized down to a single segment, but this has very little impact on tag collection speed.

Sparse counting

As described in Fast faceting with high cardinality and small result set, adding an overhead of 1/40 to the counting structure allows for faster counting with small result sets and a graceful fallback for larger result sets. Using 32 bit integers as counters, this means 4*315 MByte for the counters themselves + 7 MByte for the sparse tracker. Including fluff, this turns out to be 1228 MByte for our test corpus.
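To make the principle concrete, here is a minimal sketch of such a sparse counter: a plain int[] of counters plus a small tracker of touched ordinals, sized at 1/40 of the counter count. The class and field names are invented for illustration and do not reflect the actual code in the sparse faceting patch.

// Illustrative sketch only, assuming one counter per unique tag (here: unique URL).
public class SparseCounter {
  private final int[] counters;      // one counter per unique tag
  private final int[] tracker;       // ordinals of touched counters, ~1/40 overhead
  private int tracked = 0;           // number of entries in the tracker
  private boolean exceeded = false;  // true when the tracker overflowed

  public SparseCounter(int uniqueTags) {
    counters = new int[uniqueTags];
    tracker = new int[uniqueTags / 40 + 1];
  }

  public void increment(int ordinal) {
    if (counters[ordinal]++ == 0 && !exceeded) { // first hit for this ordinal
      if (tracked < tracker.length) {
        tracker[tracked++] = ordinal;
      } else {
        exceeded = true; // graceful fallback: behave like a plain counter
      }
    }
  }

  // Extraction only visits the touched ordinals when the result set was small enough.
  public void forEachNonZero(java.util.function.IntConsumer consumer) {
    if (exceeded) {
      for (int ord = 0; ord < counters.length; ord++) {
        if (counters[ord] > 0) consumer.accept(ord);
      }
    } else {
      for (int i = 0; i < tracked; i++) {
        consumer.accept(tracker[i]);
      }
    }
  }
}

The point of the tracker is that both extraction and clearing of the structure become proportional to the number of hits rather than the number of unique tags, as long as the result set stays below the 1/40 threshold.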

Sparse packed counting

But wait! We know that there are at most 3303 occurrences of any single tag, so why use 32 bit integers as counters? 2^12 = 4096, so we only need 12 bits for each counter. PackedInts to the rescue: we only need 506 MByte for a full counter structure, less than half of the non-packed version.
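A rough sketch of how such a packed counter structure could be allocated with Lucene's PackedInts follows below. The exact factory methods and constants may differ between Lucene versions, and this is not the code from the patch, just the principle.

import org.apache.lucene.util.packed.PackedInts;

// Sketch: size the packed counters to the maximum possible count.
// With 315 million unique URLs and at most 3303 occurrences per URL,
// bitsRequired(3303) == 12, so each counter needs 12 bits instead of 32.
int uniqueUrls = 315_000_000;
int bitsPerCounter = PackedInts.bitsRequired(3303); // 12

PackedInts.Mutable counters =
    PackedInts.getMutable(uniqueUrls, bitsPerCounter, PackedInts.COMPACT);

// Counting and reading go through get/set instead of array indexing:
int ordinal = 42;                                  // hypothetical tag ordinal
counters.set(ordinal, counters.get(ordinal) + 1);  // increment
long count = counters.get(ordinal);                // read back

The price is an extra shift-and-mask per counter access, which is why it is worth checking whether packed counting keeps up with plain int[] counting (see the results below).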

Testing methodology

As this test is about the difference between non-sparse and sparse facet counting, each test was executed twice and only the results from the second run were collected. This ensured that the I/O cache was warmed, so that the impact of the term-matching searches and document retrieval was minimized. As we have no logged queries, random words from the Danish iSpell dictionary were used, and only queries with hits were included. Queries were issued sequentially with a 10 ms wait between each. Times are full response times, measured through JMeter. In general, each query rarely returned more than 100 hits (which shows that we need better queries than random words).
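For anyone wanting to replicate the setup, the query generation is simple enough to sketch: pick random dictionary words and issue them sequentially with a short pause. The actual measurements were done through JMeter; the snippet below only illustrates the principle, and the dictionary file, host, core and field names are placeholders.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;

// Sketch of the test driver principle: random dictionary words, issued one at a time
// with a 10 ms pause between queries. All names and URLs are placeholders.
public class RandomWordQueries {
  public static void main(String[] args) throws IOException, InterruptedException {
    List<String> words = Files.readAllLines(Paths.get("dk.ispell.dict"), StandardCharsets.UTF_8);
    Random random = new Random(87);

    for (int i = 0; i < 1000; i++) {
      String word = words.get(random.nextInt(words.size()));
      String url = "http://localhost:8983/solr/webarchive/select?q="
          + URLEncoder.encode(word, "UTF-8")
          + "&facet=true&facet.field=url&facet.limit=25";

      long start = System.nanoTime();
      HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
      connection.getInputStream().close(); // read and discard the response
      long ms = (System.nanoTime() - start) / 1_000_000;
      System.out.println(word + ": " + ms + " ms");

      Thread.sleep(10); // 10 ms wait between queries, as in the test
    }
  }
}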

Results

Exposed faceting on 1TB web archive

The very low spread is due to the searcher being fully warmed for the queries used, which also explains the 5 ms response times for non-faceted queries. Plain exposed faceting takes half a second because all 315 million counters are iterated for each extraction. The small result sets, on the other hand, mean that sparse and packed only need to process a few hundred entries. Notice that the performance of packed is practically identical to sparse.

Should anyone wonder: Unwarmed non-faceted searches for one or two random words varied between 10 and 1000 ms for this index.

Hold your horses

This all looks fantastic, but do remember that the hits/corpus_size ratio is extreme here. Our previous test with our library index is a better indicator of what can be expected, and that “only” showed double the speed when using sparse.
