Dubious guesses, counted correctly

June 19, 2015 by

We do have a bit of a performance challenge with heavy faceting on large result sets in our Solr-based Net Archive Search. The usual query speed is < 2 seconds, but if the user requests aggregations based on large result sets, such as all resources from a whole year, processing time jumps to minutes. To get an idea of how bad it is, here’s a chart of response times when faceting on a field with 640M unique values.

Faceting performance for field links with 600M unique values on a 900GB / 250M document index

Yes, the 80M hits query does take 16 minutes! As outlined in Heuristically correct top-X facets, it seems possible to use sampling to determine the top-X terms of the facet result and then fine count only those terms. The first version of heuristically correct top-X facets has now been implemented (download the latest Sparse faceting WAR to try it out), so time for evaluation.

Three facet fields

For this small scale evaluation we use just a single 900GB shard with 250M documents, generated from harvested web resources. The three fields of interests are

  • domain, with 1 value/document and 1.1M unique values. Of these, 230K are only referenced by a single document. The most popular domains are referenced by 4M documents.
    Intuitively, domain seems fitting for sampling, with relatively few unique values, not too many single instance values and a high amount of popular domains.
  • url, with 1 value/document and 200M unique values. Of these, 185M are only referenced by a single document. The most popular urls are referenced by 65K documents.
    Contrary to domain, url seems more problematic to sample, with relatively many unique values, a great deal of single value instances and not very many popular urls.
  • links, with 10 values/document and 600M unique values. Of these, 420M are only referenced by a single document. The most popular links are referenced by 8M documents.
    In between domain and url is links, with relatively many unique values, but only 10% of the 6 billion references pointing to single instance values and a high amount of popular links.

Methodology

Caveat lector: This test should not be seen as authoritative, but rather as an indicator of trade-offs. It was done on a heavily loaded machine, so real-world performance should be better. However, the relative differences in speed should not be too far off (tested ad hoc at a time when the machine was not under heavy load).

11 very popular terms were extracted from the general text field and used as query terms, to simulate queries that are heavy in terms of the number of hits.

Term Hits
og 77M
a 54M
10 50M
to 45M
ikke 40M
søg 33M
denne 25M
også 22M
under 18M
telefon 10M
indkøbskurv  7M

The top 25 terms were requested with facet.limit=25 and sampling was performed by using only part of the result set to update the facet counters. The sampling was controlled by 2 options:

  • fraction (facet.sparse.heuristic.fraction=0.xx): How much of the total number of documents to sample. If fraction is 0.01, this means 1% or 0.01*250M = 2.5M documents. Note that these are all the documents, not only the ones in the result set!
  • chunks (facet.sparse.heuristic.sample.chunks=xxx): How many chunks to split the sampling in. If chunks is 10 and fraction is 0.01, the 2.5M sample documents will be checked by visiting the first 250K, skipping ahead, visiting another 250K etc. 10 times.
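
As a concrete example, a request for the top-25 terms of the links field with 1% sampling spread over 10,000 chunks could look like the sketch below (field name and values are illustrative; any additional parameter needed to switch heuristic faceting on is not shown here):

    q=og&facet=true&facet.field=links&facet.limit=25&facet.sparse.heuristic.fraction=0.01&facet.sparse.heuristic.sample.chunks=10000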

To get a measure of validity, a full count was performed for each facet with each search term. The results from the sampled runs were then compared to the full count by counting the number of correct terms from the top to the first error. Example: If the fully counted result is

  • a (100)
  • b (80)
  • c (50)
  • d (20)
  • e (20)

and the sample result is

  • a (100)
  • b (80)
  • c (50)
  • e (20)
  • f (18)

then the score would be 3. Note that the counts themselves are guaranteed to be correct. Only the terms are unreliable.
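
A minimal sketch of this scoring in Java, assuming the fully counted and the sampled top-X terms are available as ordered lists (the names are made up for illustration):

    import java.util.Arrays;
    import java.util.List;

    public class ValidityScore {
        // Counts how many leading terms of the sampled result match the fully
        // counted result, stopping at the first mismatch.
        public static int score(List<String> fullCount, List<String> sampled) {
            int score = 0;
            int limit = Math.min(fullCount.size(), sampled.size());
            while (score < limit && fullCount.get(score).equals(sampled.get(score))) {
                score++;
            }
            return score;
        }

        public static void main(String[] args) {
            List<String> full = Arrays.asList("a", "b", "c", "d", "e");
            List<String> sample = Arrays.asList("a", "b", "c", "e", "f");
            System.out.println(score(full, sample)); // prints 3, as in the example above
        }
    }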

Measurements

Facet field domain (1.1M unique values, 1 value/document)

First we sample using half of all documents (sample fraction 0.5), for varying amounts of chunks: c10 means 10 chunks, c10K means 10000 chunks. As facet.limit=25, highest possible validity score is 25. Scores below 10 are marked with red, scores from 10-19 are marked purple.

Term Hits c10 c100 c1K c10K c100K
og 77M 19 9 25 25 25
a 54M 20 4 25 25 25
10 50M 20 5 25 25 25
to 45M 18 14 25 25 25
ikke 40M 16 15 25 25 25
søg 33M 16 15 23 25 24
denne 25M 17 18 23 24 25
også 22M 17 12 25 25 25
under 18M 4 12 23 23 25
telefon 10M 16 8 23 23 25
indkøbskurv 7M 8 2 16 21 25

Validity of heuristic faceting for field domain with 50% sampling

Looking at this, it seems that c1k (1000 chunks) is good, except for the last term indkøbskurv, and really good for 10000 chunks. Alas, sampling with half the data is nearly the full work.

Looking at a sample fraction of 0.01 (1% of total size) is more interesting:

Term Hits c10 c100 c1K c10K c100K
og 77M 4 9 24 23 25
a 54M 4 4 23 24 25
10 50M 3 4 23 25 20
to 45M 0 0 24 24 24
ikke 40M 5 13 25 24 25
søg 33M 0 0 20 21 25
denne 25M 0 0 18 22 23
også 22M 6 12 23 25 25
under 18M 3 4 22 23 24
telefon 10M 5 7 12 12 25
indkøbskurv 7M 0 1 4 16 23

Validity of heuristic faceting for field domain with 1% sampling

Here it seems that c10K is good and c100K is really good, using only 1% of the documents for sampling. If we were only interested in the top-10 terms, the over-provisioning call for top-25 would yield valid results for both c10k and c100k. If we want all top-25 terms to be correct, over-provisioning to top-50 or something like that should work.
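
Until automatic over-provisioning is implemented (see the word of caution below), the manual work-around is simply to request more terms than needed and discard the surplus, e.g.

    facet.limit=50

and then only use the first 25 terms of the response.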

The results are viable, even with a 1% sample size, provided that the number of chunks is high enough. So how fast is it to perform heuristic faceting, as opposed to full count?

Faceting performance for field domain with 1% sampling

The blue line represents the standard full counting faceting, no sampling. It grows linearly with result size, with the worst case being 14 seconds. Sample-based counting (all the other lines) also grows linearly, but with a worst case of 2 seconds. Furthermore, the speed difference between the numbers of chunks is so small that choosing 100K chunks, and thereby the best chance of getting viable results, is not a problem.

In short: Heuristic faceting on the domain field for large result sets is 4-7 times faster than standard counting, with a high degree of viability.

Facet field url (200M unique values, 1 value/document)

Validity of heuristic faceting for field url with 1% sampling

Faceting performance for field url with 1% sampling

The speed-up is a modest 2-4 times for the url field, but worse, the viability is low, even when using 100000 chunks. Raising the minimum result set size for heuristic faceting to 20M hits could conceivably work, but the url field still seems a poor fit. Considering that the url field does not have very many recurring values, this is not too surprising.

Facet field links (600M unique values, 10 values/document)

Heuristic faceting for field links with 1% sampling

Faceting performance for field links with 1% sampling

The heuristic viability of the links field is just as good as with the domain field: As long as the number of chunks is above 1000, sampling with 1% yields great results. The performance is 10-30 times that of standard counting. This means that the links field is an exceptionally good fit for heuristic faceting.

Removing the full count from the chart above reveals that worst-case in this setup is 22 seconds. Not bad for a result set of 77M documents, each with 10 references to any of 600M values:

Faceting performance for field links with 1% sampling, no baseline shown

Summary

Heuristically correct faceting for large result sets allows us to reduce the runtime of our heaviest queries by an order of magnitude. Viability and relative performance are heavily dictated by the term count distribution of the concrete fields (the url field was a poor fit) and by cardinality. Anyone considering heuristic faceting should test viability on their corpus before enabling it.

Word of caution

Heuristic faceting as part of Solr sparse faceting is very new and not tested in production. It is also somewhat rough around the edges; simple features such as automatic over-provisioning have not been implemented yet.

Net Archive Search building blocks

June 9, 2015 by

An extremely webarchive-discovery- and Statsbiblioteket-centric description of some of the technical possibilities with Net Archive Search. This could be considered internal documentation, but we like to share.

There are currently 2 generations of indexes at Statsbiblioteket: v1 (22TB) & v2 (8TB). External access is currently to v1. As the features of v2 are a superset of v1, v1 will be disabled as soon as v2 catches up in terms of the amount of indexed material. ETA: July or August 2015.

Aggregation fields

The following fields have the DocValues option enabled, meaning that it is possible to export them efficiently as well as to do sorting, grouping and faceting at a low memory cost.

Network

Field v1 v2 Description
url_norm * The resource URL, lightly normalised (lowercased, www removed, etc.) the same way as links. The 2 fields together can be used to generate graphs of resource interconnections.

This field is also recommended for grouping. To get unique URLs in their earliest versions, do

group=true&group.field=url_norm&group.sort=crawl_date+asc

links * Outgoing links from web pages, normalised the same way as url_norm. As cardinality is non-trivial (~600M values per TB of index), it is strongly recommended to enable low-memory mode if this field is used for faceting:
facet=true&facet.field=links&f.links.facet.sparse.counter=nplanez
host * * The host part of the URL for the resource. Example: vejret.tv2.dk.
domain * * The domain part of the URL for the resource. Example: tv2.dk.
links_hosts * The host part of all outgoing links from web pages.
links_domain * The domain part of all outgoing links from web pages.
links_public_suffixes * The suffix of all outgoing links from web pages. Samples: dk, org.uk, nu

Time and harvest-data

Field v1 v2 Description
last_modified * As reported by the web server. Note that this is not a reliable value as servers are not reliable.
last_modified_year * The year part of last_modified.
crawl_date * The full and authoritative timestamp for when the resource was collected. For coarser grained (and faster) statistics, consider using crawl_date_year.
crawl_date_year * The year part of the crawl_date.
publication_date * The publication date as reported by the web server. Not authoritative.
publication_date_year * The year part of the publication_date.
arc_orig * Where the ARC file originated from. Possible values are sb & kb. If used for faceting, it is recommended to use enum: facet=true&facet.field=arc_orig&f.arc_orig.facet.method=enum.
arc_job * The ID of the harvest job as used by the Danish Net Archive when performing a crawl.

Content

Field v1 v2 Description
url * * The resource URL as requested by the harvester. Consider using url_norm instead to reduce the amount of noise.
author * As stated in PDFs, Word documents, presentations etc. Unfortunately the content is highly unreliable, with a high degree of garbage.
content_type * The MIME type returned by the web server.
content_length * The size of the raw resource, measured in bytes. Consider using this with Solr range faceting.
content_encoding * The character set for textual resources.
content_language * Auto-detected language of the resource. Unreliable for small text samples, but quite accurate on larger ones.
content_type_norm * * Highly normalised content type. Possible values are: html, text, pdf, other, image, audio, excel, powerpoint, video & word.
content_type_version, content_type_full, content_type_tika, content_type_droid, content_type_served, content_type_ext * Variations of content type, resolved from servers and third-party tools.
server * The web server, as self-reported.
generator * The web page generator.
elements_used * All HTML-elements used on web pages.

Search & stored fields

It is not recommended to sort, group or facet on the following fields. If it is relevant to do so, DocValues can be enabled for v3.

Field v1 v2 Description
id * * The unique Solr-ID of the resource. Used together with highlighting or for graph-exploration.
source_files_s * The name of the ARC file and the offset of the resource. Sample: 43084-88-20090304190229-00002-kb-prod-wb-001.kb.dk.arc@54173743.

This can be used as a CDX-lookup replacement by limiting the fields returned:

fl=url,source_files_s&rows=500.

arc_harvest * The harvest-ID from the crawler.
hash * SHA1-hash of the content. Can be used for finding exact duplicates.
ssdeep_hash_bs_3, ssdeep_hash_bs_6, ssdeep_hash_ngram_bs_3, ssdeep_hash_ngram_bs_6 * Fuzzy hashes. Can be used for finding near-duplicates.
content * * Available as content_text in v1. The full extracted text of the resource. Used for text-mining or highlighting:

hl=true&hl.fl=content.

CDX-alternative

  1. Get core data for a single page:
    .../select?q=url%3A%22http%3A%2F%2Fwww.dr.dk%2F%22
    +crawl_year%3A2010+content_type_norm%3Ahtml
    &rows=1&fl=url%2Ccrawl_date%2Csource_file_s
    &wt=json
    

    gives us

    "docs": [
          {
            "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
            "url": "http://www.dr.dk/",
            "crawl_date": "2010-06-18T14:33:29Z"
          }
        ]
    
  2. Request the resource 86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735 from storage, extract all links to images, css etc. The result is a list of URLs like http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg.
  3. Make a new request for the URLs from #2, grouped by unique URL, sorted by temporal distance to the originating page:
    .../select?q=url%3A(%22http%3A%2F%2Fwww.dr.dk%2F%22+OR+
    %22http%3A%2F%2Fwww.dr.dk%2Fdesign%2Fwww%2Fglobal%2Fimg%2Fglobal%2F
    DRLogos%2FDR_logo.jpg%22)
    &rows=5&fl=url%2Ccrawl_date%2Csource_file_s&wt=json
    &group=true&group.field=url_norm
    &group.sort=abs(sub(ms(2010-06-18T14:33:29Z),%20crawl_date))%20asc
    

    gives us

    "groups": [
            {
              "groupValue": "http://dr.dk/design/www/global/img/global/drlogos/dr_logo.jpg",
              "doclist": {
                "numFound": 331,
                "start": 0,
                "docs": [
                  {
                    "source_file_s": "87154-32-20100624134901-00003-sb-prod-har-005.arc@7259371",
                    "url": "http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg",
                    "crawl_date": "2010-06-24T13:51:10Z"
                  }
                ]
              }
            },
            {
              "groupValue": "http://www.dr.dk/",
              "doclist": {
                "numFound": 796,
                "start": 0,
                "docs": [
                  {
                    "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
                    "url": "http://www.dr.dk/",
                    "crawl_date": "2010-06-18T14:33:29Z"
                  }
                ]
              }
            }
          ]
    

Et voilà: Reconstruction of a webpage from a given point in time, using only search and access to the (W)ARC-files.

Heuristically correct top-X facets

May 30, 2015 by

For most searches in our Net Archive, we have acceptable response times, due to the use of sparse faceting with Solr. Unfortunately as well as expectedly, some of the searches are slow. Response times in minutes slow, if we’re talking worst case. It is tied to the number of hits: Getting the top-25 most popular links from pages about hedgehogs will take a few hundred milliseconds. Getting the top-25 links from all pages from 2010 takes minutes. Visualised, the response times look like this:

Massive speed drop-off for higher result sets

Everything beyond 1M hits is slow, everything beyond 10M hits is coffee time. Okay for batch analysis, but we’re aiming for interactive use.

Get the probably correct top-X terms by sampling

Getting the top-X terms for a given facet can be achieved by sampling: Instead of processing all hits in the result set, some of them are skipped. The result set iterator conveniently provides an efficient advance-method, making this very easy. As we will only use sampling with larger result sets, there should be enough data to be quite sure that the top-25 terms are the correct ones, although their counts are somewhat off.

This of course all depends on how high X is in top-X, the concrete corpus, etc. The biggest danger is clusters of content in the corpus, which might be skipped. Maybe the skipping could be done in small steps? Process 100 documents, skip 500, process the next 100…? Tests will have to be made.
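
A rough sketch of the process-some, skip-some idea, using Lucene’s DocIdSetIterator over the result set (the CounterUpdater callback is a made-up stand-in for the actual counter update; the real sparse faceting code is organised differently):

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;

    public class ChunkedSampler {
        public interface CounterUpdater {
            void increment(int docID) throws IOException;
        }

        // Processes up to chunkSize hits, then skips ahead skipSize document IDs, repeats.
        public static void sample(DocIdSetIterator hits, int chunkSize, int skipSize,
                                  CounterUpdater updater) throws IOException {
            int doc = hits.nextDoc();
            while (doc != DocIdSetIterator.NO_MORE_DOCS) {
                for (int i = 0; i < chunkSize && doc != DocIdSetIterator.NO_MORE_DOCS; i++) {
                    updater.increment(doc);              // update facet counters for this hit
                    doc = hits.nextDoc();
                }
                if (doc != DocIdSetIterator.NO_MORE_DOCS) {
                    doc = hits.advance(doc + skipSize);  // skip ahead in the result set
                }
            }
        }
    }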

Fine count the top-X terms

With the correct terms being isolated, precisely those terms can be fine counted. This is nearly the same as vanilla distributed faceting, with the exception that all shards must fine count all the top-X terms, instead of only the terms they had not already processed earlier.

Of course the fine counting could be skipped altogether, which would be faster and potentially very usable for interactive exploratory use, where the exact counts do not really matter.

But there’s no guarantee?

No. Do remember that vanilla Solr distributed faceting is also a best-effort, with the same guarantee as above: The terms are not guaranteed to be the correct ones, but their counts are.

Seems simple enough

Ticket #38 for sparse faceting has been opened and we could really use this in the Danish Net Archive Search. No promises though.

Note 2015-05-30

Knut Anton Bøckman mentioned on Twitter that Primo has a faceting mechanism that looks similar to my proposal. It seems that Primo uses the top-200 hits to select the facets (or rather terms?), then do a fine-count on those.

It might work well to base the term selection on the top hits, rather than sampling randomly through all the hits, but I am afraid that 200 is so small a sample that some of the terms will differ from the right ones. I understand the need for a small number though: Getting the top-million hits or just top-hundred-thousand is costly.

Alternative counter tracking

April 30, 2015 by

Warning: Bit-fiddling ahead.

The initial driver for implementing Sparse Faceting was to have extraction-time scale with the result set size, instead of with the total number of unique values in the index. From a performance point of view, this works very well in the current implementation: A side car structure keeps track of which counters have been updated. It is a simple array of pointers (integers really) and sampling indicates that a size of about 8% of the total amount of counters works okay.

So 8%, expressed as Java integers? That is 4*0.08*#values bytes.

Less abstractly, we have a use case with 600M values/shard. Rounding each counter up to the nearest power of 2, the theoretical lower bound of memory needed for representing the counters is all the bits that can change during updates. For a sample shard that is about 140MB.

N-plane counters for the same shard takes up 144MB + 157MB, where the first 144MB are shared between all counter instances. So they are pretty close to the theoretical lower bound – at least the extra instances. Note: They can be forced to be practically at the lower bound, but that impacts performance.

Back to the tracker, because N-plane needs one to scale its performance with result set size. Tracker size was 4*0.08*#values bytes, which for the test shard is 192MB. For those extra counter instances, the tracker ends up using more memory than the counters it tracks.

What we need is something better to efficiently keep track of the touched values, where efficient is measured both in time and space. In fancy talk, that is a succinct data structure.

Implicit markers

With n-plane counters, each bit of a given counter is stored on a separate plane (at least conceptually): If a counter has a max of 5, it needs 3 bits to store the count and is thus spread across the lower 3 planes.

Suppose we treated the lowest plane separately from the rest of the planes: Instead of supplying the bit at position 0 for the usual binary representation, it simply states if 1 should be added to the number or not. Confused? Let’s represent the numbers from 0 to 8 in standard binary as well as our new system:

Decimal  0  1   2    3    4     5     6     7     8
Binary   0  1  10   11  100   101   110   111  1000  
Binary+  0  1  11  101  111  1001  1011  1101  1111

Note that the representation is not quite as compact as true binary; e.g. the tipping point for going from 3 to 4 planes is from 7 to 8 for binary, but from 4 to 5 for binary+. Loosely estimated, it will cost 1 extra bit/value to use the new representation. However, there is zero extra cost for counters with a maximum value of 1. It turns out that most of the outgoing links in our net archive corpus are unique, so about 400M of the 600M satisfy this. So the extra memory cost of the extended binary will be 200M / 8 bits/byte = 25MB.
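
The relationship between the two representations can be captured in a couple of lines. This is just a sketch of the numbering scheme, not of how the planes are actually stored:

    public class BinaryPlus {
        // The lowest bit acts as the "add 1" flag, the remaining bits are standard binary.
        static long encode(long value) {
            return value == 0 ? 0 : ((value - 1) << 1) | 1;
        }

        static long decode(long bits) {
            return (bits >>> 1) + (bits & 1);   // e.g. 1011 -> 101 + 1 = 6
        }
    }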

Having a not-zero flag solves a performance issue with the current n-plane counters, but it also opens up for…

Bitmap based sparse tracking

The goal here is to quickly track & locate the counters that have a non-zero value. Under the hood, a plane in an n-plane counter is just an array of longs. With implicit markers, checking whether there are one or more counters with non-zero values in a group of 64 counters is a simple matter of checking whether a long is equal to 0.

With our 600M counters, that means about 10M checks, reading 75MB data sequentially from heap. Not bad, but still a substantial fixed overhead.

So what if we create a bitmap with a single bit representing each long in the backing structure of the first plane. If the backing long has 0 updated counters, the bitmap entry is 0, for everything else it is 1. That would require #counters/64 bits or 1.2MB. Locating the non-zero counters then becomes a matter of reading 18K longs sequentially and checking for 0. Now the fixed overhead seems manageable.

The bitmap would of course have to be maintained during updates, introducing an extra get-set to the act of incrementing a counter. However, the old tracker used a single set (and simpler logic), so the difference is not that great.
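
A stripped-down sketch of the tracking idea (not the actual implementation): one bit per backing long of the first plane, set on every update and scanned sequentially when the touched counters are needed:

    public class NonZeroTracker {
        private final long[] bitmap;  // 1 bit per backing long: set if it may hold non-zero counters

        public NonZeroTracker(int backingLongs) {  // backingLongs = length of the first plane in longs
            bitmap = new long[(backingLongs + 63) / 64];
        }

        // Called whenever a counter inside backing long number longIndex is incremented.
        public void markTouched(int longIndex) {
            bitmap[longIndex >>> 6] |= 1L << (longIndex & 63);
        }

        public interface LongVisitor {
            void visit(int longIndex);
        }

        // Visits only the backing longs that may contain non-zero counters.
        public void visitTouched(LongVisitor visitor) {
            for (int wi = 0; wi < bitmap.length; wi++) {
                long bits = bitmap[wi];
                while (bits != 0) {
                    visitor.visit((wi << 6) + Long.numberOfTrailingZeros(bits));
                    bits &= bits - 1;  // clear the lowest set bit
                }
            }
        }
    }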

New idea vs. tried and true

Memory-wise, the new tracker would need about 26.2MB for 600M counters, where the old one needs 192MB. An obvious improvement.

For very low hit counts, the new tracker would lose: A fixed overhead of 18K checks is still more work than loading a few values directly from an array.

The complexity of the new tracker is far higher, but it might turn out to be fastest in most cases, as iteration of the tracked counters is guaranteed to be in sequential order, where the existing tracker iterates in random order.

Measuring N-plane time & space

April 30, 2015 by

This article explores the boundaries of the experimental sparse faceting code, both in terms of processing time and in terms of space requirements. The code base has just been updated and the new features are available as a Solr 4.8 compatible binary download, as well as source.

Faceting overview (you can skip this)

The core performance-critical part of sparse faceting as well as vanilla String fc-faceting for Solr can be divided in three phases:

  1. Count the occurrences of the terms in the facet field
  2. Find the X highest counters
  3. Resolve the Strings for the counters

Sparse faceting started with the idea of improving on #2 by only considering the counters that were actually touched by #1. Later on it improved on #3 when running distributed in SolrCloud. Now #1 is being handled by introducing multi-threaded counting.

Memory-wise there are three main faceting structures:

  1. A map from documents to term ordinals
  2. A counter structure, with 1 counter per unique term
  3. Structures for resolving ordinals to Strings

Memory overhead is very dependent on how the fields are indexed in Solr. The current recommendation for faceting is to index in a StrField (which stores the value verbatim) as well as enabling DocValues, which lowers memory requirements considerably. There is still a base cost of using DocValues (personal rule of thumb: 1 byte/String) with the current default DocValuesFormat, but that is allegedly hard to avoid without compromising performance significantly.
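
In schema.xml terms, that recommendation corresponds to something like the sketch below (field name and attributes are illustrative, not our exact schema):

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <field name="links" type="string" indexed="true" stored="false"
           multiValued="true" docValues="true"/>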

#1: In the usual case, a Solr index consists of multiple segments. In order to do faceting across these segments, a mapping from segment-ordinals to global ordinals is created. This scales linearly with the number of terms in all the segments. In the special case of fully optimized shards, there is no need for this mapping and thus the memory overhead of #1 will be zero. The Danish Net Archive Search uses fully optimized shards.

#2: Vanilla Solr uses a simple int[]-array to hold the counters: Great for low- to medium-cardinality faceting, but quite costly (2.4GB per concurrent call) if you want to facet on 600M unique values. With n-plane counters for sparse faceting, the density of the counting structures is twice the theoretical lower limit for a single counter (pedantic note: It really is a bit more than twice the theoretical lower limit, as we round up to nearest power of two – when somebody implements an efficient arithmetic coding counter, give me a shout). For the shards in the Danish Net Archive, this is about 300MB. Extra counter instances, used for concurrent searches, takes up only the theoretical lower limit, or about 150MB.
To get fast calculation of the top-X counters after the counting step, an extra structure is needed, which brings the total overhead up to about 3-4 times the theoretical lower limit. This can be disabled, but with cardinality in the hundreds of millions, this will impose a very heavy performance penalty.

#3: Having paid the initial cost of using DocValues, the resolving of ordinals to Strings is practically free in terms of memory. With non-DocValued fields, some of the Strings are stored in memory to ensure performant lookups.

Multi threaded faceting

Threading aims to lower latency (response time). It only works when there is a surplus of CPU power, so the typical scenario is setups with heavy but infrequent searches. Infrequent just means that multiple concurrent searches are rare. Vanilla Solr supports multi threaded faceting in two ways:

  1. Multiple facet fields can be processed in parallel.
  2. Using facet.method=fcs, multiple segments can be processed in parallel. However, this comes with a non-trivial merge cost that involves resolving of extra Strings.

Recently in-segment threading was added to sparse faceting. Technically this also covers #2 above, with the added benefit of not needing the merge step.

The tricky part of in-segment threaded counting was to get fast thread-safe updates of the packed counter structures, but for once the amount of counters helps: The counter updates are lock-free and opportunistic, using AtomicLongArray’s compareAndSet, which is nearly as fast as plain array updates. In the rare case of collisions, the thread waits the lowest amount of time possible before trying again.
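
A greatly simplified sketch of the opportunistic update (the real code operates on packed longs holding multiple counters per long, not a full long per counter):

    import java.util.concurrent.atomic.AtomicLongArray;

    public class OpportunisticCounters {
        private final AtomicLongArray counters;

        public OpportunisticCounters(int size) {
            counters = new AtomicLongArray(size);
        }

        // Lock-free increment: on the rare collision with another thread, retry.
        public void increment(int index) {
            while (true) {
                long expected = counters.get(index);
                if (counters.compareAndSet(index, expected, expected + 1)) {
                    return;
                }
                Thread.yield();  // back off minimally before retrying
            }
        }
    }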

Counters

The two term occurrence counters relevant to this experiment are:

  1. n-plane, which is small but uses complex logic and requires multiple random memory accesses (an average of 2 reads + 2 writes) when a counter is incremented. Its size, including the sparse structure, is 500MB for this experiment.
  2. packed, which is a variant of Packed64Singleblock using an AtomicLongArray as backing store. It has relatively simple logic and requires only 1 read and 1 write when a counter is incremented. It saves a bit of space compared to the 2.4GB for vanilla Solr int[], but is three times larger (1.5GB) than n-plane.

Test setup

As we have recently started re-indexing everything in the Danish Net Archive (and since yours truly was a bit too quick on the fingers and accidentally deleted a couple of terabytes of perfectly fine shards), we only have a small sample corpus to test on. Hopefully it will be somewhat representative.

Recap: The Net Archive Search at Statsbiblioteket uses fully optimized shards of 900GB / 250M documents. One of the fields is outgoing links from webpages. Each shard has 600M unique values in the links-field, with 6 billion references from documents to values in that field.

Solr needs 2.2GB of heap just to open 1 such shard and perform a non-faceted search. With n-plane faceting, the minimum heap requirement rises to 3.3GB. In production we use a max heap size of 8GB as we have multiple facet fields and want to have wriggle room.

Due to the aforementioned circumstances, we currently only have 4 such shards in our SolrCloud to test on. They are handled by a 16 core Xeon machine with 256GB of RAM, so the hardware power vs. index size is very favourable, compared to the expected full setup with 23 shards.

Testing was done with random words from a Danish dictionary, one request at a time. The amount of threads used for counting was varied between 1, 2 & 4.

Results

 

4 shards of 900GB / 250M docs with 600M unique values in the facet field, tested with 1, 2 & 4 counting threads

The same graph, with 80 seconds as max for the y-axis:

4 shards of 900GB / 250M docs with 600M unique values in the facet field, tested with 1, 2 & 4 counting threads

In the legends, _t1 means 1 thread, _t2 is two threads etc. Some observations:

  1. There is something strange going on with packed in the very low end (1-1000 hits): Responses should be near-instantaneous in that area. This should be investigated further.
  2. packed is markedly faster relative to n-plane, although the overall speed means that the absolute difference is only visible from 100K hits and up.
  3. Threading gives a significant boost from 10M hits and up.

Note that the graphs only cover hit counts up to 100M, although there are 1000M documents in the index. Searching for random words does not really give that many hits: Besides the catch-all-term http, the second most popular term og (Danish for and) only occurs in ⅓ of the corpus.

Getting above 100M hits out of the 1000M documents in the current corpus is certainly possible by issuing or-queries, range-queries or similar, but it is not something that is likely to happen very often during interactive use.

Nevertheless, it must be stressed that worst-case performance is poor. Measuring the catch-all *:*-query (1,066,091,283 hits), the timing was:

Threads N-plane ms Packed ms
1 587,202 310,249
2 321,569 210,712
4 201,532 123,263

For interactive use, random response times in minutes are unacceptable. Besides adding copious amounts of hardware, some ideas for handling that problem are

  • Start by doing a fast non-facet search to get hit count. If that number indicates long faceting time, show a warning to the user. Potentially this could be a rough countdown.
  • Add sampling functionality, where the sample-count only uses the first 1M hits. Note that the hits are not processed in order of relevance, so the sample will be made up of hits from the beginning of the overall index structure (most likely the oldest records).
    A more advanced sampler could advance through the result set to distribute the sampling points better.

The small amount of shards means that there is a lot of CPU power free on the test machine. As it gets filled, there will be less surplus power and the effect of threading is likely to go down. Our setup might end up not getting any advantage out of threaded counting.

On the other hand, the low heap overhead of faceting with n-plane means that we will probably enable faceting on the 600M unique values field. It makes sense to pay the 1100MB heap overhead per shard to get fine-grained graph analysis; especially when the cost of additional counter instances is only about 350MB.

Facet filtering

April 10, 2015 by

In generation 2 of our net archive search we plan to experiment with real-time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is with regular expressions. The main problem with that approach is that all the Strings for the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.* which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp-requirement are requested, chances are that millions of terms must be resolved to Strings and checked before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher chance of wasp-related links and would thus require far fewer resolutions.

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
    q=gardening
  2. The most popular resources that pages on a given site link to externally
    q=domain:example.com&facet.sparse.blacklist=https?://[^/]*example\.com
  3. The most popular images that pages on a given site link to internally
    q=domain:example.com&facet.sparse.whitelist=https?://[^/]*example\.com/.*\.(gif|jpg|jpeg|png)$
  4. The most popular non-Danish resources that Danish pages link to
    q=domain_suffix:dk&facet.sparse.blacklist=https?://[^/]*\.dk
  5. The most popular JavaScripts that all pages from a given year link to
    q=harvest_year:2015&facet.sparse.whitelist=.*\.js$

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative for the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

900GB, 200M+ docs, 500M+ unique values, 5 billion+ references

Observations

  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.

Conclusion

Worst case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As this is tightly correlated to the result set size, which is fast to calculate, adding a warning with “This query is likely to take minutes to process” could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead for regular expressions was less than feared. Real-time graphs, or at least fill-the-coffee-cup-time graphs, seem doable, at the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes 2015-04-11

@maxxkrakoa noted “@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each.“. He is right and it does help significantly for our 6 field faceting. For single field faceting, support for real multi-thread counting would be needed.

The simple way of doing multi-thread counting is to update multiple copies of the counter structure and merge them at the end. For a 500M+ field, that is likely to be prohibitive with regards to both memory and speed: The time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the “things to ponder when sleep is not coming”-list.

Additional notes 2015-04-13

It seems that Java 1.7’s AtomicLongArray performs very well: Quite comparable to plain long[] in the context of multi-millions of counters, where contention is low. This raises the probability of implementing true threaded faceting markedly.

N-plane packed counters for faceting

March 13, 2015 by

Faceting in Solr works well out of the box up to some millions of unique values in the facet field. There is a small performance penalty linear to the number of unique values, which begins to show after some point. Sparse faceting solves this and makes faceting performance linear to the result size. Sparse faceting also reduces memory requirements by packing the values tighter. As seen in the blog post Long tail packed counters for faceting, this packing can be improved, depending on the concrete term distribution in the facet field.

An implementation of Long tail packed counters has been created and renamed to Dual-plane counters. During development, a new idea sprung to life in the form of N-plane counters. In this blog post the different counters will be explained and performance measurements from a micro-benchmark (yes, we know, they are not very reliable) based on real-world data will be presented.

Quick re-cap from the long tail blog post:

  • We have hundreds of millions of counters, referenced by ordinal
  • Each counter corresponds to a unique term in a facet field
  • We know the maximum for each individual counter
  • The overall distribution of the maxima is long tail
  • We want the overall counter structure to be small
  • We want the overall counter structure to be fast

Instead of viewing the maxima as plain numbers, they can be viewed as required bits. If a counter has a maximum of 15, all possible values can be expressed with 4 bits, as 2⁴-1 = 15. Similarly a counter with a maximum of 100 needs 7 bits, as 2⁷-1 = 127. The collection of counters can be visualized as

Term counters as bits

The counter for term A needs 1 bit, term B needs 3 bits, term C needs 1 bit and term D needs 11 bits. If we sorted the terms according to maxima instead of alphanumeric, the distribution would look like

Shape of maxima sorted by maxima

However, as the order of the terms is dictated by Solr, sorting the terms for use with faceting would require a mapping from Solr’s term ordinals to the position in the sorted list. This option has not been explored yet, so for now we’ll stick with the original order.

Test data

We extracted the maxima for the 640M unique links in a test-shard from our net archive and grouped them by the number of bits needed for expressing those maxima. The long tail distribution is evident. The theoretically optimal worst-case representation of the counters is the collective number of bits needed (the white squares seen above). For the sample below that is 138MB.

Bits #terms
1 425,799,733
2 85,835,129
3 52,695,663
4 33,153,759
5 18,864,935
6 10,245,205
7 5,691,412
8 3,223,077
9 1,981,279
10 1,240,879
11 714,595
12 429,129
13 225,416
14 114,271
15 45,521
16 12,966
17 4,005
18 1,764
19 805
20 789
21 123
22 77
23 1

The contestants

Stock Solr: Counter-representation with atomic numeric (y goes to 32, only bit 1-16 shown)

Stock Solr faceting uses an int[] to hold the counts. It is simple and fast. In the visualization above, the white squares are bits holding counter values, while the red area represents bits that are always 0. There is a lot of red with this counter structure. The sample counters take up 2442MB with Stock Solr faceting. Pro: speed, con: space.

Sparse Faceting: Counter-representation with PackedInts

Sparse Faceting has the option of using PackedInts for counters. It lowers the overhead ceiling, but there is still a lot of wasted space. With Sparse Faceting, 1755MB is needed for the sample counters. Pro: speed, con: space + needs to know overall maximum.

Dual-plane: Counter-representation in low-bits with references to high-bit

Holding low counts in one structure and switching high counts to another structure requires 1 extra bit/value for signalling which counters have a high value (see Long tail packed counters for faceting for details). Even with the extra bit, this gives an overall size reduction. Dual-plane uses 1221MB for the sample counters. Pro: speed, con: needs histogram of maxima for initialization.

N-plane: Horizontal slicing of counters

Extending Dual-plane to an arbitrary number of planes and replacing the explicit pointers with counting logic, N-plane worst-case representation takes up just 2x the size of the raw bits. In the illustration above, the blue boxes are bits representing overflow up to the next plane. The overflow bits are counted to calculate the offset in the higher planes.
To avoid horrible performance, a rank structure is needed, which adds a bit of overhead. Likewise, it might be advisable to limit the number of planes, to avoid too many random memory requests. Most promising space/performance setup for the sample data takes up 341MB for N-plane, with the smallest & slowest taking up 275MB. Pro: space, con: speed + needs all counter maxima for initialization.

Bonus: All the logistic parts of the structure (the blue squares and the rank-cache) are static and can thus be shared between counters. If there is more than one concurrent faceted search at a time, each extra counting structure only holds the theoretically smallest worst case of bits, which for this sample is 138MB.

Testing

The micro-benchmark creates 640M sample maxima, randomly generated to follow the bit-histogram from the links-field in our sample shard, and initializes the different counter structures based on this. A list of ordinals is created with random selection, checked to ensure that each unique ordinal only occurs a number of times that is within the corresponding counter’s maximum. Caveat: Random updates are not optimal as the distribution of updates is not random for real searches: It should follow the maxima-distribution. The updates are applied to each counter implementation and performance measured as the median from 9 test-runs.

N-plane is special as it can be configured in various ways. In the schema below, N-Plane(#4, 1/100) means that there are 4 planes and overflow-bits are cached for every 100 bits.

Median updates/ms of 640M counters with max bit 23, for different amounts of updates
Implementation MB 1M upd 5M upd 10M upd 50M upd 100M upd 500M upd
N-plane(#2, 1/1K) 718 20853 13340 13059 15752 12590 4000
N-plane(#4, 1/1K) 311 20887 13339 13058 15749 12554 3969
N-plane(#6, 1/1K) 275 13375 20129 19492 11187 9482 4451
N-plane(#2, 1/100) 745 20938 13548 13460 19146 17420 7674
N-plane(#4, 1/100) 341 20929 13526 13447 18874 17035 7787
N-plane(#6, 1/100) 306 20444 13317 13222 18948 17435 7212
N-plane(#2, 1/20) 865 14027 18900 18773 13261 12380 10512
N-plane(#4, 1/20) 473 13086 20533 20386 12388 11605 11554
N-plane(#6, 1/20) 440 13423 21050 20909 12769 12003 11936
Dual-plane 1221 18734 29952 29940 18797 18794 29911
PackedInts.COMPACT 1755 13518 15324 15346 13562 13565 15296
PackedInts.FAST 2442 17995 18955 19109 19123 19142 19233

Observations

  • Performance of N-plane gets worse with the number of updates, which is to be expected: As the counts gets higher, it needs to visit more planes.
  • Dual-plane has suspicious values for 50M and 100M updates, which mirror suspicious values for COMPACT at the same amounts of updates. As COMPACT should have the same updates/ms performance independent of the number of updates, this indicates a testing problem.
  • PackedInts.FAST is a lot faster than PackedInts.COMPACT. This might be due to chance: The number of significant bits is 23, which aligns very badly with the 64 bit longs used by the COMPACT implementation. Had the number of significant bits been 21 or 24, things might have looked different.

Conclusion

The two new players Dual-plane and N-plane are both very interesting for high-cardinality faceting: Dual-plane as a direct replacement for PackedInts.COMPACT, N-plane as a very low-memory alternative when speed is not of the essence.

The drawback of Dual-plane is that it needs a histogram of maxima. Fortunately this is not costly: It is about the same work as is needed by PackedInts.COMPACT, which requires the overall maximum.

The drawback of N-plane is a lot higher as it needs the full maxima-set for initializing. Performance-wise this is not much of a change from the PackedInts.COMPACT maximum calculation, but it does require temporary memory in the order of 4*term_count, which in this case is 2.5GB.

Net archive indexing, round 2

March 9, 2015 by

Using our experience from our initial net archive search setup, Thomas Egense and I have been tweaking options and adding patches to the fine webarchive-discovery from UKWA for some weeks. We will be re-starting indexing Real Soon Now. So what have we learned?

  • Stored text takes up a huge part of the index: Nearly half of the total index size. The biggest sinner is not surprisingly the content field, but we need that for highlighting and potentially text extraction from search results. As we have discovered that we can avoid storing DocValued fields, at the price of increased document retrieval time, we have turned off storing for several fields.
  • DocValue everything! Or at least a lot more than we did initially. Enabling DocValues for a field and getting low-overhead faceting turned out to be a lot disk-space-cheaper than we thought. As every other feature request from the researchers seems to be “We would also like to facet on field X”, our new strategy should make them at least half happy.
  • DocValues are required for some fields. Due to internal limits on facet.method=fc without DocValues, it is simply not possible to do faceting if the number of references gets high.
  • Faceting on outgoing links is highly valuable. Being able to facet on links makes it possible to generate real-time graphs for interconnected websites. Links with host- or domain-granularity are easily handled and there is no doubt that those should be enabled. Based on positive experimental results with document-granularity links faceting (see the section below), we will also be enabling that.
  • The addition of performance instrumentation made it a lot easier for us to prioritize features. We simply do not have time for everything we can think of and some specific features were very heavy.
  • Face recognition (just finding the location of faces in images, not guessing the persons)  was an interesting feature, but with a so-so success rate. Turning it on for all images would triple our indexing time and we have little need for sampling in this area, so we will not be doing it at all for this iteration.
  • Most prominent colour extraction was only somewhat heavy, but unfortunately the resulting colour turned out to vary a great deal depending on adjustment of extraction parameters. This might be useful if a top-X of prominent colours were extracted, but for now we have turned off this feature.
  • Language detection is valuable, but processing time is non-trivial and rises linearly with the number of languages to check. We lowered the number of detected languages from 20 to 10, pruning the more obscure (relative to Danish) languages.
  • Meta-data about harvesting turned out to be important for the researchers. We will be indexing the ID of the harvest-job used for collecting the data, the institution responsible and some specific sub-job-ID.
  • Disabling of image-analysis features and optimization of part of the code-base means faster indexing. Our previous speed was 7-8 days/shard, while the new one is 3-4 days/shard. As we have also doubled our indexing hardware capacity, we expect to do a full re-build of the existing index in 2 months and to catch up to the present within 6 months.
  • Our overall indexing workflow, with dedicated builders creating independent shards of a fixed size, worked very well for us. Besides some minor tweaks, we will not be changing this.
  • We have been happy with Solr 4.8. Solr 5 is just out, but as re-indexing is very costly for us, we do not feel comfortable with a switch at this time. We will do the conservative thing and stick to the old Solr 4-series, which currently means Solr 4.10.4.

Document-level links faceting

The biggest new feature will be document links. This is basically all links present on all web pages at full detail. For a single test shard with 217M documents / 906GB, there were 7 billion references to 640M unique links, the most popular link being used 2.4M times. Doing a full faceted search on *:* was understandably heavy at around 4 minutes, while ad hoc testing of “standard” searches resulted in response times varying from 50 ms to 3500 ms. Scaling up to 25 shards/machine, it will be 175 billion references to 16 billion values. It will be interesting to see the accumulated response time.

We expect this feature to be used to generate visual graphs of interconnected resources, which can be navigated in real-time. Or at least you-have-to-run-to-get-coffee-time. For the curious, here is the histogram for links in the test-shard:

References #terms
1 425,799,733
2 85,835,129
4 52,695,663
8 33,153,759
16 18,864,935
32 10,245,205
64 5,691,412
128 3,223,077
256 1,981,279
512 1,240,879
1,024 714,595
2,048 429,129
4,096 225,416
8,192 114,271
16,384 45,521
32,768 12,966
65,536 4,005
131,072 1,764
262,144 805
524,288 789
1,048,576 123
2,097,152 77
4,194,304 1

 

Long tail packed counters for faceting

February 26, 2015 by

Our home brew Sparse Faceting for Solr is all about counting: When calculating a traditional String facet such as

Author
- H.C.Andersen (120)
- Brothers Grimm (90)
- Arabian Nights (7)
- Carlo Collodi (5)
- Ronald Reagan (2)

the core principle is to have a counter for each potential term (author name in this example) and update that counter by 1 for each document with that author. There are different ways of handling such counters.

Level 0: int[]

Stock Solr uses an int[] to keep track of the term counts, meaning that each unique term takes up 32 bits or 4 bytes of memory for counting. Normally that is not an issue, but with 6 billion terms (divided between 25 shards) in our Net Archive Search, this means a whopping 24GB of memory for each concurrent search request.

Level 1: PackedInts

Sparse Faceting tries to be clever about this. An int can count from 0 to 2 billion, but if the maximum number of documents for any term is 3000, there will be a lot of wasted space. 2^12 = 4096, so in the case of maxCount=3000, we only need 12 bits/term to keep track of it all. Currently this is handled by using Lucene’s PackedInts to hold the counters. With the 6 billion terms, this means 9GB of memory. Quite an improvement on the 24GB from before.
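
Using Lucene’s PackedInts, allocating such a counter structure is roughly as sketched below (illustrative only, not the actual sparse faceting code):

    import org.apache.lucene.util.packed.PackedInts;

    public class PackedCounters {
        public static PackedInts.Mutable create(int uniqueTerms, long maxCount) {
            int bitsPerValue = PackedInts.bitsRequired(maxCount);  // 12 bits for maxCount=3000
            return PackedInts.getMutable(uniqueTerms, bitsPerValue, PackedInts.COMPACT);
        }
        // Incrementing the counter for a term ordinal:
        // counters.set(ordinal, counters.get(ordinal) + 1);
    }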

Level 2: Long tail PackedInts with fallback

Packing counters has a problem: The size of all the counters is dictated by the maxCount. Just a single highly used term can nullify the gains: If all documents share one common term, the size of the individual counters will be log(docCount) bits. With a few hundred millions of documents, that puts the size to 27-29 bits/term, very close to the int[] representation’s 32 bits.

Looking at the Author-example at the top of this page, it seems clear that the counts for the authors are not very equal: The top-2 author has counts a lot higher than the bottom-3. This is called a long tail and it is a very common pattern. This means that the overall maxCount for the terms is likely to be a lot higher than the maxCount for the vast majority of the terms.

While I was loudly lamenting all the wasted bits, Mads Villadsen came by and solved the problem: What if we keep track of the terms with high maxCount in one structure and the ones with a lower maxCount in another structure? Easy enough to do with a lot of processing overhead, but tricky to do efficiently. Fortunately Mads also solved that (my role as primary bit-fiddler is in serious jeopardy). The numbers in the following explanation are just illustrative and should not be seen as the final numbers.

The counter structure

We have 200M unique terms in a shard. The terms are long tail-distributed, with the most common ones having maxCount in the thousands and the vast majority with maxCount below 100.

We locate the top-128 terms and see that their maxCount range from 2921 to 87. We create an int[128] to keep track of their counts and call it head.

head
Bit         31 30 29  …  2  1  0
Term h_0     0  0  0  …  0  0  0
Term h_1     0  0  0  …  0  0  0
…
Term h_126   0  0  0  …  0  0  0
Term h_127   0  0  0  …  0  0  0

The maxCount for the terms below the 128 largest ones is 85. 2^7=128, so we need 7 bits to hold each of those. We allocate a PackedInt structure with 200M entries of 7+1 = 8 bits and call it tail.

tail
Bit                7* 6  5  4  3  2  1  0
Term 0              0  0  0  0  0  0  0  0
Term 1              0  0  0  0  0  0  0  0
…
Term 199,999,998    0  0  0  0  0  0  0  0
Term 199,999,999    0  0  0  0  0  0  0  0

The tail has an entry for all terms, including those in head. For each of the large terms in head, we locate its position in tail. At that tail-counter, we set the value to the term’s index in the head counter structure and set the highest bit to 1.

Let’s say that head entry term h_0 is located at position 1 in tail, h_126 is located at position 199,999,998 and h_127 is located at position 199,999,999. After marking the head entries in the tail structure, it would look like this:

tail with marked heads
Bit                 7*   6   5   4   3   2   1   0
Term 0               0   0   0   0   0   0   0   0
Term 1               1   0   0   0   0   0   0   0
…
Term 199,999,998     1   1   1   1   1   1   1   0
Term 199,999,999     1   1   1   1   1   1   1   1

Hanging on so far? Good.

Incrementing the counters

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if (count & 128 == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
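
A minimal Java sketch of this update loop, assuming head is the int[128] and tail the 8-bit PackedInts described above, with the head terms already marked in tail (illustrative only, not the actual Sparse Faceting code):

import org.apache.lucene.util.packed.PackedInts;

class HeadTailCounters {
  final int[] head = new int[128];           // counters for the 128 most popular terms
  final PackedInts.Mutable tail;             // 7 count bits + 1 marker bit per term

  HeadTailCounters(int uniqueTerms) {
    tail = PackedInts.getMutable(uniqueTerms, 8, PackedInts.COMPACT);
    // the head terms are assumed to be marked in tail as shown in the table above
  }

  void increment(int ordinal) {
    long value = tail.get(ordinal);
    if ((value & 128) == 128) {              // bit 7 set: the low 7 bits index into head
      head[(int) (value & 127)]++;
    } else {                                 // bit 7 clear: the value is the count itself
      tail.set(ordinal, value + 1);
    }
  }
}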

Pros and cons

In this example, the counters take up 6 billion * 8 bits + 25 * 128 * 32 bits = 5.7GB. The performance overhead, compared to the PackedInts version, is tiny: Whenever a head bit is encountered, there is an extra read to get the old head value before writing value+1. As head will statistically be heavily used, it is likely to be in Level 2/3 cache.
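
Spelled out, that is 6 billion tail entries at 1 byte each, roughly 5.6GiB, while the head structures across the 25 shards only add 25 * 128 * 4 bytes ≈ 13KB, which is negligible.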

This is just an example, but it should be quite realistic, as approximate values from the URL field in our Net Archive Index have been used. Nevertheless, it must be stressed that the memory gains from long tail PackedInts are highly dictated by the shape of the long tail curve.

Afterthought

It is possible to avoid the extra bit in tail by treating the large terms as any other term until their tail-counters reach the maximum (127 in the example above). When a counter’s max has been reached, the corresponding head-counter can then be located using a lookup mechanism, such as a small HashMap or maybe just a linear scan through a short array with the ordinals and counts for the large terms. This would reduce the memory requirements to approximately 6 billion * 7 bits = 5.3GB. Whether this memory/speed trade-off is better or worse is hard to guess and depends on result set size.

Implementation afterthought

The long tail PackedInts could implement the PackedInts-interface itself, making it usable elsewhere. Its constructor could take another PackedInts filled with maxCounts or a histogram with maxbit requirements.
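
A sketch of how such a constructor might use a maxbit histogram to choose the tail size (all names are hypothetical; this is not Sparse Faceting code):

// histogram[b] = number of terms whose maxCount requires b bits.
class LongTailSizer {
  static int chooseTailBPV(long[] histogram) {
    long totalTerms = 0;
    for (long terms : histogram) {
      totalTerms += terms;
    }
    long bestBits = Long.MAX_VALUE;
    int bestBPV = -1;
    for (int tailBPV = 1; tailBPV < histogram.length; tailBPV++) {
      long headTerms = 0;                       // terms overflowing tail, promoted to head
      for (int b = tailBPV + 1; b < histogram.length; b++) {
        headTerms += histogram[b];
      }
      if (headTerms > (1L << tailBPV)) {
        continue;                               // pointer must fit in the tail value bits
      }
      // +1 marker bit per tail entry, 32 bits per head counter
      long bits = totalTerms * (tailBPV + 1) + headTerms * 32L;
      if (bits < bestBits) {
        bestBits = bits;
        bestBPV = tailBPV;
      }
    }
    return bestBPV;
  }
}

This mirrors the trade-off tabulated in the later real-numbers section, where different tail BPVs are weighed against head sizes and memory savings.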

Heureka update 2015-02-27

There is no need to mark the head terms in the tail structure up front. All the entries in tail act as standard counters until the highest bit is set. At that moment the bits used for counting switch to being a pointer into the next available entry in the head counters. The update workflow thus becomes

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if (count & 128 == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
  5. If the counter reaches bit 7, move the count to head and change the counter-bits to be pointer-bits:
    if ((count+1) == 128) { head.set(headpos, count+1); tail.set(ordinal, 128 | headpos++) }

Pros and cons

The new logic means that initialization and resetting of the structure is simply a matter of filling it with 0. Update performance will be on par with the current PackedInts implementation for all counters whose value is within the cutoff. After that, the penalty of an extra read is paid, but only for the overflowing values.

The memory overhead is unchanged from the long tail PackedInts implementation and still suffers from the extra bit used for signalling count vs. pointer.

Real numbers 2015-02-28

The store-pointers-as-values approach has the limitation that there can only be as many head counters as the maxCount for tail allows. Running the numbers on the URL field for three of the shards in our net archive index resulted in bad news: The tip of the long tail shape was not very pointy, and it is only possible to shave 8% off the counter size. Far less than the estimated 30%. The Packed64 in the table below is the current structure used by sparse faceting.

Shard 1 URL: 228M unique terms, Packed64 size: 371MB
tail BPV   required memory   saved        head size
11         342MB             29MB /  8%   106
12         371MB              0MB /  0%     6

However, we are in the process of experimenting with faceting on links, which has a markedly more pointy long tail shape. From a nearly fully built test shard we have:

8/9 built shard, links: 519M unique terms, Packed64 size: 1427MB
tail BPV   required memory   saved          head size
15         1038MB            389MB / 27%    14132
16         1103MB            324MB / 23%     5936
17         1168MB            260MB / 18%     3129
18         1233MB            195MB / 14%     1374
19         1298MB            130MB /  9%      909
20         1362MB             65MB /  5%      369
21         1427MB              0MB /  0%       58

For links, the space saving was 27% or 389MB for the nearly-finished shard. To zoom out a bit: Doing faceting on links for our full corpus with stock Solr would take 50GB. Standard sparse faceting would use 35GB and long tail would need 25GB.

Due to sparse faceting, response time for small result sets is expected to be a few seconds for the links-facet. Larger result sets, not to mention the dreaded *:* query, would take several minutes, with worst-case (qualified guess) around 10 minutes.

Three-level long tail 2015-02-28

Previously:

  • pointer-bit: Letting the values in tail switch to pointers when they reach maximum has the benefit of very little performance overhead, with the downside of taking up an extra bit and limiting the size of head.
  • lookup-signal: Letting the values in tail signal “find the counter in head” when they reach maximum, has the downside that a sparse lookup-mechanism, such as a HashMap, is needed for head, making lookups comparatively slow.

New idea: Mix the two techniques. Use the pointer-bit principle until there is no more room in head. head-counters beyond that point all get the same pointer (all value bits set) in tail and their position in head is determined by a sparse lookup-mechanism ord2head.

This means that

  • All low-value counters will be in tail (very fast).
  • The most used high-value counters will be in head and will be referenced directly from tail (fast).
  • The less used high-value counters will be in head and will require a sparse lookup in ord2head (slow).

Extending the pseudo-code from before:

value = tail.get(ordinal)
if (value == 255) { // indirect pointer signal
  head.inc(ord2head.get(ordinal))
} else if (value & 128 == 128) { // pointer-bit set
  head.inc(value & 127)
} else { // term-count = value
  value++;
  if (value != 128) { // tail-value ok
    tail.set(ordinal, value)
  } else { // tail-value overflow: move the count to head
    head.set(headpos, value)
    if (headpos < 127) { // direct pointer
      tail.set(ordinal, 128 | headpos++)
    } else { // indirect pointer
      tail.set(ordinal, 255)
      ord2head.put(ordinal, headpos++)
    }
  }
}

Exhaustive ID filters in Solr

February 4, 2015 by

Analysis of the material in our Net Archive index is challenging for many reasons. One of them is agreeing on what we’re looking at: It is rarely the full amount of data; our researchers want to define subsets. Subsets can be sliced in different ways: At (W)ARC level, as manual selection of specific URLs harvested at specific times, as a graph of links, etc. They can also be specified with Solr. Using Solr, there are two common and technically distinct ways of specifying subsets:

  • Subset definition by simple boolean expressions: “All material from the domain example.com harvested in 2011” or “All PDF files that mention ‘disneyland after dark’ and have a link to domain:d-a-d.dk”.
    Such subsets are extremely easy to work with in Solr, as they are just filters: They are very fast and work fully with faceting, sorting and grouping.
  • Subset definition by weighting: “One instance of all URLs, de-duplicated by taking the one as close to June 1, 2011 as possible”.
    For some functionality, such as extracting simple top-X results, this can be handled by grouping in Solr. Unfortunately this scales poorly with faceting, grouping within groups is not a Solr feature, and some functionality, such as counting the number of distinct groups (the result set size), cannot be done reliably with SolrCloud.

So what can we do about the pesky subset-by-weighting? The idea is to add support for exhaustive ID filters, where exhaustive means potentially all documents at single-document granularity level. With such a filter we could define arbitrary subsets on an existing index, not even constrained by Solr’s own powers of expression. It looks a bit daunting with billions of documents, but given some constraints it seems doable.

Creating the document list

The first problem is to determine which documents should be in the filter. If the documents are defined outside of Solr, this could just be a list of [URL, timestamp] or similar parameters uniquely identifying the documents. If the documents are to be aggregated from the SolrCloud, some manual work might be needed.

Let’s see how to do it for “One instance of all URLs, de-duplicated by taking the one as close to June 1, 2011 as possible”. In a single-shard setup it might be possible to do it with grouping on URL (1 entry/group) and sorting by distance. As stated before, this scales badly or not at all with SolrCloud and things like faceting or paging through result sets. But the principle itself is simple:

  1. For each shard, extract all the documents as tuples of [URL, date, ID], sorted by URL. This is very easily done with deep paging. If further restrictions, such as “only documents from 2010-2012”, are needed, this is done by applying a filter before extraction.
  2. Process the resulting lists individually and extend the tuples with shard-ID to [URL, date, ID, shardID].
  3. Merge-sort the list of tuples, primarily by URL, secondarily by closeness to 2011-06-01.
  4. Prune the aggregated list by only keeping the first entry for each URL.
  5. Split the aggregated list into shard-specific lists, using the shard-ID part of the tuples, keeping only the IDs.

The astute reader will have noticed that steps 1-5 can be performed in a streaming manner: The only limit to the size of the resulting ID-lists is storage space.
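
A minimal Java sketch of steps 3-5 (merge, prune and split), assuming each per-shard extract from steps 1-2 arrives as a URL-sorted stream of [URL, distance-to-2011-06-01, ID, shard-ID] tuples; all class and method names here are made up for illustration:

import java.util.*;

class SubsetBuilder {
  static class Doc {
    String url, id, shardId;
    long distance;                              // |harvest date - 2011-06-01|, from step 2
  }
  static class Head {                           // current front of one shard stream
    final Doc doc;
    final Iterator<Doc> rest;
    Head(Doc doc, Iterator<Doc> rest) { this.doc = doc; this.rest = rest; }
  }

  // Merge the URL-sorted shard streams, keep the closest document per URL and
  // return one ID list per shard.
  static Map<String, List<String>> mergeAndPrune(List<Iterator<Doc>> shardStreams) {
    PriorityQueue<Head> queue =
        new PriorityQueue<>(Comparator.comparing((Head h) -> h.doc.url));
    for (Iterator<Doc> stream : shardStreams) {
      if (stream.hasNext()) queue.add(new Head(stream.next(), stream));
    }

    Map<String, List<String>> idsPerShard = new HashMap<>();
    Doc best = null;                            // closest-to-date document for the current URL
    while (!queue.isEmpty()) {
      Head head = queue.poll();
      Doc doc = head.doc;
      if (head.rest.hasNext()) queue.add(new Head(head.rest.next(), head.rest));

      if (best == null || !best.url.equals(doc.url)) {
        if (best != null) {                     // URL group finished: emit its winner
          idsPerShard.computeIfAbsent(best.shardId, s -> new ArrayList<>()).add(best.id);
        }
        best = doc;
      } else if (doc.distance < best.distance) {
        best = doc;                             // same URL, closer to 2011-06-01
      }
    }
    if (best != null) {
      idsPerShard.computeIfAbsent(best.shardId, s -> new ArrayList<>()).add(best.id);
    }
    return idsPerShard;
  }
}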

Named filters

At this point we have a list of entries uniquely defining documents, for each shard. We need a Solr component that can

  1. Take a user-assigned filter-ID and a URL to the list of document-definitions.
  2. Create an empty docset/bitmap, the same kind a standard filter uses.
  3. For each entry in the list of document-definitions, resolve the internal docID for the document and set that bit in the bitmap.
  4. Store the resulting bitmap for later usage, just like a standard filter, with the user-assigned filter-ID as key.
  5. When the user issues queries specifying filter-IDs, the docID bitmaps are fetched and applied as standard Solr filters, with the same minimal processing overhead.
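
A rough sketch of steps 2-3 at the Lucene level (the resolveDocID helper and the class name are hypothetical; this is not an existing Solr component):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import org.apache.lucene.util.FixedBitSet;

class NamedFilterBuilder {
  // Build a filter bitmap from a file with one document-ID per line.
  static FixedBitSet buildFilter(Path idList, int maxDoc) throws IOException {
    FixedBitSet bits = new FixedBitSet(maxDoc);       // same shape as a standard filter
    try (BufferedReader reader = Files.newBufferedReader(idList, StandardCharsets.UTF_8)) {
      String id;
      while ((id = reader.readLine()) != null) {
        int docID = resolveDocID(id);                 // hypothetical: map the unique ID to
        if (docID >= 0) {                             // the internal Lucene docID
          bits.set(docID);
        }
      }
    }
    return bits;
  }

  // Hypothetical helper: a real component would resolve the ID through the
  // searcher's unique key field, e.g. with a TermsEnum lookup.
  static int resolveDocID(String id) { return -1; }
}

The resulting bitmap could then be cached under the user-assigned filter-ID and intersected with queries like any other filter.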

There is a lot of potential for optimization here:

  • It would be possible for each line in the list of document-definitions to be viewed as a full-blown Solr query, seeing the whole list as a giant boolean OR query. Lines could be processed in batches by enclosing each line in the batch in parentheses and interleaving with OR, before processing it as a plain query.
  • Treating each line as a query allows the filter to be specified any way the user wants: While the granularity of the list goes down to document level, it also goes up to the full index.
  • It would be possible to see each line as a term that must match a single field (normally the ID field). Again requests could be batched, this time using the very efficient TermsQuery.
  • For ID-based lists, pre-sorting the list would open up for very efficient processing: The Solr query mechanism could be bypassed and the internal structure for the ID-field could be traversed sequentially.

Once again this can be streamed (except for the sorting idea), with no limit on the size of the list of document-definitions. Theoretically this could be tied to the generation of the corpus, avoiding temporary storage of the lists. But that would be a later project.

What we get

  • The ability for researchers to specify shared subsets of any size at document granularity, without changing the indexes.
  • Besides relevance ranking, using such filters would be equivalent to building a whole new index from the defined subset.
  • High initial cost of defining and creating filters, but extremely low subsequent overhead of using them.

Caveats

Creating the document lists would likely be very heavy, but as the premise is to create subsets of the corpus, this would normally be done once per subset defined by the researchers.

Creating the filter itself would likely be less heavy, but the processing of hundreds of millions of tiny queries, even with batching, is measured in minutes or hours. Due to the nature of filters, this step must be repeated if the index changes, so exhaustive filters would only be feasible with a rarely-updated index.

Some of Solr’s built-in functionality, such as the distance to a given date for the example case, would have to be duplicated in the external merge-sorter.

Prior work

Constraining searches by large lists of IDs is not a new idea.

  • Defining the IDs at query time is not feasible at net archive scale; even with bit packing a request could potentially be measured in gigabytes.
  • Treating the task as a security filter problem holds promises, at least for the filtering part.
  • Adding a token to the documents in a subset would work, but it makes experimentation very heavy and (of course) requires the indexes to be updated.
  • The Terms Filter Lookup in Elasticsearch seems quite similar to the persistent filter described above. As does SOLR-1715, which unfortunately is talk-only at this point.
  • Update 2015-02-05: The example of generating the document list seems to fit extremely well with Heliosearch’s Streaming Aggregation, which will hopefully be part of Solr with SOLR-7082.
