Ten times faster

August 26, 2014 by

One week ago I complained about Solr’s two-phase distributed faceting being slow in the second phase – ten times slower than the first phase. The culprit was the fine-counting of top-X terms, with each term-count being done as an intersection between regular hits and a special search for the term in question.

Let’s have a quick look at the numbers from last week (note that the scales are logarithmic):

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

Imprecise facet counting aka cheating

The simple way to get fast distributed faceting for high cardinality fields is to skip the second phase and accept that the term counts for faceting might be a bit off, where “a bit” is highly dependent on corpus & query. An extremely quick ad-hoc test with our corpus suggested around 10% deviation from the correct counts. The skipping requires just 3 lines of code, strategically placed.

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Apologies for the colors not matching between the charts. The median for plain Solr faceting is 1608 ms. For imprecise faceting counting, using sparse faceting for first phase, it is 168 ms. Quite a speed up! Read Whale hunting with Solr for an explanation of the weird response times below 100 hits.

Since we are already cheating and getting imprecise counts, we might as well limit the maximum count for each term. In our 12 shards, the maximum count for a single URL in any shard is a little below 5000, with the long tail very quickly dropping. Most counts are below 10 in a single shard. With a count of 5000, we need 13 bits to hold the counter, meaning 3.6 billion terms / 13 bits/term ~= 5.44 GB for all counter structures or about 0.45 GB / shard / thread. If we lower this max count to 255 / shard, so that a single counter fits in a byte, we get faster faceting and reduce the memory overhead to 3.6 GB total or 300 MB / shard / thread.

Alas, some of us think that all this cheating is getting out of hand…

Once more, with feeling!

It was possible to speed the first phase of Solr faceting by doing sparse counting, so let’s try that again: For the second phase, we do a near complete repetition of the first phase, so the counts for all terms in the facet field are calculated. However, instead of calculating and extracting the top-X terms, only the counts for the requested terms are extracted from the counting structure. Extraction of a count from the structure requires resolving of the ordinal for the term in question. This does take some time, but the hope was that this would not give too much overhead. So did it help?

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting


This is getting a bit chaotic and it is hard to see all the cyan dots hiding between the green ones. Trusty old percentile plot to the rescue:

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting with sparse term lookup

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting with sparse term lookup

Now we can see! With the second phase of faceting being nearly as fast as first phase, total faceting time for small result sets is looking quite good. If we lump all the measurements for each method together, we get

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Note how the median for skip_secondary is a lot lower than for the previous test run – it seems that someone else was using the machine at that time. The outputs has been verified by random inspection: It really only takes a few hundred milliseconds to facet on small result sets out of more than 3 billion documents on a single machine. Just as it should, one might add. It remains to be tested if smaller setups benefits just as much.

We’re not done yet

The implementation is not ideal as the exact same work – counting of all term occurrences – is very often done twice. It would be a lot better to cache it. When releasing a counter structure to the pool of structures, it could be released with a tag stating if it would probably be used again (first phase of distributed faceting) or if it would probably not be needed any more (after the second phase). Guesstimating, this should shave 30-40% of the total time for faceting with sparse term lookup.

Should anyone want to try sparse faceting for themselves, then visit https://github.com/tokee/lucene-solr and check out branch lucene_solr_4_8_sparse or lucene_solr_4_9_sparse. You will need an existing Solr index for proper testing. Refer to the file SparseKeys.java for options. The defaults works fine if the parameters facet.sparse=true and facet.sparse.termlookup=true are given and the requested facet field has over 10,000 unique values and facet.method=fc. To disable the second phase completely, add the parameter facet.sparse.skiprefinements=true. Proper documentation pending.

If you want to see this in Solr, visit SOLR-5894 and vote for it.

Update 20140826 22:26

To verify things, the experiment was repeated with 10 minute running time for each faceting method (as opposed to 3 minutes). This did not affect the conclusion, but might add a little bit of information.

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

The first thing to note is how response time rises near linear to result set size, when the result set is 5M or more. It would be interesting to investigate what happens around the 300M (8% of the corpus size) mark, which is the limit chosen for sparse faceting for this setup.

The second observation is that all methods except stock Solr (the blue dots) seems to have two clusters, one below 100ms and one above. As the no_facet method is single phase (note to self: Double check if this is true), this cannot be explained by the second phase being skipped. Maybe there is some caching effect? The queries should be as good as unique, so it is not just because of simple request caching.

For an alternative illustration, here’s the same data as above but without the logarithmic y-scale:

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Ten times slower

August 15, 2014 by

I jumped the gun on our current web index status with our whale hunting safari, but it turns out that there are other fish to kill (sorry, whales are not fish and I’ll stop with the nautical metaphors immediately). This time we’re talking faceting on billions of documents.

At Statsbiblioteket we are building a SolrCloud web index of our harvested material. 9 months ago that was 372 TB or 10 billion entities. It has probably passed 400 TB by now. Our insane plan was to index it, put the indexes on a single machine and to do faceted searches. Because it made sense and maybe a little because it is pure awesome to handle that amount of information from a single box. Read about our plans in the post Webscale in Danish.

It has been about three months since we last checked how things were with our searchable web archive at Statsbiblioteket. Back then it contained 4 shards for a total of 3.6 TB or 1.2 billion documents, on our single 16 core machine with 256 GB of RAM. Simple searches were fast and faceting on URL (nearly unique for every document) equally so, even when we lowered the amount of RAM to 80 GB, which meant that only 1% of the total index data could be cached in RAM. The two graphs below illustrates the state back then.

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

Sparse faceting is markedly better than stock Solr under our test; performance is very satisfactory with most response times well below 1 second. It is important to note that the distributed faceting calls are executed with a single thread in each shard for this setup. This means that only 4 CPUs were fully utilized at a time during a single faceted search.

Yesterday we reached 12 shards for a total of 10.7 TB of index data in 3.6 billion documents. Turning off the artificial RAM limitation left us with 140 GB of RAM for disk cache or 1.3% of the full index size. So more RAM for cashing than we had with 4 shards and still plenty of CPU power as the machine has 16 cores (*2 if you count HyperThreading). Of course the merging gets a little heavier, but not too much. In an ideal world this would mean that the performance would be unchanged, right?

No, we did not think so either. So how bad is it?


256GB RAM, 1 thread, 12 shards, 10TB, random words, Solr & sparse faceting on URL

256GB RAM, 1 thread, 12 shards, 10TB, random words, Solr & sparse faceting on URL

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL

Ouch. 2½ second median? That’s bad! But wait a minute… Doesn’t that percentile plot look like a whale? And how come this is so much slower than our 4 shard setup and how come it is faster to facet on 100,000 values than it is to facet on 10,000? Time to check the logs.

Distributed Solr faceting is two-phase. First phase is standard faceting (find the top-X facet terms for a given search). The merger then collects the results, sums them and extracts the collective top-X terms. Second phase is to ask the shards for the counts for those terms, to get the correct count as the terms might not have been returned from all shards in the first phase. The merger is smarter than that, but the principle holds.

It seems logical that second phase is faster than first phase: It just has to calculate counts for a limited amount of specified terms instead of performing full facet processing. Let’s go back to the logs and plot the response times for first and second phase separately. Note that these numbers are from the 12 Solr logs at shard-request level and not at the full distributed call level.

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

There goes that logical conclusion. The second phase takes more than 10 times as long as first phase! What is happening? We need to dig deeper and look at the surprisingly simple code for second phase:

private NamedList getListedTermCounts(String field, String termList, DocSet base) throws IOException {
  FieldType ft = searcher.getSchema().getFieldType(field);
  List<String> terms = StrUtils.splitSmart(termList, ",", true);
  NamedList<Integer> res = new NamedList<Integer>();
  for (String term : terms) {
    String internal = ft.toInternal(term);
   int count = searcher.numDocs(new TermQuery(new Term(field, internal)), base);
   res.add(term, count);
 return res;

We have the result of the query in the bitset base; with shards of 300 million documents, that is a rather large bitset. For each term in the second phase facet request, a simple search is performed for facet_field:specific_term. This results in another bitset. The number of intersecting bits in these two sets is the count for the term.

The problem here is that we are doing intersections of very large bitsets. Potentially they can be represented by compact hashmaps or other structures, but the log tells us that this still takes quite a lot of time for a corpus of this size. Time that grows as the number of shards grows.

Guessing time: If at least one of the bitsets is a bitmap with 1 bit per document in the index, that takes up about 40 MB of heap, which is accessed when doing the intersection. If there are 20 terms in the request (quite likely as we ask for top-20 on URL), this is done 20 times. So a least 800MB of memory is accessed. With 12 shards doing faceting in parallel, this is just shy of 10 GB of memory. It is hard to measure memory wait time, but it seems a likely culprit that we are simple waiting for main memory.

The shape of the phase 2 plot is easily explainable: With 10 to 10,000 hits, all shards must provide counts for nearly the full set of terms. When we get above 100,000, the chances of any shard having already delivered the count for part of the top X terms rises; when it has already been delivered, it will not be requested again in the second phase so the workload gets smaller.

So what to do about that? The easiest thing is to skip the second phase completely. That would give us the great response times instantaneous at the cost of precision: The counts for the individual terms in the facet might be a bit off. But maybe there is a way to get the fast speed and the precise counts? The full facet counting out in the shards in the first phase was quite a lot faster, so if we do that again (or cache the old result), we have counts for all terms in the facet. For each specified term, we could get its ordinal (for example by binary search, meaning 36 lookups in our concrete ordinal-term map) and with that, we could get the count directly. Ordinal-term lookups are somewhat heavy as the index data on storage needs to be accessed, so it remains to be seen if this way of handling the second phase is faster than the standard one. Time to code again.

Whale hunting with Solr

August 13, 2014 by

Our web archive index passed the 10TB mark a few days ago, so it was time for new performance measurements. To recap: 12 shards @ 900 GB, a total of 10.7TB or 3.6 billion documents. Served from a single 256GB machine with a 1 TB SSD dedicated for each shard.

We started by simple sequential searches for random words from a Danish dictionary. No faceting or other fancy stuff. The requests were for top-10 documents with their stored content. We measured the full processing-time (i.e. HTTP-get) and got this:

10.7TB, 3.6B, simple searches

Non-faceting searches on a 10.7TB, 3.6B document Solr index

We call it the whale and we have been a bit obsessed with it, since we discovered it 3 months ago when we saw it with 4 shards. Response times for 100 to 1 million hits are great, but what happens with the response times around 10 hits!? Inspection of the Solr logs showed nothing suspicious: Solr’s reported processing time (QTime) for the individual shards were 1 or 2 ms for the queries in question, while QTime for the merging Solr instance was 20-50 ms. Those are fine numbers.

Some of the queries with 10 hits were “blotlægge gøglertrup”, “eksponeringstid attestering” and “fuldkost hofleverandør” (quite funny in Danish actually; remember the words were selected at random). Those searches all took around 500 ms, measured from the outside of Solr, with reported QTimes below 50 ms. Could it be a HTTP-pequliarity, as Mikkel Kamstrup suggested? Diving into the concrete responses illuminated us.

Simple queries with very few hits in a large corpus happens because the query terms rarely occur in the same document. So which documents has a high chance of co-occurrence of random words from the dictionary? A dictionary of course! In a (hopefully vain) attempt of “search engine optimization”, some Danish web pages has embedded a dictionary below the real content (assumedly hidden by making the font color the same as the background or something like that). Normally such pages are ranked low due to the magic of Lucene/Solr, but with very few hits, they still become part of the search result.

So, statistically speaking, searches with few results gives us huge pages. Internally in Solr they are still processed quite fast (hence the fine QTime-numbers), but serializing the result to XML is not a light task, when the result is measured in megabytes. Had we just requested a few fields, such as URL and content_type, there would have been no hiccup. But we requested everything stored, including content_text. If we just request 15 limited-length fields for each documents and repeat the test, we get this:

10.7TB 3.6B

Non-faceting searches on a 10.7TB, 3.6B document Solr index, limited fields returned

Now that was strange. We got rid of the hump back, but overall the performance suffered? Does it take more time to ask for specific stored fields instead of all? Still, response times below 100 ms for the majority of searches is quite acceptable. Mystery considered solved!

Terabyte index, search and faceting with Solr

June 17, 2014 by

Vision, data & index building

Providing responsive freetext search and display with faceting and grouping for the full Danish net archive, for a limited number of researchers. The data in the net archive has been harvested from *.dk-domains using the Heritrix harvester since 2005 and currently amounts to 450TB, with approximately 10 billion entities.

Indexing is done with netarchive/netsearch, developed primarily by Thomas Egense and based on UKWA Webarchive Discovery: A central service keeps track of ARC-files, X controllers requests the path for ARC-files and keeps Y workers running. Each worker uses Tika to analyse the given ARC-file and sends the generated Solr documents to a Solr instance, specified by its controller. When the wanted index size is reached (900GB in our instance), the index is optimized down to a single segment and pushed to the search server.

Currently indexing is done on a single 24 core (48 with HT) Linux machine with 256GB RAM and 7TB SSD in RAID 0, running all parts of the indexing workflow. Sweet spot for that machine is 40 workers and 1 Solr, resulting in 90%+ CPU usage, primarily used by the Tika workers. It takes about 8 days to build one 900GB index. As of 2014-06-17, 4 such indexes has been build.

The indexing machine is not very well balanced with way too much RAM: Each worker runs fine with 1GB, Solr takes 32GB in order to handle merging down to a single 900GB index; 80GB would be enough. The SSDs in RAID 0 also have too much free space; 3-4TB would work fine with room for 2 shards. We expect the machine to be used for other jobs when the full indexing has finished and we switch to a much lighter day-to-day index update.

Search architecture

A single rack-mounted Linux server is responsible for handling the full search load. It is an 16 core (32 with HT) machine with 256GB RAM and 2 disk controllers with a total of 24 1TB commodity Samsung 840 SSDs, each mounted individually, each holding a separate index, each handled by a separate Solr instance. Distributed search is done with SolrCloud. The total cost for the search hardware is < €20K.

Search in the web archive is not high-availability – we accept that there can be downtime. Should a SSD fail, search will be unavailable until a new one has been installed and its index restored from backup. We are looking into using the backup files for failed drives directly from the backup storage as a temporary measure until the drives are ready again, but that is only at the investigation stage.

Basic search performance

At the time of testing, the index consists of 4 shards @ 900GB for a total of 3.6TB index data with 1.2 billion documents. Each Solr instance (one per shard) has 8GB of heap. As the machine is build for 20-24 shards, the index data represents just 1/6th of the expected final size. This leaves the machine vastly overpowered in its current state, with a surplus of CPUs and ~220GB of RAM for disk caching.

How overpowered? We tested back when the index was only 3 shards for a total of 2.7TB: User issued queries are handled with edismax on 6 fields and 1 phrase query on the catch-all field, a max of 20 returned documents with 10-15 stored fields of varying size. We tried hitting the server with just a single thread:

1 thread, 3 shards, 2.7TB, random words, no faceting

256GB RAM, 1 thread, 3 shards, 2.7TB, random words, no faceting

Response times way below 100ms when the number of hits are below 1 million, better than linear scaling above that? On an unwarmed index? Time to up the ante! What about 20 threads, this time on 4 shards for a total of 3.6TB?

20 threads, 4 shards, 3.6TB, random words, no faceting

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting

It looks a bit like a whale and with 20K points, it is very hard to make sense of. Time to introduce another way of visualizing the same data:

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting, percentile plot

256GB RAM, 20 threads, 4 shards, 3.6TB, random words, no faceting

This is a Box and Whisker plot, showing the quartiles as well as the min and max response times. The measurements are bucketed with 1-9 hits in the first bucket, 10-99 hits in the next and so forth. Again the magic point seems to be around 1M hits before performance begins to drop. The throughput was 66 searches/second. Repeating the search with 40 threads resulted in about the same throughput and about a doubling of the response times, which indicates that the 16 CPUs is the bottleneck.

Now, the Danish web archive is not Google. Due to legislation, the number of concurrent users will normally be 1 and searches will involve statistics and drill-downs, primarily meaning facets. While very impressive, the measurements above are really not representative of the expected use scenario. Time to tweak the knobs again.

Faceting on high-cardinality fields

For the end-scenario, we plan on faceting on 6 fields. One of them is the URL of the harvested resource, with nearly 1 unique value per resource. That means around 300 million unique values per shard, with 1.2 billion in the current 4 shard index and an estimated 7 billion in the final 24 shard index.

Normally it would seem rather unrealistic to facet on 300M+ documents with nearly as many unique values with 8GB of heap (the allocation for each Solr instance), but there are several things that helps us here:

  • The URL-field is single value, meaning a smaller and faster internal faceting structure
  • Each index is single segment, so no need to maintain a global-ords-mapping structure, fully skipping this normally costly memory overhead
  • DocValues works really well with high-cardinality fields, meaning low memory overhead

For this experiment we switched back to single threaded requests, but added faceting on the URL field. To make this slightly more representative of the expected final setup we also lowered the amount of RAM to 80GB. This left 40GB- for disk caching of the 3.6TB index data, or about 1%.

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr faceting on URL

500-1200ms for a search on 3.6TB with very high-cardinality faceting. Nice. But, how come the response time never gets below 500ms? This is due to a technicality in Solr faceting where it iterates counters for all potential facet terms (1.2 billion in this case), not just the ones that are actually updated. A more thorough explanation as well as a solution can be found in the blog post on Sparse Faceting. Let’s see a graph with both Solr standard and sparse faceting:

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

Or viewed as a Box and Whiskers plot, for sparse faceting only:

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

A quite peculiar looking development of response times. Still, looking at the whale graph at the beginning of this post, they do seem somewhat familiar. This is definitely an experiment that could benefit from a re-run with more sample points. Anyway, notice how even searches with 10M hits are faceted in less than 800ms.


So far our setup for search in the Danish web archive looks very promising. We have showed that searching with faceting on very high-cardinality fields can be achieved with acceptable (< 1 second in our case) response times on relatively cheap hardware. We will continue testing as the index grows and adjust our hardware, should it prove inadequate for the full corpus.

Sparse facet counting without the downsides

April 4, 2014 by

The SOLR-5894 issue “Speed up high-cardinality facets with sparse counters” is close to being functionally complete (facet.method=fcs and facet.sort=index still pending). This post explains the different tricks used in the implementation and their impact on performance.


Most of the different Solr faceting methods (fc & fcs; with and without doc-values; single- and multi-value) uses the same overall principle for counting tag occurrences in facets:

  1. Allocate one or more counters of total size #unique_tags
  2. Fill the counters by iterating a hit queue (normally a bitmap) and getting corresponding counter indexes from a mapper
  3. Extract top-x tags with highest count by iterating all counters

There are 3 problems with this 3 step process: Allocation of a (potentially large) structure from memory, iteration of a bitmap with #total_documents entries and iteration of a counter with #unique_tags. Ideally this would be no allocation, iteration of just the IDs of the matched documents and iteration of just the tag counters that were updated. Sparse facet counting solves 2 out of the 3 problems.


In this context sparse is seen as performance-enhancing, not space-reducing. SOLR-5894 solves the extraction time problem by keeping track of which counters are updated. With this information, the extraction process no longer needs to visit all counters.  A detailed explanation can be found at fast-faceting-with-high-cardinality-and-small-result-set. However, there are some peculiarities to sparse tracking that must be considered.

Processing overhead

Naive sparse faceting

The black line is Solr field faceting on a multi-valued field (3 values/document), the red line is the sparse implementation on the same field. When the result set is small, sparse processing time is markedly lower than standard, but when the result set is > 10% of all documents, it becomes slower. When the result set reaches 50%, sparse takes twice as long as standard.

This makes sense when one consider that both updating and extraction of a single counter has more processing overhead for sparse: When the number of hits rises, the accumulated overhead gets bad.

Maximum sparse size

Okay, so tracking does not make much sense past a given point. Besides, having a tracker the size of the counters themselves (100% overhead) seems a bit wasteful. Fixing the tracker size to the cross-over-point is the way to go. We choose 8% here. Thanks to the beauty of the tracking mechanism, exceeding the tracker capacity does not invalidate the collected results, it just means a logical switch to non-track-mode.

8% tracker

No doubt about where the sparse counter switches to non-sparse mode. Note how the distance from Solr standard (black line) to sparse with tracker-overflow (red line past 8%) is near-constant: Up until 8% there is an overhead for updating the tracker. When the tracker has overflowed that overhead disappears for the rest of the counter updates, but the cycles used for tracking up to that point are wasted.

Selection strategy

So memory overhead was reduced to 8% and performance was better for the very high hit counts, but still quite a bit worse than Solr standard. If only we could foresee if the sparse tracker would be overflowed or not.

We cannot determine 100% whether the tracker will be blown or not (at least not in the general case), but we can guess. Under the assumption that the references from documents to tags are fairly uniformly distributed, we can use the hit count (which we know when we start facet calculation) to guess whether the number of updated tag counts will exceed the tracker capacity.

Sparse bad guesses for cut-off

The chart demonstrates how bad guessing of the result size affects performance. The conservative guess (red line) means that many of the faceting calls are done by falling back to standard Solr and that the sparse speed-up is wasted. The optimistic guess (cyan line) has a higher risk of failed sparse-attempts, which means bad performance. In this example, the bad guess was around 10%. Still, even with such hiccups, the overall positive effect of using sparse counting is clear.

Good guessing

The best cut-off point for sparse/non-sparse attempt depends on the corpus and the searches, as well as the willingness to risk increased response times. Properly tuned and with a corpus without extreme outliers (such as a single very popular document referencing 10% of all tags), the result will be very satisfying.

Sparse good guess

For the low price of 8% memory overhead we get much better performance for small result sets and no penalty for larger result sets (under the assumption of correct guessing).

Counter allocation

Doing a little instrumentation it becomes clear that it is by no means free just to allocate a new counter structure with each facet call and throw it away after use. In the example above, 5M*3*4byte = 60MB are used for a counter. With a 2GB heap and otherwise non-tweaked execution of Solr’s start.jar, the average time used to allocate the 60MB was 13ms!

An alternative strategy is to keep a pool of counters and re-use them. This means that counters must be cleared after use, but this is markedly faster than allocating new ones. Furthermore this can be done by a background thread, so that the client can get the response immediately after the extraction phase. Enabling this, the picture gets even better.

Sparse everything

For very small result sets there is virtually no performance penalty for faceting.

Sparse facet counting on a web archive

March 20, 2014 by

This post is a folow-up to Sparse facet counting on a real index. Where the last post explored using a sparse counter for faceting on author on Statsbibliotekets index of library material, this post will focus on faceting on url for our test index of archived web pages.

The index and the url field

The test index is 1TB (or 959 GB to be precise) with 420 million documents. The documents are made from harvested web pages and linked files of all types. Tika was used for the extraction of meta data. All documents has exactly one URL. Some documents are harvested multiple times, so there are only 315 million unique URLs instead of 420 million. The most popular URL occurs 3303 times . The index is optimized down to a single segment, but this has very little impact on tag collection speed.

Sparse counting

As described in Fast faceting with high cardinality and small result set, adding an overhead of 1/40 to the counting structure allows for faster counting with small result sets and graceful fallback to larger result sets. Using 32 bit integers as counters, this means 4*315 Mbyte for the counters themselved + 7 MByte for the sparse tracker. Including fluff, this turns out to be 1228 MByte for our test corpus.

Sparse packed counting

But wait! We know that there are at most 3303 occurrences of any single tag, so why use 32 bit integers as counters? 2^12 = 4096: We only need 12 bits for each. PackedInts to the rescue and we only need 506 MByte for a full counter structure. Less than half of the non-packed version.

Testing methodology

As this test is about the difference between non-sparse and sparse facet counting, the test was executed twice and only the results from the second run were collected. This ensured that the I/O cache was warmed so that the impact of the term-matching searches and document retrieval was minimized. As we have no logged queries, random words from the Danish iSpell dictionary were used. Only the results with hits were collected. Queries were issued sequentially with a 10ms wait between each. Time are full response times, measured through JMeter. In general, each query rarely returned more than 100 hits (which shows that we need better queries than random words).


Exposed faceting on 1TB web archive

The very low spread is due to the searcher being fully warmed for the used queries, which also explain the 5 ms response times for non-faceted queries. Plain exposed faceting takes half a second, because all 315 million counters are iterated for each extraction. On the other hand, the small result sets means that sparse and packed only needs to process a few hundred entries. Notice that the performance of packed is practically identical to sparse.

Should anyone wonder: Unwarmed non-faceted searches for one or two random words varied between 10 and 1000 ms for this index.

Hold your horses

This all looks fantastic, but do remember that the hits/corpus_size ratio is extreme here. Our previous test with our library index is a better indicator of what can be expected and that “only” showed double the speed when using sparse.

Sparse facet counting on a real index

March 18, 2014 by

It was time for a little (nearly) real-world testing of a sparse facet counter for Lucene/solr (see Fast faceting with high cardinality and small result set for details). The first results are both very promising and somewhat puzzling.

The corpus was a copy of our production 50GB, 11M document index with library resources. The queries were taken randomly from the daily query log. Faceting was limited to just the author field, which has 7M unique values. The searcher was warmed up with hundreds of queries before testing. The tester ran with 2 threads against a 4 core machine and 500 searches were performed for each implementation.

Solr vs. exposed vs. sparse

In the graph, solr is standard Solr field faceting, exposed is our home brew (SOLR-2412) and sparse is our experimental home brew with sparse counting for small result sets. The red horizontal lines represents quartiles, with the max being replaced with the 95% for better graph scale. The horizontal black lines are medians.

The promising part is that the sparse counter has a much better median (16ms) than both solr (32ms) and exposed (29ms). Looking at the returned results, it seems clear that the vast majority of the queries only hits a fairly small part of the index, which benefits the sparse implementation. As they are real world queries, this is good news.

The performance of Solr field faceting and exposed is fairly similar, which is not very surprising as they work quite alike in this scenario. What is puzzling is that the maximum response time for both exposed and sparse is higher than solr‘s. The slowest response times not shown are 458ms for “have” with solr, 780ms for “to be or not to be” with exposed and 570ms for “new perspectives on native north america” with sparse. More testing is needed to determine if these are fluke results or if there is a general problem with worse outliers for exposed and sparse.

Update 20140320

Randomizing queries make for great experimentation but poor comparisons of implementations. Fixing the order and number of queries tested (29086, should anyone wonder) resulted in measurements without the strange outliers. The measurements were done in order exposed, sparse, packed, solr and nofacet. Maximum response time were a little above 2 seconds for all the facet calls and in all cases caused by the query ‘a‘.

Sparce faceting, fixed query order and count

Update 20140321

Introducing the JIRA issue SOLR-5894, with an accompanying patch for Solr 4.6.1. The patch only handles field cache based faceting on multi-valued fields right now, but that limitation was mainly to keep things simple. The sparse code is non-invasive and fits well within the way Solr performs field faceting. A quick experiment with 1852 fully warmed queries gave this:


Update 20140324

Whoops. Forgot to include the baseline no-facets numbers. This changes the picture quite a bit. With a baseline median of 12 ms, sparse faceting overhead is only (24 ms – 12 ms) 12 ms and non-sparse is (36 ms – 12 ms) = 24 ms, which suspiciously (I triple checked the numbers) makes the sparse faceting overhead exactly half of non-sparse.


Fast faceting with high cardinality and small result set

March 17, 2014 by

This is a follow-up to the idea presented  more than a year ago at http://sbdevel.wordpress.com/2013/01/23/large-facet-corpus-small-result-set/. It can be read independently of the old post.

The premise is simple: We have a Lucene/Solr index and we want to do some faceting. The facet field has high cardinality, which is a fancy way of saying that it has a lot of unique values. “A lot” is tens or hundreds of millions.

Old style counting

Ignoring all the little detail devils, getting the top 10 tags in a facet (sorting by count) is normally done like this:

// Init
counts = [#unique_tags]
// Search & collect
for each docID in hits {
  for each tagID in docID {
// Extract
topTen = []
for tagID 0 ... #unique_tags {
  if (counts[tagID] > min(topTen)) {
   remove min(topTen) from topTen
   insert (tagID, counts[tagID]) in topTen
// Result
create result from topTen
// Cleanup
for tagID 0 ... #unique_tags {
  counts[tagID] = 0

The init-part and the cleanup-part differs between implementations. Solr lets the JVM handle it by allocating new structures in the init-phase and dereferencing it in the cleanup-phase so that the garbage collector takes it. Our home brew SOLR-2412 caches the counters, which requires a cleanup after each run but has very stable memory and GC impact.

Notice how the extraction-phase and the cleanup-phase iterates all the tagIDs for the whole corpus, even if the result set itself is tiny? That is not very efficient. If we knew the number of unique tags for the result set in advance we could select between e.e. the big counter array and a small hash set for keeping track of the tags, but we do not have that luxury.

Track the counters

With Using Uninitialized Memory for Fun and Profit in mind, we create a new counter that is efficient for small result sets and with little speed penalty for large result sets. It is not too complicated:

TagCounterSparse {
  counts = [#unique_tags]
  updates = [#unique_tags / fraction]
  updatePointer = 0

  collect(tagID) {
    if (counts[tagID]++ == 0 && updatePointer != updates.length) {
      updates[updatePointer++] = tagID

  extract {
    topTen = []
    if (updatePointer != updates.length) {
      for each tagID in updates {
        if (counts[tagID] > min(topTen)) {
          remove min(topTen) from topTen
          insert (tagID, counts[tagID]) in topTen 
    } else {
      for each tagID in counts {
        if (counts[tagID] > min(topTen)) {
          remove min(topTen) from topTen
            insert (tagID, counts[tagID]) in topTen

  clear {
    if (updatePointer != updates.length) {
      for each tagID in updates {
        counts[tagID] = 0
    } else {
      for each tagID in counts {
        counts[tagID] = 0
    updatePointer = 0

What happens here is that a counter-tracker updates is introduced. When the count for a previously unseen tagID is increased, the tracker stores that tagID. If too many unique tagIDs are added, the tracker stops keeping track.

Extraction of top ten tags can be done in two ways. If there were too many unique tags in the result set, they are extracted exactly like the old implementation. If the result set was small enough, only the counts for the collected tagIDs are accessed. For really small result sets, this is blazingly fast.

Clearing is similar to extraction. Either it happens the old way or only the collected tagIDs are touched.

Considerations and gotchas

The sparse tag counter adds another layer of indirection, so if the result set is the whole corpus and if the updates is the same size as counts, all the phases will be slower than the old solution. We need to find out how large a part of counts we should keep track of. This is the fraction.

Another consideration is that the old iterate-all-tagIDs had the very nice property of accessing memory sequentially. The sparse solution is random access for each collected tagID, which is a lot slower. This heavily influences what the fraction should be.

Measure twice

An artificial Solr index was created. It had 20M documents and a single-value field with 20M unique values. Searches were performed with result sets consisting of every 2nd document, every 5th document, every 10th and so on. For each search the faceting time spend on collecting, extracting and clearing was measured. First results for each unique search were discarded and all searches repeated 5 times with the best results being extracted. All times are in milliseconds.

Old implementation

Every  Collect  Extract  Clear  Total
    2      527      153     12    692
    5      215       84     12    311
   10      112       56     12    180
   20       62       40     12    114
   30       44       33     12     89
   40       36       31     11     78
   50       29       30     12     71
  100       15       27     12     54
  200        8       25     12     45
  500        4       24     12     40
 1000        2       24     12     38
 5000        1       23     12     36

Notice how the extract time converges towards 20ms and the clear time is constant.

Sparse (#updates = #counts)

Every  Collect  Extract  Clear  Total
    2      544      453    160  1157
    5      221      183     63   467
   10      114       92     32   238
   20       63       46     16   125
   30       46       31     10    87
   40       38       23      8    69
   50       30       19      6    55
  100       15       10      3    28
  200        9        5      1    15
  500        3        1      0     4
 1000        1        0      0     1
 5000        1        0      0     1

Notice that the collect time is only a little worse than the old collect time, that the extract time is markedly better when the result set is every 40th document or less and that clear is also markedly faster for every 40th document or less.

This suggests that fraction should be 40. For this corpus, which is artificial. Your mileage will most certainly wary, but for this test we set it to 40.

sparse (#updates = #counts/40)

Every  Collect  Extract  Clear  Total
    2      545      151     12    708
    5      227       86     13    326
   10      118       56     13    187
   20       65       40     12    117
   30       47       35     13     95
   40       40       24      8     72
   50       31       19      6     56
  100       18       10      3     31
  200        9        5      2     16
  500        2        2      0      4
 1000        1        0      0      1
 5000        0        0      0      0

Notice how extraction and clear time is the same as for the old implementation for results sets with every 2th, 5th, 10th, 20th and 30th document. After that, extraction and clear times matches the sparse with #updates = #counts. It is the best of both worlds.


For this test, using a sparse tag collector, taking 1/40 more RAM than a standard collector, results in increasingly better relative performance the smaller the result set gets. The speed converges towards instant response. This comes at the price of slightly worse performance for large result sets.


Right now this is just a proof of concept that resides in our local Summa-project (look for TestSingleSegmentOptimization.java). It needs to be tested on real world corpora and with logged queries before any real patches for Solr is to be considered. It will end up on SOLR-2412 though, when I find the time to upgrade it to Lucene/Solr 4.7.

SCAPE All-Staff Meeting 2014 in Póvoa de Varzim, Portugal

February 21, 2014 by

People in our own department at the State and University Library have complained, that we don’t talk enough, tell enough, write enough about this SCAPE project, we are working on, so here goes. I won’t tell you about the SCAPE project from the beginning. SCAPE stands for SCAlable Preservation Environments, and it has a project website, where you can read newsletters and deliverables and about the tools used and developed in SCAPE. What I will tell you about is the All-Staff Meeting 2014 in Póvoa de Varzim, Portugal. We are four SCAPE’rs from SB at this meeting. We had a long trip, that is over 10 hours, and we arrived after midnight Sunday, so I guess that would be Monday morning. The meeting started at 9 with General Session I: Welcome & Reports, which was more interesting than it sounds. It included introduction of new partners and status reports from the different subprojects and the TC. After the coffee break, we had the General Session II: Product Elevator Pitches. There were some really cool talks about the SCAPE CloudManager, about Sharing SCAPE software through virtualisation (using Vagrant) and a lot more. My favourite was of course Asgers talk about SB Repository Integration:


And now I’ll try to speed things up a bit. We were staying at the Axis Vermar Conference & Beach Hotel. There were good lunches and a beach with magnificent waves, and a lot of parallel meetings with a lot of product presentations, some discussions and some road maps. After lunch I went to the Testbeds meeting, which concluded that all the Testbed partners have

  • Hadoop test platforms
  • Large scale test data sets (at least 1TB)
  • Test workflows

Coming up is

  • Finish development of workflows (preferably yesterday)
  • Performing experiments and evaluations
  • Writing blog posts and deliverables
  • And then there are the on-site demos to be announced in April and held in May

On Tuesday we had General Session III: Productisation. In this session I was involved in the xcorrSound discussion. We mostly discussed “Entry Points”, such as microsites, Github source, SCAPE-Project website, blogs and demo pages. We discussed the type/level of information they contain (i.e. what information different users are interested in), and especially that some people may be search via use case and want to know what tools will be useful rather than search for tools via their functionality. We decided to use microsites as central point of contact/information about a tool, and we should ensure microsites have a consistent look and feel. And we decided contact information is important, and I am now listed as maintainer of xcorrSound (already were in some places; will update others). We also talked about improvements and extensions to the xcorrSound demo page and a reference installation. After this I was at the PW (Planning and Watch) meeting. Then the PC.QA (Preservation Components . Quality Assurance) meeting. My work in this work package is also centered around xcorrSound. We mostly talked about the road plan, which looks roughly like this:

  • Checkpoint: Implementation of QA workflows (Hadoop version; done)
  • Checkpoint: Implementation and release packaging (Tool packaging; done)
  • Milestone: Release 3 benchmarking and validation tests (Todo)
  • Deliverable: Knowledge base of characterisation and migration errors (KB)
  • Deliverable: Quality Assurance Workflow, Release 3 + Release Report (Todo)

After this we had Product Forum I – Scalability with more interesting presentations. And then the SB group went to a nice fish restaurant by the beach. Wednesday morning there were more meetings, but none in which I was required, so I went for a walk on the beach and collected seashells (and got sand in my shoes). After lunch, we had the Product Forum II – Tools session, in which I had a presentation/demo on xcorrSound.


This will also be included in the upcoming in house demos at SB. The last session was “Reporting back and closure” and everybody was tired. The most interesting for me was from the TCC (Technical Coordination Committee) meeting: there will be a white paper on how we get from preservation plan to execution environment in SCAPE (some time). And there will be a SCAPE workshop and all-staff event in connection with the DL2014 conference in London in September. And now I better pack and get ready for going home. We are leaving at 10am Thursday, and we will probably be home around 1am Friday…

Watch the watchers when using AngularJS

February 12, 2014 by

The last couple of months we have been improving and porting our library search interface frontend to AngularJS. However great Angular’s two-way bindings are they can greatly impact performance and lately I have been looking into how we use them in the beta version of our search front end which is our first big AngularJS web app.

In this process I used the Batarang debug tool in Chrome to see what was going on in terms of registered watchers and performance and it quickly pointed out two potential problems.

1. Our translation plugin generates tons of watchers

[NOTE (2014-06-17): section below is outdated as it applied to angular-translate v. 1.1.0. The current version (2.2.0) supports unregistering of watchers out of the box when 'translate' is used as a directive]

First of it was noticeable that a large chunk of the registered watchers where attached to our translation plugin Angular-translate. Every time we use the translate directive a watcher is attached and with hundreds of bits of translated text on a page the watcher count quickly climbs. This behavior is per design as it is possible to change the language run-time. In our case we do a page refresh when the language is toggled and very few labels are toggled run-time so this seemed like a good place to optimize.

As the current version of Angular-translate does not have native support for unregistering watchers I looked at solutions like Bindonce.

Bindonce provides a set of attributes that allows the initial binding and after that your variable is “dead” and will not update on model change. Initial testing in Batarang with the Bindonce approach of course showed a significant decrease of watchers and thereby an increase in performance and best of all the app visually behaved exactly the same as before. Only drawback with the Bindonce plugin is that the templates need to be rewritten and the new code is not as clean is the old.
An example of the current approach (translate used as directive):

<div translate=’TRANSLATION.KEY’></div>

New Bindonce approach (translate used as filter):

<div bindonce bo-text=” ’TRANSLATION.KEY’ | translate ”></div>

Although this solves the performance issues we have with the translate module I would rather see a ‘stop watching’ functionality built into the plugin. Fortunately a lot of work is currently being put into this issue and it seems that the next release of angular-translate (1.2.0) will address this.

2. Unnecessary use of Angular watchers
Next performance issue related to watchers was our use of databindings in templates. Every Time you write {{foo}} you create a two-way binding in AngularJS and ‘foo’ is therefore watched. This is of course one of the core functionalities of the framework but you need to be aware that the watchers come with a performance penalty especially when the number of watchers grow. Every time $apply is called directly or implicitly it will force a $digest cycle and the watch-expressions are evaluated.

In our app the Batarang tool revealed that besides the translation issues a lot of watchers were registered to links in our faceting functionality. Every facet contains a link where the href is generated in the following way in the template:

<a href=’{{searchURL}}?query={{query}}{{currentFilter}}&filter={{facet.facetname}}:{{tag.name}}’>{{foo}}</a>

Each link has several data-bindings through {{}} and we have a lot of these links on the page. That is a lot of watchers. As these links do not change unless the template is run again they do not need to be watched and there would be a performance gain by creating these links without the watcher overhead. One way to do it would be to use a directive instead to generate the href:
In template:

<a href=”” facet>{{foo}}</a>


.directive(‘facet’, [ function() {
return {
link: function(scope, elem, attr) {
var  href = /**Do your magic here**/
attr.$set('href', href);
} }

This significantly cuts down the amount of watcher expressions.

Another way around it would be to utilise the Bindonce plugin and do something like this:

<a bindonce bo-href=”searchURL + ‘?query=’ + query + currentFilter + ‘&filter=’ + facet.facetname + ‘:’ + tag.name“>{{foo}}</a>

This will give you zero watchers attached to the link. Not a particularly clean solution but a very nice and “free” performance enhancement as the watchers aren’t needed in this case. We could even optimize further by getting rid of the {{foo}} watcher by converting it to a Bindonce text attribute:

<a bindonce bo-href=”searchURL + ‘?query=’ + query + currentFilter + ‘&filter=’ + facet.facetname + ‘:’ + tag.name“ bo-text=”foo”></a>

As we dig deeper in the app I’m sure that even more unused watchers will turn up and be dealt with.

I know that there will be other even smarter approaches to the above examples, other plugins you can use to deal with watchers or you could even brew your own directives to handle it but the main point remains intact. Watch the watches and this also means investigating what your plugins are up to. Batarang is a good tool for this. Especially as a novice in AngularJS you have to consider how and when to use the powerful two-way binding options. Don’t be afraid of them just use with care. Don’t let them pile up and only use them when required. If used excessively it can ruin performance and render your Angular app very sluggish on less powerful clients. Here we are certainly learning this the hard way as we build our first large Angular app.


Get every new post delivered to your Inbox.