Alternative counter tracking

April 30, 2015 by

Warning: Bit-fiddling ahead.

The initial driver for implementing Sparse Faceting was to have extraction-time scale with the result set size, instead of with the total number of unique values in the index. From a performance point of view, this works very well in the current implementation: A side car structure keeps track of which counters has been updated. It is a simple array of pointers (integers really) and sampling indicates that a size of about 8% of the total amount of counters works okay.

So 8%, expressed as Java integers? That is 4*0.08*#values bytes.

Less abstract, we have a use case with 600M values/shard. Rounding each counter up to nearest power of 2, the theoretical lower bound of memory needed for representing the counters is all the bits that can change during updates. For a sample shard that is about 140MB.

N-plane counters for the same shard take up 144MB + 157MB, where the first 144MB are shared between all counter instances. So they are pretty close to the theoretical lower bound – at least the extra instances are. Note: They can be forced to be practically at the lower bound, but that impacts performance.

Back to the tracker, because N-plane needs one to scale its performance with result set size. Tracker size was 4*0.08*#values bytes, which for the test shard is 192MB. For those extra counter instances, the tracker ends up using more memory than the counters it tracks.

What we need is something better to efficiently keep track of the touched values, where efficient is measured both in time and space. In fancy talk, that is a succinct data structure.

Implicit markers

With n-plane counters, each bit of a given counter is stored on a separate plane (at least conceptually): If a counter has a max of 5, it needs 3 bits to store the count and is thus spread across the lower 3 planes.

Suppose we treated the lowest plane separately from the rest of the planes: Instead of supplying the bit at position 0 for the usual binary representation, it simply states if 1 should be added to the number or not. Confused? Let’s represent the numbers from 0 to 8 in standard binary as well as our new system:

Decimal  0  1   2    3    4     5     6     7     8
Binary   0  1  10   11  100   101   110   111  1000  
Binary+  0  1  11  101  111  1001  1011  1101  1111

Note that the representation is not quite as compact as true binary; e.g. the tipping point for going from 3 to 4 planes is from 7 to 8 for binary, but from 4 to 5 for binary+. Loosely estimated, it will cost 1 extra bit/value to use the new representation. However, there is zero extra cost for counters with maximum value 1. It turns out that most of the outgoing links in our net archive corpus are unique, so about 400M of the 600M satisfy this. So the extra memory cost of the extended binary will be 200M / 8 bits/byte = 25MB.
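
For illustration, the encoding can be expressed in a few lines of Java. This is a made-up helper, not part of the Sparse Faceting code: the lowest bit doubles as a not-zero flag and the remaining bits hold value-1 in standard binary.

// Hypothetical helper illustrating the binary+ representation from the table above.
public class BinaryPlus {
  public static long encode(long value) {   // 0 -> 0, n -> binary(n-1) with a 1 appended
    return value == 0 ? 0 : ((value - 1) << 1) | 1;
  }
  public static long decode(long encoded) {
    return encoded == 0 ? 0 : (encoded >>> 1) + 1;
  }
  public static void main(String[] args) {  // prints the Decimal/Binary+ table
    for (long v = 0; v <= 8; v++) {
      System.out.println(v + " -> " + Long.toBinaryString(encode(v)));
    }
  }
}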

Having a not-zero flag solves a performance issue with the current n-plane counters, but it also opens up for…

Bitmap based sparse tracking

The goal here is to quickly track & locate the counters that have a non-zero value. Under the hood, a plane in an n-plane counter is just an array of longs. With implicit markers, checking whether there are one or more counters with non-zero values in a group of 64 counters is a simple matter of checking whether a long is equal to 0.

With our 600M counters, that means about 10M checks, reading 75MB data sequentially from heap. Not bad, but still a substantial fixed overhead.

So what if we create a bitmap with a single bit representing each long in the backing structure of the first plane? If the backing long has 0 updated counters, the bitmap entry is 0; for everything else it is 1. That would require #counters/64 bits or about 1.2MB. Locating the non-zero counters then becomes a matter of reading roughly 146K longs (the 1.2MB bitmap) sequentially and checking for 0. Now the fixed overhead seems manageable.

The bitmap would of course have to be maintained during updates, introducing an extra get-set to the act of incrementing a counter. However, the old tracker used a single set (and simpler logic), so the difference is not that great.
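
A sketch of the idea, assuming the first plane is backed by a plain long[] and the bitmap holds one bit per backing long. Class and method names are illustrative, not the actual implementation.

// Tracks which groups of 64 counters contain at least one non-zero counter.
// firstPlane: backing longs of plane 0, i.e. the not-zero flags, 1 bit/counter.
// bitmap: 1 bit per long in firstPlane, maintained as part of counter updates.
public class TouchedTracker {
  private final long[] firstPlane;
  private final long[] bitmap;

  public TouchedTracker(long[] firstPlane) {
    this.firstPlane = firstPlane;
    this.bitmap = new long[(firstPlane.length >>> 6) + 1];
  }

  // The extra get-set paid when counter 'ordinal' is incremented.
  public void markTouched(int ordinal) {
    int planeIndex = ordinal >>> 6;                    // which long in plane 0
    bitmap[planeIndex >>> 6] |= 1L << (planeIndex & 63);
  }

  // Visit all counters with a non-zero value, in sequential order.
  public void visitTouched() {
    for (int b = 0; b < bitmap.length; b++) {
      long bits = bitmap[b];
      while (bits != 0) {
        int planeIndex = (b << 6) + Long.numberOfTrailingZeros(bits);
        long flags = firstPlane[planeIndex];           // 64 not-zero flags
        while (flags != 0) {
          visit((planeIndex << 6) + Long.numberOfTrailingZeros(flags));
          flags &= flags - 1;                          // clear lowest set bit
        }
        bits &= bits - 1;
      }
    }
  }

  private void visit(int ordinal) {
    // resolve and report the touched counter here
  }
}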

New idea vs. tried and true

Memory-wise, the new tracker would need about 26.2MB for 600M counters, whereas the old one needs 192MB. An obvious improvement.

For very low hit counts, the new tracker would lose: A fixed overhead of about 146K checks is still more work than loading a few values directly from an array.

The complexity of the new tracker is far higher, but it might turn out to be fastest in most cases, as iteration of the tracked counters is guaranteed to be in sequential order, whereas the existing tracker delivers them in random order.

Measuring N-plane time & space

April 30, 2015 by

This article explores the boundaries of the experimental sparse faceting code, both in terms of processing time and in terms of space requirements. The code base has just been updated and the new features are available as a Solr 4.8 compatible binary download, as well as source.

Faceting overview (you can skip this)

The core performance-critical part of sparse faceting as well as vanilla String fc-faceting for Solr can be divided in three phases:

  1. Count the occurrences of the terms in the facet field
  2. Find the X highest counters
  3. Resolve the Strings for the counters

Sparse faceting started with the idea of improving on #2 by only considering the counters that were actually touched by #1. Later on it improved on #3 when running distributed in SolrCloud. Now #1 is being handled by introducing multi-threaded counting.

Memory-wise there are three main faceting structures:

  1. A map from documents to term ordinals
  2. A counter structure, with 1 counter per unique term
  3. Structures for resolving ordinals to Strings

Memory overhead is very dependent on how the fields are indexed in Solr. The current recommendation for faceting is to index in a StrField (which stores the value verbatim) as well as enabling DocValues, which lowers memory requirements considerably. There is still a base cost of using DocValues (personal rule of thumb: 1 byte/String) with the current default DocValuesFormat, but that is allegedly hard to avoid without compromising performance significantly.

#1: In the usual case, a Solr index consists of multiple segments. In order to do faceting across these segments, a mapping from segment-ordinals to global ordinals is created. This scales linearly with the number of terms in all the segments. In the special case of fully optimized shards, there is no need for this mapping and thus the memory overhead of #1 will be zero. The Danish Net Archive Search uses fully optimized shards.

#2: Vanilla Solr uses a simple int[]-array to hold the counters: Great for low- to medium-cardinality faceting, but quite costly (2.4GB per concurrent call) if you want to facet on 600M unique values. With n-plane counters for sparse faceting, the density of the counting structures is twice the theoretical lower limit for a single counter (pedantic note: It really is a bit more than twice the theoretical lower limit, as we round up to nearest power of two – when somebody implements an efficient arithmetic coding counter, give me a shout). For the shards in the Danish Net Archive, this is about 300MB. Extra counter instances, used for concurrent searches, take up only the theoretical lower limit, or about 150MB.
To get fast calculation of the top-X counters after the counting step, an extra structure is needed, which brings the total overhead up to about 3-4 times the theoretical lower limit. This can be disabled, but with cardinality in the hundreds of millions, this will impose a very heavy performance penalty.

#3: Having paid the initial cost of using DocValues, the resolving of ordinals to Strings is practically free in terms of memory. With non-DocValued fields, some of the Strings are stored in memory to ensure performant lookups.

Multi threaded faceting

Threading aims to lower latency (response time). It only works when there is a surplus of CPU power, so the typical scenario is setups with heavy but infrequent searches. Infrequent just means that multiple concurrent searches are rare. Vanilla Solr supports multi threaded faceting in two ways:

  1. Multiple facet fields can be processed in parallel.
  2. Using facet.method=fcs, multiple segments can be processed in parallel. However, this comes with a non-trivial merge cost that involves resolving of extra Strings.

Recently in-segment threading was added to sparse faceting. Technically this also covers #2 above, with the added benefit of not needing the merge step.

The tricky part of in-segment threaded counting was to get fast thread-safe updates of the packed counter structures, but for once the sheer number of counters helps: The counter updates are lock-free and opportunistic, using AtomicLongArray's compareAndSet, which is nearly as fast as plain array updates. In the rare case of collisions, the thread waits the smallest possible amount of time before trying again.
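
A minimal sketch of the opportunistic principle, simplified to one counter per long for readability (the real structures pack several counters into each long):

import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.locks.LockSupport;

// Lock-free opportunistic increment: read, attempt compareAndSet, retry on collision.
public class OpportunisticCounters {
  private final AtomicLongArray counters;

  public OpportunisticCounters(int size) {
    counters = new AtomicLongArray(size);
  }

  public void increment(int ordinal) {
    while (true) {
      long expected = counters.get(ordinal);
      if (counters.compareAndSet(ordinal, expected, expected + 1)) {
        return;                  // the common case: no collision with other threads
      }
      LockSupport.parkNanos(1);  // rare collision: back off minimally and retry
    }
  }
}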

Counters

The two term occurrence counters relevant to this experiment are:

  1. n-plane, which is small but uses complex logic and requires multiple random memory accesses (an average of 2 reads + 2 writes) when a counter is incremented. Its size, including the sparse structure, is 500MB for this experiment.
  2. packed, which is a variant of Packed64Singleblock using an AtomicLongArray as backing store. It has relatively simple logic and requires only 1 read and 1 write when a counter is incremented. It saves a bit of space compared to the 2.4GB for vanilla Solr int[], but is three times larger (1.5GB) than n-plane.

Test setup

As we have recently started re-indexing everything in the Danish Net Archive (and since yours truly was a bit too quick on the fingers and accidentally deleted a couple of terabytes of perfectly fine shards), we only have a small sample corpus to test on. Hopefully it will be somewhat representative.

Recap: The Net Archive Search at Statsbiblioteket uses fully optimized shards of 900GB / 250M documents. One of the fields is outgoing links from webpages. Each shard has 600M unique values in the links-field, with 6 billion references from documents to values in that field.

Solr needs 2.2GB of heap just to open 1 such shard and perform a non-faceted search. With n-plane faceting, the minimum heap requirement rises to 3.3GB. In production we use a max heap size of 8GB as we have multiple facet fields and want to have wriggle room.

Due to the aforementioned circumstances, we currently only have 4 such shards in our SolrCloud to test on. They are handled by a 16 core Xeon machine with 256GB of RAM, so the hardware power vs. index size is very favourable, compared to the expected full setup with 23 shards.

Testing was done with random words from a Danish dictionary, one request at a time. The number of threads used for counting was varied between 1, 2 & 4.

Results

 

4 shards of 900GB / 250M docs with 600M unique values in the facet field, tested with 1, 2 & 4 counting threads

The same graph, with 80 seconds as max for the y-axis:

4 shards of 900GB / 250M docs with 600M unique values in the facet field, tested with 1, 2 & 4 counting threads

In the legends, _t1 means 1 thread, _t2 is two threads etc. Some observations:

  1. There is something strange going on with packed in the very low end (1-1000 hits): Responses should be near-instantaneous in that area. This should be investigated further.
  2. packed is markedly faster relative to n-plane, although the overall speed means that the absolute difference is only visible from 100K hits and up.
  3. Threading gives a significant boost from 10M hits and up.

Note that the graphs only cover hit counts up to 100M, although there are 1000M documents in the index. Searching for random words does not really give that many hits: Besides the catch-all-term http, the second most popular term og (Danish for and) only occurs in ⅓ of the corpus.

Getting above 100M hits out of the 1000M documents in the current corpus is certainly possible by issuing or-queries, range-queries or similar, but it is not something that is likely to happen very often during interactive use.

Nevertheless, it must be stressed that worst-case performance is poor. Measuring the catch-all *:*-query (1,066,091,283 hits), the timing was:

Threads N-plane ms Packed ms
1 587,202 310,249
2 321,569 210,712
4 201,532 123,263

For interactive use, random response times in minutes are unacceptable. Besides adding copious amounts of hardware, some ideas for handling that problem are

  • Start by doing a fast non-facet search to get hit count. If that number indicates long faceting time, show a warning to the user. Potentially this could be a rough countdown.
  • Add sampling functionality, where the sample-count only uses the first 1M hits. Note that the hits are not processed in order of relevance, so the sample will be made up of hits from the beginning of the overall index structure (most likely the oldest records).
    A more advanced sampler could advance through the result set to distribute the sampling points better.

The small number of shards means that there is a lot of CPU power free on the test machine. As it gets filled, there will be less surplus power and the effect of threading is likely to go down. Our setup might end up not getting any advantage out of threaded counting.

On the other hand, the low heap overhead of faceting with n-plane means that we will probably enable faceting on the 600M unique values field. It makes sense to pay the 1100MB heap overhead per shard to get fine-grained graph analysis; especially when the cost of additional counter instances is only about 350MB.

Facet filtering

April 10, 2015 by

In generation 2 of our net archive search we plan to experiment with real time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is with regular expressions. The main problem with that approach is that all the Strings for the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.* which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp-requirement are requested, chances are that millions of terms must be resolved to Strings and checked, before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher chance of wasp-related links and would thus require far less resolutions.
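
A sketch of the filtering principle with java.util.regex. The parameter semantics mirror facet.sparse.whitelist/blacklist, but the class itself is illustrative, not the actual patch.

import java.util.regex.Pattern;

// Decides whether a resolved facet term should be part of the result.
// whitelist == null means "allow everything", blacklist == null means "ban nothing".
public class FacetTermFilter {
  private final Pattern whitelist;
  private final Pattern blacklist;

  public FacetTermFilter(String whitelistRegexp, String blacklistRegexp) {
    whitelist = whitelistRegexp == null ? null : Pattern.compile(whitelistRegexp);
    blacklist = blacklistRegexp == null ? null : Pattern.compile(blacklistRegexp);
  }

  public boolean allows(String term) {
    return (whitelist == null || whitelist.matcher(term).matches())
        && (blacklist == null || !blacklist.matcher(term).matches());
  }
}

Candidate terms are resolved in order of decreasing count and passed through allows(term) until facet.limit allowed terms have been collected; with a rarely matching whitelist this is where the many extra resolutions come from.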

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
    q=gardening
  2. The most popular resources that pages on a given site link to externally
    q=domain:example.com&facet.sparse.blacklist=https?://[^/]*example\.com
  3. The most popular images that pages on a given site link to internally
    q=domain:example.com&facet.sparse.whitelist=https?://[^/]*example\.com/.*\.(gif|jpg|jpeg|png)$
  4. The most popular non-Danish resources that Danish pages link to
    q=domain_suffix:dk&facet.sparse.blacklist=https?://[^/]*\.dk
  5. The most popular JavaScripts that all pages from a given year link to
    q=harvest_year:2015&facet.sparse.whitelist=.*\.js$

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative for the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

900GB, 200M+ docs, 500M+ unique values, 5 billion+ references

Observations

  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.

Conclusion

Worst-case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As this is tightly correlated to the result set size, which is fast to calculate, adding a warning with “This query is likely to take minutes to process” could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead for regular expressions was less than feared. Real-time graphs or at least fill-the-coffee-cup-time graphs seems doable. At the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes 2015-04-11

@maxxkrakoa noted: “@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each.” He is right and it does help significantly for our 6-field faceting. For single field faceting, support for real multi-thread counting would be needed.

The simple way of doing multi-thread counting is to update multiple copies of the counter structure and merge them at the end. For a 500M+ field, that is likely to be prohibitive with regard to both memory and speed: The time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the “things to ponder when sleep is not coming” list.

Additional notes 2015-04-13

It seems that Java 1.7’s AtomicLongArray performs very well: Quite comparable to plain long[] in the context of multi-millions of counters, where contention is low. This raises the probability of implementing true threaded faceting markedly.

N-plane packed counters for faceting

March 13, 2015 by

Faceting in Solr works well out of the box up to some millions of unique values in the facet field. There is a small performance penalty, linear in the number of unique values, which begins to show after some point. Sparse faceting solves this and makes faceting performance linear in the result size. Sparse faceting also reduces memory requirements by packing the values tighter. As seen in the blog post Long tail packed counters for faceting, this packing can be improved, depending on the concrete term distribution in the facet field.

An implementation of Long tail packed counters has been created and renamed to Dual-plane counters. During development, a new idea sprung to life in the form of N-plane counters. In this blog post the different counters will be explained and performance measurements from a micro-benchmark (yes, we know, they are not very reliable) based on real-world data will be presented.

Quick re-cap from the long tail blog post:

  • We have hundreds of millions of counters, referenced by ordinal
  • Each counter corresponds to a unique term in a facet field
  • We know the maximum for each individual counter
  • The overall distribution of the maxima is long tail
  • We want the overall counter structure to be small
  • We want the overall counter structure to be fast

Instead of viewing the maxima as plain numbers, they can be viewed as required bits. If a counter has a maximum of 15, all possible values can be expressed with 4 bits, as 2⁴-1 = 15. Similarly a counter with a maximum of 100 needs 7 bits, as 2⁷-1 = 127. The collection of counters can be visualized as

Term counters as bits

The counter for term A needs 1 bit, term B needs 3 bits, term C needs 1 bit and term D needs 11 bits. If we sorted the terms according to maxima instead of alphanumeric, the distribution would look like

Shape of maxima sorted by maxima

However, as the order of the terms is dictated by Solr, sorting the terms for use with faceting would require a mapping from Solr’s term ordinals to the position in the sorted list. This option has not been explored yet, so for now we’ll stick with the original order.

Test data

We extracted the maxima for the 640M unique links in a test-shard from our net archive and grouped them by the number of bits needed for expressing those maxima. The long tail distribution is evident. The theoretically optimal worst-case representation of the counters is the collective number of bits needed (the white squares seen above). For the sample below that is 138MB.

Bits #terms
1 425,799,733
2 85,835,129
3 52,695,663
4 33,153,759
5 18,864,935
6 10,245,205
7 5,691,412
8 3,223,077
9 1,981,279
10 1,240,879
11 714,595
12 429,129
13 225,416
14 114,271
15 45,521
16 12,966
17 4,005
18 1,764
19 805
20 789
21 123
22 77
23 1

The contestants

Counter-representation with atomic numeric

Stock Solr: Counter-representation with atomic numeric (y goes to 32, only bit 1-16 shown)

Stock Solr faceting uses an int[] to hold the counts. It is simple and fast. In the visualization above, the white squares are bits holding counter values, while the red area represents bits that are always 0. There is a lot of red with this counter structure. The sample counters take up 2442MB with Stock Solr faceting. Pro: speed, con: space.

Sparse Faceting: Counter-representation with PackedInts


Sparse Faceting has the option of using PackedInts for counters. It lowers the overhead ceiling, but there is still a lot of wasted space. With Sparse Faceting, 1755MB is needed for the sample counters. Pro: speed, con: space + needs to know overall maximum.

Dual-plane: Counter-representation in low-bits with references to high-bit


Holding low counts in one structure and switching high-counts to another structure requires 1 extra bit/value for signalling which counters have a high value (see Long tail packed counters for faceting for details). Even with the extra bit, this gives an overall size reduction. Dual-plane uses 1221MB for the sample counters. Pro: speed, con: needs histogram of maxima for initialization.

N-plane: Horizontal slicing of counters


Extending Dual-plane to an arbitrary number of planes and replacing the explicit pointers with counting logic, N-plane worst-case representation takes up just 2x the size of the raw bits. In the illustration above, the blue boxes are bits representing overflow up to the next plane. The overflow bits are counted to calculate the offset in the higher planes.
To avoid horrible performance, a rank structure is needed, which adds a bit of overhead. Likewise, it might be advisable to limit the number of planes, to avoid too many random memory requests. The most promising space/performance setup for the sample data takes up 341MB for N-plane, with the smallest & slowest taking up 275MB. Pro: space, con: speed + needs all counter maxima for initialization.
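
The offset of a counter in the next plane is the rank of its overflow bit: the number of set overflow bits before its position. A sketch of a simple rank structure with a cached prefix count per block of longs (illustrative; the actual implementation and cache spacing differ):

// Rank = number of set overflow bits before a given bit position.
// A cached prefix count for every CACHE_WORDS longs keeps lookups fast;
// only the longs inside the current block need on-the-fly Long.bitCount.
public class SimpleRank {
  private static final int CACHE_WORDS = 16;  // 1 cache entry per 1024 bits
  private final long[] bits;
  private final long[] rankCache;             // prefix counts at block starts

  public SimpleRank(long[] bits) {
    this.bits = bits;
    rankCache = new long[bits.length / CACHE_WORDS + 1];
    long total = 0;
    for (int i = 0; i < bits.length; i++) {
      if (i % CACHE_WORDS == 0) {
        rankCache[i / CACHE_WORDS] = total;
      }
      total += Long.bitCount(bits[i]);
    }
  }

  // Number of set bits in [0, position). position must be < bits.length*64.
  public long rank(long position) {
    int word = (int) (position >>> 6);
    int block = word / CACHE_WORDS;
    long count = rankCache[block];
    for (int i = block * CACHE_WORDS; i < word; i++) {
      count += Long.bitCount(bits[i]);
    }
    return count + Long.bitCount(bits[word] & ((1L << (position & 63)) - 1));
  }
}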

Bonus: All the logistic parts of the structure (the blue squares and the rank-cache) are static and can thus be shared between counters. If there is more than one concurrent faceted search at a time, each extra counting structure only holds the theoretically smallest worst case of bits, which for this sample is 138MB.

Testing

The micro-benchmark creates 640M sample maxima, randomly generated to follow the bit-histogram from the links-field in our sample shard, and initializes the different counter structures based on this. A list of ordinals is created with random selection, checked to ensure that each unique ordinal only occurs a number of times that is within the corresponding counter’s maximum. Caveat: Random updates are not optimal as the distribution of updates is not random for real searches: It should follow the maxima-distribution. The updates are applied to each counter implementation and performance measured as the median from 9 test-runs.

N-plane is special as it can be configured in various ways. In the schema below, N-Plane(#4, 1/100) means that there are 4 planes and overflow-bits are cached for every 100 bits.

Median updates/ms of 640M counters with max bit 23, for different amounts of updates
Implementation MB 1M upd 5M upd 10M upd 50M upd 100M upd 500M upd
N-plane(#2, 1/1K) 718 20853 13340 13059 15752 12590 4000
N-plane(#4, 1/1K) 311 20887 13339 13058 15749 12554 3969
N-plane(#6, 1/1K) 275 13375 20129 19492 11187 9482 4451
N-plane(#2, 1/100) 745 20938 13548 13460 19146 17420 7674
N-plane(#4, 1/100) 341 20929 13526 13447 18874 17035 7787
N-plane(#6, 1/100) 306 20444 13317 13222 18948 17435 7212
N-plane(#2, 1/20) 865 14027 18900 18773 13261 12380 10512
N-plane(#4, 1/20) 473 13086 20533 20386 12388 11605 11554
N-plane(#6, 1/20) 440 13423 21050 20909 12769 12003 11936
Dual-plane 1221 18734 29952 29940 18797 18794 29911
PackedInts.COMPACT 1755 13518 15324 15346 13562 13565 15296
PackedInts.FAST 2442 17995 18955 19109 19123 19142 19233

Observations

  • Performance of N-plane gets worse with the number of updates, which is to be expected: As the counts get higher, it needs to visit more planes.
  • Dual-plane has suspicious values for 50M and 100M updates, which mirror suspicious values for COMPACT at the same amounts of updates. As COMPACT should have the same updates/ms performance independent of the number of updates, this indicates a testing problem.
  • PackedInts.FAST is a lot faster than PackedInts.COMPACT. This might be due to chance: The number of significant bits is 23, which aligns very badly with the 64 bit longs used by the COMPACT implementation. Had the number of significant bits been 21 or 24, things might have looked different.

Conclusion

The two new players Dual-plane and N-plane are both very interesting for high-cardinality faceting: Dual-plane as a direct replacement for PackedInts.COMPACT, N-plane as a very low-memory alternative when speed is not of the essence.

The drawback of Dual-plane is that it needs a histogram of maxima. Fortunately this is not costly: It is about the same work as is needed by PackedInts.COMPACT, which requires the overall maximum.

The drawback of N-plane is a lot higher as it needs the full maxima-set for initialization. Performance-wise this is not much of a change from the PackedInts.COMPACT maximum calculation, but it does require temporary memory on the order of 4*term_count bytes, which in this case is about 2.5GB.

Net archive indexing, round 2

March 9, 2015 by

Using our experience from our initial net archive search setup, Thomas Egense and I have been tweaking options and adding patches to the fine webarchive-discovery from UKWA for some weeks. We will be re-starting indexing Real Soon Now. So what have we learned?

  • Stored text takes up a huge part of the index: Nearly half of the total index size. The biggest sinner is not surprisingly the content field, but we need that for highlighting and potentially text extraction from search results. As we have discovered that we can avoid storing DocValued fields, at the price of increased document retrieval time, we have turned off storing for several fields.
  • DocValue everything! Or at least a lot more than we did initially. Enabling DocValues for a field and getting low-overhead faceting turned out to be a lot disk-space-cheaper than we thought. As every other feature request from the researchers seems to be “We would also like to facet on field X”, our new strategy should make them at least half happy.
  • DocValues are required for some fields. Due to internal limits on facet.method=fc without DocValues, it is simply not possible to do faceting if the number of references gets high.
  • Faceting on outgoing links is highly valuable. Being able to facet on links makes it possible to generate real-time graphs for interconnected websites. Links with host- or domain granularity are easily handled and there is no doubt that those should be enabled. Based on positive experimental results with document-granularity links faceting (see section below), we will also be enabling that.
  • The addition of performance instrumentation made it a lot easier for us to prioritize features. We simply do not have time for everything we can think of and some specific features were very heavy.
  • Face recognition (just finding the location of faces in images, not guessing the persons)  was an interesting feature, but with a so-so success rate. Turning it on for all images would triple our indexing time and we have little need for sampling in this area, so we will not be doing it at all for this iteration.
  • Most prominent colour extraction was only somewhat heavy, but unfortunately the resulting colour turned out to vary a great deal depending on adjustment of extraction parameters. This might be useful if a top-X of prominent colours were extracted, but for now we have turned off this feature.
  • Language detection is valuable, but processing time is non-trivial and rises linearly with the number of languages to check. We lowered the number of detected languages from 20 to 10, pruning the more obscure (relative to Danish) languages.
  • Meta-data about harvesting turned out to be important for the researchers. We will be indexing the ID of the harvest-job used for collecting the data, the institution responsible and some specific sub-job-ID.
  • Disabling of image-analysis features and optimization of part of the code-base means faster indexing. Our previous speed was 7-8 days/shard, while the new one is 3-4 days/shard. As we have also doubled our indexing hardware capacity, we expect to do a full re-build of the existing index in 2 months and to catch up to the present within 6 months.
  • Our overall indexing workflow, with dedicated builders creating independent shards of a fixed size, worked very well for us. Besides some minor tweaks, we will not be changing this.
  • We have been happy with Solr 4.8. Solr 5 is just out, but as re-indexing is very costly for us, we do not feel comfortable with a switch at this time. We will do the conservative thing and stick to the old Solr 4-series, which currently means Solr 4.10.4.

Document-level links faceting

The biggest new feature will be document links. This is basically all links present on all web pages at full detail. For a single test shard with 217M documents / 906GB, there were 7 billion references to 640M unique links, the most popular link being used 2.4M times. Doing a full faceted search on *:* was understandably heavy at around 4 minutes, while ad hoc testing of “standard” searches resulted in response times varying from 50 ms to 3500 ms. Scaling up to 25 shards/machine, it will be 175 billion references to 16 billion values. It will be interesting to see the accumulated response time.

We expect this feature to be used to generate visual graphs of interconnected resources, which can be navigated in real-time. Or at least you-have-to-run-to-get-coffee-time. For the curious, here is the histogram for links in the test-shard:

References #terms
1 425,799,733
2 85,835,129
4 52,695,663
8 33,153,759
16 18,864,935
32 10,245,205
64 5,691,412
128 3,223,077
256 1,981,279
512 1,240,879
1,024 714,595
2,048 429,129
4,096 225,416
8,192 114,271
16,384 45,521
32,768 12,966
65,536 4,005
131,072 1,764
262,144 805
524,288 789
1,048,576 123
2,097,152 77
4,194,304 1

 

Long tail packed counters for faceting

February 26, 2015 by

Our home brew Sparse Faceting for Solr is all about counting: When calculating a traditional String facet such as

Author
- H.C.Andersen (120)
- Brothers Grimm (90)
- Arabian Nights (7)
- Carlo Collodi (5)
- Ronald Reagan (2)

the core principle is to have a counter for each potential term (author name in this example) and update that counter by 1 for each document with that author. There are different ways of handling such counters.

Level 0: int[]

Stock Solr uses an int[] to keep track of the term counts, meaning that each unique term takes up 32 bits or 4 bytes of memory for counting. Normally that is not an issue, but with 6 billion terms (divided between 25 shards) in our Net Archive Search, this means a whopping 24GB of memory for each concurrent search request.

Level 1: PackedInts

Sparse Faceting tries to be clever about this. An int can count from 0 to 2 billion, but if the maximum number of documents for any term is 3000, there will be a lot of wasted space. 2^12 = 4096, so in the case of maxCount=3000, we only need 12 bits/term to keep track of it all. Currently this is handled by using Lucene’s PackedInts to hold the counters. With the 6 billion terms, this means 9GB of memory. Quite an improvement on the 24GB from before.
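
A sketch of the principle, using Lucene’s PackedInts API directly (simplified compared to the actual sparse faceting code):

import org.apache.lucene.util.packed.PackedInts;

// Counters sized to the overall maxCount instead of a full 32 bit int each.
public class PackedCounters {
  private final PackedInts.Mutable counts;

  public PackedCounters(int uniqueTerms, long maxCount) {
    // e.g. maxCount=3000 -> PackedInts.bitsRequired(3000) = 12 bits/term instead of 32
    counts = PackedInts.getMutable(
        uniqueTerms, PackedInts.bitsRequired(maxCount), PackedInts.COMPACT);
  }

  public void inc(int ordinal) {
    counts.set(ordinal, counts.get(ordinal) + 1);
  }

  public long get(int ordinal) {
    return counts.get(ordinal);
  }
}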

Level 2: Long tail PackedInts with fallback

Packing counters has a problem: The size of all the counters is dictated by the maxCount. Just a single highly used term can nullify the gains: If all documents share one common term, the size of the individual counters will be log2(docCount) bits. With a few hundred million documents, that puts the size at 27-29 bits/term, very close to the int[] representation’s 32 bits.

Looking at the Author-example at the top of this page, it seems clear that the counts for the authors are not very equal: The top-2 author has counts a lot higher than the bottom-3. This is called a long tail and it is a very common pattern. This means that the overall maxCount for the terms is likely to be a lot higher than the maxCount for the vast majority of the terms.

While I was loudly lamenting all the wasted bits, Mads Villadsen came by and solved the problem: What if we keep track of the terms with high maxCount in one structure and the ones with a lower maxCount in another structure? Easy enough to do with a lot of processing overhead, but tricky to do efficiently. Fortunately Mads also solved that (my role as primary bit-fiddler is in serious jeopardy). The numbers in the following explanation are just illustrative and should not be seen as the final numbers.

The counter structure

We have 200M unique terms in a shard. The terms are long tail-distributed, with the most common ones having maxCount in the thousands and the vast majority with maxCount below 100.

We locate the top-128 terms and see that their maxCounts range from 2921 to 87. We create an int[128] to keep track of their counts and call it head.

head
Bit 31 30 29 2 1 0
Term h_0 0 0 0 0 0 0
Term h_1 0 0 0 0 0 0
0 0 0 0 0 0
Term h_126 0 0 0 0 0 0
Term h_127 0 0 0 0 0 0

The maxCount for the terms below the 128 largest ones is 85. 2^7=128, so we need 7 bits to hold each of those. We allocate a PackedInt structure with 200M entries of 7+1 = 8 bits and call it tail.

tail
Bit 7* 6 5 4 3 2 1 0
Term 0 0 0 0 0 0 0 0 0
Term 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Term 199,999,998 0 0 0 0 0 0 0 0
Term 199,999,999 0 0 0 0 0 0 0 0

The tail has an entry for all terms, including those in head. For each of the large terms in head, we locate its position in tail. At that tail-counter, we set the value to the term’s index in the head counter structure and set the highest bit to 1.

Let’s say that head entry term h_0 is located at position 1 in tail, h_126 is located at position 199,999,998 and h_127 is located at position 199,999,999. After marking the head entries in the tail structure, it would look like this:

tail with marked heads
Bit 7* 6 5 4 3 2 1 0
Term 0 0 0 0 0 0 0 0 0
Term 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Term 199,999,998 1 1 1 1 1 1 1 0
Term 199,999,999 1 1 1 1 1 1 1 1

Hanging on so far? Good.

Incrementing the counters

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if ((count & 128) == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
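
The four steps as a small Java sketch; head and tail are the structures described above, with tail pre-marked as in the previous section (the class is illustrative):

import org.apache.lucene.util.packed.PackedInts;

// Increment following the four steps above. Entries for the top-128 terms have
// bit 7 set and their head index in bits 0-6; everything else counts in tail.
public class LongTailCounters {
  private final int[] head;                // counters for the top-128 terms
  private final PackedInts.Mutable tail;   // 7 count bits + 1 marker bit per term

  public LongTailCounters(int[] head, PackedInts.Mutable markedTail) {
    this.head = head;
    this.tail = markedTail;
  }

  public void increment(int ordinal) {
    long count = tail.get(ordinal);        // step 1: read the tail entry
    if ((count & 128) == 128) {            // step 2: check bit 7
      head[(int) (count & 127)]++;         // step 3: increment the head counter
    } else {
      tail.set(ordinal, count + 1);        // step 4: increment the tail counter
    }
  }
}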

Pros and cons

In this example, the counters take up 6 billion * 8 bits + 25 * 128 * 32 bits = 5.7GB. The performance overhead, compared to the PackedInts version, is tiny: Whenever a head bit is encountered, there will be an extra read to get the old head value before writing the value+1. As head will statistically be heavily used, it is likely to be in Level 2/3 cache.

This is just an example, but it should be quite realistic as approximate values from the URL field in our Net Archive Index have been used. Nevertheless, it must be stressed that the memory gains from long tail PackedInts are highly dictated by the shape of the long tail curve.

Afterthought

It is possible to avoid the extra bit in tail by treating the large terms as any other term, until their tail-counters reach maximum (127 in the example above). When a counter’s max has been reached, the head-counter can then be located using a lookup mechanism, such as a small HashMap or maybe just a linear scan through a short array with the ordinals and counts for the large terms. This would reduce the memory requirements to approximately 6 billion * 7 bits = 5.3GB. Whether this memory/speed trade-off is better or worse is hard to guess and depends on result set size.

Implementation afterthought

The long tail PackedInts could implement the PackedInts-interface itself, making it usable elsewhere. Its constructor could take another PackedInts filled with maxCounts or a histogram with maxbit requirements.

Heureka update 2015-02-27

There is no need to mark the head terms in the tail structure up front. All the entries in tail act as standard counters until the highest bit is set. At that moment the bits used for counting switch to being a pointer into the next available entry in the head counters. The update workflow thus becomes

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if ((count & 128) == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
  5. If the counter reaches bit 7, move the count to head and change the counter-bits to pointer-bits:
    if ((count+1) == 128) { head.set(headpos, count+1); tail.set(ordinal, 128 | headpos++) }

Pros and cons

The new logic means that initialization and resetting of the structure is simply a matter of filling them with 0. Update performance will be on par with the current PackedInts implementation for all counters whose value is within the cutoff. After that the penalty of an extra read is paid, but only for the overflowing values.

The memory overhead is unchanged from the long tail PackedInts implementation and still suffers from the extra bit used for signalling count vs. pointer.

Real numbers 2015-02-28

The store-pointers-as-values approach has the limitation that there can only be as many head counters as the maxCount for tail. Running the numbers on the URL-field for three of the shards in our net archive index resulted in bad news: The tip of the long tail shape was not very pointy and it is only possible to shave 8% off the counter size. Far less than the estimated 30%. The Packed64 in the table below is the current structure used by sparse faceting.

Shard 1 URL: 228M unique terms, Packed64 size: 371MB
tail BPV required memory saved head size
11 342MB 29MB / 8% 106
12 371MB 0MB / 0% 6

However, we are in the process of experimenting with faceting on links, which has a markedly higher tip in the long tail shape. From a nearly fully built test shard we have:

8/9 built shard, links: 519M unique terms, Packed64 size: 1427MB
tail BPV required memory saved head size
15 1038MB 389MB / 27% 14132
16 1103MB 324MB / 23% 5936
17 1168MB 260MB / 18% 3129
18 1233MB 195MB / 14% 1374
19 1298MB 130MB / 9% 909
20 1362MB 65MB / 5% 369
21 1427MB 0MB / 0% 58

For links, the space saving was 27% or 389MB for the nearly-finished shard. To zoom out a bit: Doing faceting on links for our full corpus with stock Solr would take 50GB. Standard sparse faceting would use 35GB and long tail would need 25GB.

Due to sparse faceting, response time for small result sets is expected to be a few seconds for the links-facet. Larger result sets, not to mention the dreaded *:* query, would take several minutes, with worst-case (qualified guess) around 10 minutes.

Three-level long tail 2015-02-28

Previously:

  • pointer-bit: Letting the values in tail switch to pointers when they reach maximum has the benefit of very little performance overhead, with the downside of taking up an extra bit and limiting the size of head.
  • lookup-signal: Letting the values in tail signal “find the counter in head” when they reach maximum, has the downside that a sparse lookup-mechanism, such as a HashMap, is needed for head, making lookups comparatively slow.

New idea: Mix the two techniques. Use the pointer-bit principle until there is no more room in head. head-counters beyond that point all get the same pointer (all value bits set) in tail and their position in head is determined by a sparse lookup-mechanism ord2head.

This means that

  • All low-value counters will be in tail (very fast).
  • The most used high-value counters will be in head and will be referenced directly from tail (fast).
  • The less used high-value counters will be in head and will require a sparse lookup in ord2head (slow).

Extending the pseudo-code from before:

value = tail.get(ordinal)
if (value == 255) { // indirect pointer signal
  head.inc(ord2head.get(ordinal))
} else if ((value & 128) == 128) { // pointer-bit set
  head.inc(value & 127)
} else { // term-count = value
  value++;
  if (value != 128) { // tail-value ok
    tail.set(ordinal, value)
  } else { // tail-value overflow
    head.set(headpos, value)
    if (headpos < 127) { // direct pointer
      tail.set(ordinal, 128 | headpos++)
    } else { // indirect pointer
      tail.set(ordinal, 255)
      ord2head.put(ordinal, headpos++)
    }
  }
}

Exhaustive ID filters in Solr

February 4, 2015 by

Analysis of the material in our Net Archive index is challenging for many reasons. One of them is agreeing on what we’re looking at: It is rarely the full amount of data; our researchers want to define subsets. Subsets can be sliced in different ways: At (W)ARC level, as manual selection of specific URLs harvested at specific times, as a graph of links etc. They can also be specified with Solr. Using Solr, there are two common and technically distinct ways of specifying subsets:

  • Subset definition by simple boolean expressions: “All material from the domain example.com harvested in 2011” or “All PDF files that mention ‘disneyland after dark’ and have a link to domain:d-a-d.dk”.
    Such subsets are extremely easy to work with in Solr, as they are just filters: They are very fast and work fully with faceting, sorting and grouping.
  • Subset definition by weighting: “One instance of all URLs, de-duplicated by taking the one as close to June 1, 2011 as possible”.
    For some functionality, such as extracting simple top-X results, this can be handled by grouping in Solr. Unfortunately this scales poorly with faceting, groups within groups is not a Solr feature, and some functionality, such as counting the number of distinct groups (the result set size), is not possible to do reliably with SolrCloud.

So what can we do about the pesky subset-by-weighting? The idea is to add support for exhaustive ID filters, where exhaustive means potentially all documents at single-document granularity level. With such a filter we could define arbitrary subsets on an existing index, not even constrained by Solr’s own powers of expression. It looks a bit daunting with billions of documents, but given some constraints it seems doable.

Creating the document list

The first problem is to determine which documents should be in the filter. If the documents are defined outside of Solr, this could just be a list of [URL, timestamp] or similar parameters uniquely identifying the documents. If the documents are to be aggregated from the SolrCloud, some manual work might be needed.

Let’s see how to do it for “One instance of all URLs, de-duplicated by taking the one as close to June 1, 2011 as possible”. In a single-shard setup it might be possible to do it with grouping on URL (1 entry/group) and sorting by distance. As stated before, this scales badly or not at all with SolrCloud and things like faceting or paging through result sets. But the principle itself is simple:

  1. For each shard, extract all the documents as tuples of [URL, date, ID], sorted by URL. This is very easily done with deep paging. If further restrictions, such as “only documents from 2010-2012”, are needed, this is done by applying a filter before extraction.
  2. Process the resulting lists individually and extend the tuples with shard-ID to [URL, date, ID, shardID].
  3. Merge-sort the list of tuples, primarily by URL, secondarily by closeness to 2011-06-01.
  4. Prune the aggregated list by only keeping the first entry for each URL.
  5. Split the aggregated list into shard-specific lists, using the shard-ID part of the tuples, keeping only the IDs.

The astute reader will have noticed that steps 1-5 can be performed in a streaming manner: The only limit to the size of the resulting ID-lists is storage space.
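
A sketch of steps 3-4, assuming the per-shard streams arrive pre-sorted by URL and, within the same URL, by closeness to the target date. Types and names are illustrative.

import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Merges per-shard tuple streams and keeps only the best entry per URL.
public class TupleMerger {
  static class Tuple {
    String url; long dateDistance; String id; String shardId;
  }
  static class Head {                         // the current element of one shard stream
    Tuple tuple; Iterator<Tuple> rest;
  }

  static void mergeAndPrune(List<Iterator<Tuple>> streams) {
    PriorityQueue<Head> queue = new PriorityQueue<Head>(
        Math.max(1, streams.size()), new Comparator<Head>() {
      public int compare(Head a, Head b) {
        int urlOrder = a.tuple.url.compareTo(b.tuple.url);
        return urlOrder != 0 ? urlOrder
            : Long.compare(a.tuple.dateDistance, b.tuple.dateDistance);
      }
    });
    for (Iterator<Tuple> stream : streams) {  // prime the queue with one head per stream
      if (stream.hasNext()) {
        Head head = new Head();
        head.tuple = stream.next();
        head.rest = stream;
        queue.add(head);
      }
    }
    String lastURL = null;
    while (!queue.isEmpty()) {
      Head head = queue.poll();
      if (!head.tuple.url.equals(lastURL)) {  // step 4: keep only the closest per URL
        lastURL = head.tuple.url;
        emit(head.tuple);                     // step 5 would route on tuple.shardId
      }
      if (head.rest.hasNext()) {              // advance the stream and re-insert
        head.tuple = head.rest.next();
        queue.add(head);
      }
    }
  }

  static void emit(Tuple tuple) {
    System.out.println(tuple.shardId + " " + tuple.id);
  }
}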

Named filters

At this point we have a list of entries uniquely defining documents, for each shard. We need a Solr component that can

  1. Take a user-assigned filter-ID and an URL to the list of document-definitions.
  2. Create an empty docset/bitmap, the same kind a standard filter uses.
  3. For each entry in the list of document-definitions, resolve the internal docID for the document and set that bit in the bitmap.
  4. Store the resulting bitmap for later usage, just like a standard filter, with the user-assigned filter-ID as key.
  5. When the user issues queries specifying filter-IDs, the docID bitmaps are fetched and applied as standard Solr filters, with the same minimal processing overhead.
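
A sketch of step 3 with plain Lucene 4.x calls, assuming the document-definitions are single IDs matched against the ID field. This is illustrative; a real component would also handle the user-assigned filter-ID and hook into Solr’s caching.

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

// Resolves each ID from the document-definition list to its internal docID
// and sets the corresponding bit. The resulting bitmap can be stored under
// the user-assigned filter-ID and later applied like a standard Solr filter.
public class NamedFilterBuilder {
  public static FixedBitSet buildFilter(IndexReader reader, String idField,
                                        Iterable<String> ids) throws IOException {
    FixedBitSet bits = new FixedBitSet(reader.maxDoc());
    for (AtomicReaderContext context : reader.leaves()) {
      Terms terms = context.reader().terms(idField);
      if (terms == null) {
        continue;
      }
      TermsEnum termsEnum = terms.iterator(null);
      DocsEnum docsEnum = null;
      for (String id : ids) {
        if (!termsEnum.seekExact(new BytesRef(id))) {
          continue;                            // ID not present in this segment
        }
        docsEnum = termsEnum.docs(context.reader().getLiveDocs(), docsEnum);
        int doc;
        while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
          bits.set(context.docBase + doc);     // segment docID -> global docID
        }
      }
    }
    return bits;
  }
}

Pre-sorting the ID list, as suggested in the last optimization below, would turn the seekExact calls into a sequential traversal of the ID field.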

There is a lot of potential for optimization here:

  • It would be possible for each line in the list of document-definitions to be viewed as a full-blown Solr query, seeing the whole list as a giant boolean OR query. Lines could be processed in batches by enclosing each line in the batch in parentheses and interleaving with OR, before processing it as a plain query.
  • Treating each line as a query allows the filter to be specified any way the user wants: While the granularity of the list goes down to document level, it also goes up to the full index.
  • It would be possible to see each line as a term that must match a single field (normally the ID field). Again requests could be batched, this time using the very efficient TermsQuery.
  • For ID-based lists, pre-sorting the list would open up for very efficient processing: The Solr query mechanism could be bypassed and the internal structure for the ID-field could be traversed sequentially.

Once again this can be streamed (except for the sorting idea), with no limit on the size of the list of document-definitions. Theoretically this could be tied to the generation of the corpus, avoiding temporary storage of the lists. But that would be a later project.

What we get

  • The ability for researchers to specify shared subsets of any size at document granularity, without changing the indexes.
  • Besides relevance ranking, using such filters would be equivalent to building a whole new index from the defined subset.
  • High initial cost of defining and creating filters, but extremely low subsequent overhead of using them.

Caveats

Creating the document lists would likely be very heavy, but as the premise is to create subsets of the corpus, this would normally be done once per subset defined by the researchers.

Creating the filter itself would likely be less heavy, but the processing of hundreds of millions of tiny queries, even with batching, is measured in minutes or hours. Due to the nature of filters, this step must be repeated if the index changes, so exhaustive filters would only be feasible with a rarely-updated index.

Some of Solr’s built-in functionality, such as the distance to a given date for the example case, would have to be duplicated in the external merge-sorter.

Prior work

Constraining searches by large lists of IDs is not a new idea.

  • Defining the IDs at query time is not feasible at net archive scale; even with bit packing a request could potentially be measured in gigabytes.
  • Treating the task as a security filter problem holds promises, at least for the filtering part.
  • Adding a token to the documents in a subset would work, but it makes experimentation very heavy and (of course) requires the indexes to be updated.
  • The Terms Filter Lookup in Elasticsearch seems quite like the persistent filter described above. As does SOLR-1715, which unfortunately is talk-only at this point.
  • Update 2015-02-05: The example of generating the document list seems to fit extremely well with Heliosearch’s Streaming Aggregation, which will hopefully be part of Solr with SOLR-7082.

British Library and IIPCTech15

February 2, 2015 by

For a change of pace: A not too technical tale of my recent visit to England.

The people behind IIPC Technical Training Workshop – London 2015 had invited yours truly as a speaker and participant in the technical training. IIPC stands for International Internet Preservation Consortium and I was to talk about using Solr for indexing and searching preserved Internet resources. That sounded interesting and Statsbiblioteket encourages interinstitutional collaboration, so the invitation was gladly accepted. Some time passed and British Library asked if I might consider arriving a few days early and visit their IT development department? Well played, BL, well played.

I kid. For those not in the know, British Library made the core software we use for our Net Archive indexing project and we are very thankful for that. Unfortunately they do have some performance problems. Spending a few days, primarily talking about how to get their setup to work better, was just reciprocal altruism working. Besides, it turned out to be a learning experience for both sides.

It is the little things, like the large buses

At British Library, Boston Spa

The current net archive oriented Solr setups at British Library are using SolrCloud with live indexes on machines with spinning drives (aka harddisks) and a – relative to index size – low amount of RAM. At Statsbiblioteket, our experience tells us that such setups generally have very poor performance. Gil Hoggarth and I discussed Solr performance at length and he was tenacious in exploring every option available. Andy Jackson partook in most of the debates. Log file inspections and previous measurements from the Statsbiblioteket setups seemed to sway them in favour of different base hardware, or to be specific: Solid State Drives. The open question is how much such a switch would help or if it would be a better investment to increase the amount of free memory for caching.

  • A comparative analysis of performance with spinning drives vs. SSDs for multi-TB Solr indexes on machines with low memory would help other institutions tremendously, when planning and designing indexing solutions for net archives.
  • A comparative analysis of performance with different amounts of free memory for caching, as a fraction of index size, for both spinning drives and SSDs, would be beneficial on a broader level; this would give an idea of how to optimize bang-for-the-buck.

Illuminate the road ahead

Logistically the indexes at British Library are quite different from the index at Statsbiblioteket: They follow the standard Solr recommendation and treat all shards as a single index, both for index and search. At Statsbiblioteket, shards are built separately and only treated as a whole index at search time. The live indexes at British Library have some downsides, namely re-indexing challenges, distributed indexing logistics overhead and higher hardware requirements. They also have positive features, primarily homogeneous shards and the ability to update individual documents. The updating of individual documents is very useful for tracking meta-data for resources that are harvested at different times, but have unchanged content. Tracking of such content, also called duplicate handling, is a problem we have not yet considered in depth at Statsbiblioteket. One of the challenges of switching to static indexes is thus:

  • When a resource is harvested multiple times without the content changing, it should be indexed in such a way that all retrieval dates can be extracted and such that the latest (and/or the earliest?) harvest date can be used for sorting, grouping and/or faceting.

One discussed solution is to add a document for each harvest date and use Solr’s grouping and faceting features to deliver the required results. The details are a bit fluffy as the requirements are not strictly defined.

At the IIPC Technical Training Workshop, London 2015

The three pillars of the workshop were harvesting, presentation and discovery, with the prevalent tools being Heritrix, Wayback and Solr. I am a newbie in two thirds of this world, so my outsider thoughts will focus on discovery. Day one was filled with presentations, with my Scaling Net Archive Indexing and Search as the last one. Days two and three were hands-on with a lot of discussions.

As opposed to the web archive specific tools Heritrix and Wayback, Solr is a general purpose search engine: There is not yet a firmly established way of using Solr to index and search net archive material, although the work from UKWA is a very promising candidate. Judging by the questions asked at the workshop, large scale full-text search is relatively new in the net archive world and as such the community lacks collective experience.

Two large problems of indexing net archive material are analysis and scaling. As stated, UKWA has the analysis part well in hand. Scaling is another matter: Net archives typically contain billions of documents, many of them with a non-trivial amount of indexable data (webpages, PDFs, DOCs etc). Search responses ideally involve grouping or faceting, which requires markedly more resources than simple search. Fortunately, at least from a resource viewpoint, most countries do not allow harvested material to be made available to the general public: The number of users, and thus of concurrent requests, tends to be very low.

General recommendations for performant Solr systems tend to be geared towards small indexes or high throughput, minimizing latency and maximizing the number of requests that can be processed by each instance. Down to Earth, the bottleneck tends to be random reads from the underlying storage, easily remedied by adding copious amounts of RAM for caching. While the advice arguably scales to net archive indexes in the multi-TB range, the cost of terabytes of RAM, as well as the number of machines needed to hold them, is often prohibitive. Bearing in mind that the typical user group of a net archive consists of very few people, the part about maximizing the number of supported requests is overkill. With net archives as outliers in the Solr world, there is very little existing shared experience on which to base general recommendations.

  • As hardware cost is a large fraction of the overall cost of doing net archive search, in-depth descriptions of setups are very valuable to the community.

All different, yet the same

Measurements from British Library as well as Statsbiblioteket show that faceting on high-cardinality fields is a resource hog when using SolrCloud. This is problematic for exploratory use of the index. While it can be mitigated with more hardware or software optimization, switching to heuristic counting holds the promise of very large speed-ups.

  • The performance benefits and the cost in precision of approximate search results should be investigated further. This area is not well-explored in Solr and mostly relies on custom implementations.

On the flip side of fast exploratory access is the extraction of large result sets for further analysis. SolrCloud does not scale for certain operations, such as deep paging within facets and counting of unique groups. Other operations, such as percentiles in the AnalyticsComponent, are not currently possible. As the alternative to using the index tends to be very heavy Hadoop processing of the raw corpus, this is an area worth investing in.

  • The limits of result set extractions should be expanded and alternative strategies, such as heuristic approximation and per-shard processing with external aggregation (sketched below), should be attempted.
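
As a rough illustration of the per-shard idea, the sketch below (again SolrJ 4.x style, with placeholder shard URLs and a placeholder domain field) queries each shard directly with distrib=false and merges the facet counts on the client. A plain HashMap obviously only works for fields of moderate cardinality; this is a sketch of the strategy, not a production extractor.

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.FacetField;

  public class PerShardFacetMerge {
    public static void main(String[] args) throws Exception {
      // Placeholder shard URLs; the real setup has 25 of them
      String[] shardUrls = {
          "http://localhost:8983/solr/shard1",
          "http://localhost:8983/solr/shard2"
      };

      Map<String, Long> merged = new HashMap<>();
      for (String url : shardUrls) {
        HttpSolrServer shard = new HttpSolrServer(url);
        SolrQuery query = new SolrQuery("kittens");
        query.set("distrib", "false"); // keep the request on this shard only
        query.setRows(0);
        query.setFacet(true);
        query.addFacetField("domain"); // placeholder field name
        query.setFacetLimit(-1);       // all terms; only sane for moderate cardinality

        FacetField domains = shard.query(query).getFacetField("domain");
        for (FacetField.Count count : domains.getValues()) {
          merged.merge(count.getName(), count.getCount(), Long::sum);
        }
      }
      System.out.println("Distinct domains in result set: " + merged.size());
    }
  }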

On a personal note

Visiting British Library and attending the IIPC workshop was a blast. Being embedded in tech talk with intelligent people for 5 days was exhausting and very fulfilling. Thank you all for the hospitality and for pushing back when my claims sounded outrageous.

Finding the bottleneck

January 6, 2015 by

While our Net Archive search performs satisfactorily, we would like to determine how well-balanced the machine is. To recap, it has 16 cores (32 with Hyper Threading), 256GB RAM and 25 Solr shards @ 900GB. When running, it uses about 150GB for Solr itself, leaving 100GB memory for disk cache.

Test searches are for 1-3 random words from a Danish dictionary, with faceting on 1 very large field (billions of unique values, billions of references), 2 large fields (millions of unique values, billions of references) and 3 smaller fields (thousands of unique values, billions of references). Unless otherwise noted, searches were issued one request at a time.

Scaling cores

Under Linux it is quite easy to control which cores a process utilizes, using the command taskset. We tested core scaling by doing the following with different sets of cores:

  1. Shut down all Solr instances
  2. Clear the disk cache
  3. Start up all Solr instances, limited to specific cores
  4. Run the standard performance test

In the chart below, ht means that there is the stated number of cores + their Hyper Threaded counterpart. In other words, 8ht means 8 physical cores but 16 virtual ones.

[Chart: Scaling number of cores and use of Hyper Threading]

Observations:

  • Hyper Threading does provide a substantial speed boost.
  • The differences between 8ht cores and 16 or 16ht cores are not very big.

Conclusion: For standard single searches, which is the design scenario, 16 cores seem to be overkill. More complex queries would likely raise the need for CPU power, though.

Scaling shards

Changing the number of shards in the SolrCloud setup was simulated by restricting queries to run on specific shards, using the shards argument. This was not the best test, as it measured the combined effect of the shard limitation and the percentage of the index held in disk cache; e.g. limiting the query to shards 1 & 2 meant that about 50GB of memory would be used for disk cache per shard, while limiting to shards 1, 2, 3 & 4 meant only 25GB of disk cache per shard.
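
For reference, restricting a request to specific shards is just a matter of passing the shards parameter. A minimal SolrJ sketch with placeholder host and core names could look like this:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class ShardLimitedSearch {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/netarchive");

      SolrQuery query = new SolrQuery("kittens");
      // Only consult the listed shards (placeholder host and core names)
      query.set("shards", "searcher1:8983/solr/shard1,searcher1:8983/solr/shard2");

      System.out.println("Hits: " + solr.query(query).getResults().getNumFound());
    }
  }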

[Chart: Scaling number of active shards]

Note: These tests were done on performance-degraded drives, so the absolute response times are higher than they should be. The relative differences should be representative enough.

Observations:

  • Performance for 1-8 shards is remarkably similar.
  • Going from 8 to 16 shards is 100% more data at half performance.
  • Going from 16 to 24 shards is only 50% more data, but also halves performance.

Conclusion: Raising the number of shards further on an otherwise unchanged machine would likely degrade performance fast. A new machine seems like the best way to increase capacity, the less guaranteed alternative being more RAM.

Scaling disk cache

A Java program was used to reserve part of the free memory, by allocating a given amount of memory as long[] and randomly changing the content. This effectively controlled the amount of memory available for disk cache for the Solr instances. The Solr instances were restarted and the disk cache cleared between each test.
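
The program itself is trivial. A minimal sketch of the idea follows; the class name and the throttling details are mine, not the exact code used in the test, and it should be run with an -Xmx a bit above the requested amount:

  import java.util.Random;

  public class MemoryHog {
    private static final int LONGS_PER_GB = 1024 * 1024 * 1024 / 8;

    public static void main(String[] args) throws InterruptedException {
      int gigabytes = Integer.parseInt(args[0]);

      // Grab the memory as 1GB long[] chunks
      long[][] hog = new long[gigabytes][];
      for (int i = 0; i < gigabytes; i++) {
        hog[i] = new long[LONGS_PER_GB];
      }

      // Keep writing random values at random positions so the pages stay
      // resident and cannot be reclaimed for disk caching
      Random random = new Random();
      while (true) {
        for (int i = 0; i < 100_000; i++) {
          hog[random.nextInt(gigabytes)][random.nextInt(LONGS_PER_GB)] = random.nextLong();
        }
        Thread.sleep(100); // touch in bursts to avoid monopolizing a full CPU core
      }
    }
  }

Reserving 100GB would then be something like java -Xmx102g MemoryHog 100.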

[Chart: Scaling free memory for disk cache]

Observations:

  • A maximum running time of 10 minutes was far too little for this test, leaving very few measuring points for 54GB, 27GB and 7GB disk cache.
  • Performance degrades exponentially when the amount of disk cache drops below 100GB.

Conclusion: While 110GB (0.51% of the index size) memory for disk cache delivers performance well within our requirements, it seems that we cannot use much of the free memory for other purposes. It would be interesting to see how much performance would increase with even more free memory, for example by temporarily reducing the number of shards.

Scaling concurrent requests

Due to limited access, we only need acceptable performance for one search at a time. Because of the high-cardinality URL field (~6 billion unique values), the memory requirement for a facet call is approximately 10GB, severely limiting the maximum number of concurrent requests. Nevertheless, it is interesting to see how much performance changes when the number of concurrent requests rises. To avoid reducing the disk cache, we only tested with 1 and 2 concurrent requests.
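
The harness for firing the concurrent requests is nothing fancy either; a sketch using SolrJ and a fixed thread pool (the URL and query terms are placeholders) could look like this:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class ConcurrentSearchTest {
    public static void main(String[] args) throws Exception {
      final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/netarchive");
      final int concurrent = 2;

      ExecutorService pool = Executors.newFixedThreadPool(concurrent);
      List<Future<Long>> timings = new ArrayList<>();
      for (int i = 0; i < concurrent; i++) {
        timings.add(pool.submit(new Callable<Long>() {
          @Override
          public Long call() throws Exception {
            long start = System.nanoTime();
            solr.query(new SolrQuery("random dictionary words")); // placeholder query
            return (System.nanoTime() - start) / 1_000_000; // response time in ms
          }
        }));
      }
      for (Future<Long> timing : timings) {
        System.out.println("Response time: " + timing.get() + " ms");
      }
      pool.shutdown();
    }
  }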

[Chart: Scaling concurrent requests with heavy faceting]

Observations (over multiple runs; only one run is shown in the graph):

  • For searches with small to medium result sets (aka “normal” searches), performance for 2 concurrent requests was nearly twice as bad as for 1 request.
  • For searches with large result sets, performance for 2 concurrent requests was more than twice as bad as for 1 request. This is surprising, as a slightly better than linear performance drop was expected.

Conclusion: Further tests seem to be in order due to the surprisingly bad scaling. One possible explanation is that memory speed is the bottleneck. Due to the memory overhead of heavy faceting, limiting the number of concurrent requests to 1 or 2 and queuing further requests is the best option for maintaining a stable system.

Scaling requirements

Admittedly, the whole facet-on-URL thing might not be essential for the user experience. If we avoid faceting on that field and only facet on the more sane fields, such as host, domain and 3 smaller fields, we can turn up the number of concurrent requests without a negative impact on the disk cache.

[Chart: Scaling concurrent requests with moderate faceting]

Observations:

  • Mean performance with 1 concurrent request is 10 times better, compared to the full faceting scenario.
  • From 1-4 threads, latency increases (bad) and throughput improves (good).
  • From 4-32 threads, latency increases (bad) but throughput does not improve (double bad).

Conclusion: As throughput does not improve for more than 4 concurrent threads, using a limit of 4 seems beneficial. However, as we are planning to add faceting on links_domains and links_hosts, as well as grouping on URL, the measured performance is not fully representative of future use of search in the Net Archive.

Samsung 840 EVO degradation

December 28, 2014 by

Rumour has it that our 25 lovely 1TB Samsung 840 EVO drives in our Net Archive search machine do not perform well when data are left untouched for months. Rumour in this case being solidly confirmed, complete with a firmware fix from Samsung. In our setup, index shards are written once, then left untouched for months or even years. Exactly the circumstances that trigger the performance degradation.

Measurements, please!

Our 25 shards were built over half a year, giving us a unique opportunity to measure drives in different states of decay. The first experiment was very simple: Just read all the data from the drive sequentially by issuing cat index/* > /dev/null and plot the measured time spent, with the age of the files on the x-axis. That shows the impact on bulk read speed. The second experiment was to issue Solr searches to each shard in isolation, testing search speed one drive at a time. That shows the impact on small random reads.

[Chart: Samsung 840 EVO performance degradation]

For search as well as bulk read, lower numbers are better. The raw numbers for 7 drives follow:

Months  25% search  median search  75% search  95% search  mean search  Bulk read hours
0.9     36          50             69          269         112          1.11
2.0     99          144            196         486         203          3.51
3.0     133         198            269         590         281          6.06
4.0     141         234            324         670         330          8.70
5.1     133         183            244         520         295          5.85
6.0     106         158            211         433         204          5.23
7.0     105         227            333         703         338          10.49

Inspecting the graph, it seems that search performance quickly gets worse until the data are about 4 months old. After that it stabilizes. Bulk reading, on the other hand, continues to worsen throughout all 7 months, but that has little relevance for search.

The Net Archive search uses SolrCloud for querying the 25 shards simultaneously and merging the result. We only had 24 shards at the previous measurement 6 weeks ago, but the results should still be comparable. Keep in mind that our goal is to have median response times below 2 seconds for all standard searches; searches matching the full corpus and similar are allowed to take longer.

[Chart: Performance for full search]

The distinct hill is present both for the old and the new measurements: See Even sparse faceting is limited for details. But the hill has grown for the latest measurements; response times have nearly doubled for the slowest searches. How come it got that much worse during just 6 weeks?

Theory: In a distributed setup, the speed is dictated by the slowest shard. As the data gets older on the un-patched Samsung drives, the chance of slow reads rises. Although the median response time for a search on a shard with 3-month-old data is about the same as for one with 7-month-old data, the chance of a very poorly performing search rises. So as the whole collection of drives got 6 weeks older, the chance of completing a SolrCloud search without at least one of the drives performing poorly fell.
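
To put some illustrative (entirely made-up) numbers on that: if each of the 25 drives independently had a 5% chance of serving a slow read during a given search, the chance of hitting at least one slow shard would be 1 - 0.95^25 ≈ 72%; bump the per-drive chance to 10% and it becomes 1 - 0.90^25 ≈ 93%. Modest per-drive degradation thus compounds quickly at the SolrCloud level.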

Note how our overall median response time actually got better with the latest measurement, although the mean (average) got markedly worse. This is due to the random distribution of result set sizes. The chart paints a much clearer picture.

Well, fix it then!

The good news is that there is a fix from Samsung. The bad news is that we cannot upgrade the drives using the controller on the server. Someone has to go through the process of removing them from the machine and perform the necessary steps on a workstation. We plan on doing this in January and besides the hassle and the downtime, we foresee no problems with it.

However, as the drive bug concerns old data, rewriting all of the 25*900GB of index files should freshen the charges and temporarily bring the drives back to speed. Mads Villadsen suggested using dd if=somefile of=somefile conv=notrunc, so let’s try that. For science!

*crunching*

It took nearly 11 hours to process drive 1, which had the oldest data. That fits well with the old measurement of bulk speed for that drive, which was 10½ hours for 900GB. After the rewrite, bulk speed increased to 1 hour for 900GB. Reviving the 24 other drives was done in parallel with a mean speed of 17MB/s, presumably limited by the controller. Bulk read speeds for the revived drives were 1 hour for 900GB, except for drive 3 which took 1 hour and 17 minutes. Let’s file that under pixies.

Repeating the individual shard search performance test from before, we get the following results:

[Chart: Samsung 840 EVO performance after reviving]

Note that the x-axis is now drive number instead of data age. As can be seen, the drives are remarkably similar in performance. Compared to the old test, they are at the same speed as the drive with 1-month-old data, indicating that the degradation sets in after more than 1 month and not immediately. The raw numbers for the same 7 drives as listed in the first table are:

Months  25% search  median search  75% search  95% search  mean search  Bulk read hours
0.3     34          52             69          329         106          1.00
0.3     41          55             69          256         104          1.00
0.3     50          63             85          330         131          1.00
0.3     39          58             77          301         108          1.00
0.3     40          57             74          314         106          1.01
0.3     37          50             66          254         96           1.01
0.3     24          33             51          344         98           1.00

Running the full distributed search test and plotting the results together with the 1½-month-old measurements as well as the latest measurements on the degraded drives gives us the following.

[Chart: Samsung 840 EVO performance after reviving, full distributed search]

Performance is back at the same level as 1½ months ago, but how come it is not better than that? A quick inspection of the machine revealed that 2 backup jobs had started and were running during the last test; it is unknown how heavy their impact is on the drives, so the test will be re-run when the backups have finished.

Conclusion

The performance degradation of non-upgraded Samsung 840 EVO drives is very real, and it becomes serious after a couple of months. Should you own such drives, it is highly advisable to apply the fix from Samsung.

