Facet filtering

April 10, 2015

In generation 2 of our net archive search we plan to experiment with real-time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is with regular expressions. The main problem with that approach is that all the Strings for the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.* which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp-requirement are requested, chances are that millions of terms must be resolved to Strings and checked before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher chance of wasp-related links and would thus require far fewer resolutions.

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
    q=gardening
  2. The most popular resources that pages on a given site link to externally
    q=domain:example.com&facet.sparse.blacklist=https?://[^/]*example\.com
  3. The most popular images that pages on a given site link to internally
    q=domain:example.com&facet.sparse.whitelist=https?://[^/]*example\.com/.*\.(gif|jpg|jpeg|png)$
  4. The most popular non-Danish resources that Danish pages link to
    q=domain_suffix:dk&facet.sparse.blacklist=https?://[^/]*\.dk
  5. The most popular JavaScripts that all pages from a given year link to
    q=harvest_year:2015&facet.sparse.whitelist=.*\.js$
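To make the filtering principle concrete, here is a minimal sketch (not the actual Sparse Faceting code; the class name and the choice of substring matching via find() are assumptions): each candidate term is resolved to a String and only enters the facet result if it passes the optional whitelist and blacklist patterns.

import java.util.regex.Pattern;

// Sketch of the filtering principle: a term only enters the facet result if it
// matches the whitelist (when given) and does not match the blacklist (when given).
public class FacetTermFilter {
    private final Pattern whitelist; // may be null
    private final Pattern blacklist; // may be null

    public FacetTermFilter(String whitelistRegexp, String blacklistRegexp) {
        whitelist = whitelistRegexp == null ? null : Pattern.compile(whitelistRegexp);
        blacklist = blacklistRegexp == null ? null : Pattern.compile(blacklistRegexp);
    }

    public boolean isAllowed(String term) {
        if (whitelist != null && !whitelist.matcher(term).find()) {
            return false; // must match the whitelist, if one is given
        }
        return blacklist == null || !blacklist.matcher(term).find();
    }
}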

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative for the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

900GB, 200M+ docs, 500M+ unique values, 5 billion+ references

Observations

  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.

Conclusion

Worst case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As this is tightly correlated to the result set size, which is fast to calculate, adding a warning with “This query is likely to take minutes to process” could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead for regular expressions was less than feared. Real-time graphs, or at least fill-the-coffee-cup-time graphs, seem doable, at the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes 2015-04-11

@maxxkrakoa noted “@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each.” He is right and it does help significantly for our 6-field faceting. For single-field faceting, support for real multi-thread counting would be needed.

The simple way of doing multi-thread counting is to update multiple copies of the counter structure and merge them at the end. For a field with 500M+ unique values, that is likely to be prohibitive with regard to both memory and speed: The time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the “things to ponder when sleep is not coming” list.

Additional notes 2015-04-13

It seems that Java 1.7’s AtomicLongArray performs very well: Quite comparable to plain long[] in the context of multi-millions of counters, where contention is low. This raises the probability of implementing true threaded faceting markedly.
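A quick way to get a feel for that claim is a micro-benchmark in the same spirit. The sketch below is only an illustration (the counter count, update count and seed are arbitrary, no JIT warmup is done, and it needs a couple of GB of heap); it is not the benchmark referred to above.

import java.util.Random;
import java.util.concurrent.atomic.AtomicLongArray;

public class CounterBench {
    public static void main(String[] args) {
        final int COUNTERS = 100_000_000;  // ~800MB as long[] plus ~800MB as AtomicLongArray
        final int UPDATES = 10_000_000;
        long[] plain = new long[COUNTERS];
        AtomicLongArray atomic = new AtomicLongArray(COUNTERS);
        Random random = new Random(87);

        long start = System.nanoTime();
        for (int i = 0; i < UPDATES; i++) {
            plain[random.nextInt(COUNTERS)]++;                // single-threaded baseline
        }
        System.out.println("long[]:          " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        for (int i = 0; i < UPDATES; i++) {
            atomic.incrementAndGet(random.nextInt(COUNTERS)); // thread-safe increments, low contention
        }
        System.out.println("AtomicLongArray: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}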

N-plane packed counters for faceting

March 13, 2015

Faceting in Solr works well out of the box up to some millions of unique values in the facet field. There is a small performance penalty linear to the number of unique values, which begins to show after some point. Sparse faceting solves this and makes faceting performance linear to the result size. Sparse faceting also reduces memory requirements by packing the values tighter. As seen in the blog post Long tail packed counters for faceting, this packing can be improved, depending on the concrete term distribution in the facet field.

An implementation of Long tail packed counters has been created and renamed to Dual-plane counters. During development, a new idea sprung to life in the form of N-plane counters. In this blog post the different counters will be explained and performance measurements from a micro-benchmark (yes, we know, they are not very reliable) based on real-world data will be presented.

Quick re-cap from the long tail blog post:

  • We have hundreds of millions of counters, referenced by ordinal
  • Each counter corresponds to a unique term in a facet field
  • We know the maximum for each individual counter
  • The overall distribution of the maxima is long tail
  • We want the overall counter structure to be small
  • We want the overall counter structure to be fast

Instead of viewing the maxima as plain numbers, they can be viewed as required bits. If a counter has a maximum of 15, all possible values can be expressed with 4 bits, as 2⁴-1 = 15. Similarly a counter with a maximum of 100 needs 7 bits, as 2⁷-1 = 127. The collection of counters can be visualized as

Term counters as bits

The counter for term A needs 1 bit, term B needs 3 bits, term C needs 1 bit and term D needs 11 bits. If we sorted the terms by their maxima instead of alphanumerically, the distribution would look like

Shape of maxima sorted by maxima

However, as the order of the terms is dictated by Solr, sorting the terms for use with faceting would require a mapping from Solr’s term ordinals to the position in the sorted list. This option has not been explored yet, so for now we’ll stick with the original order.
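For reference, the number of required bits for a given maximum can be computed with a one-liner; Lucene's PackedInts.bitsRequired does essentially the same.

// Tiny helper illustrating the "maxima as required bits" view:
// bitsNeeded(1) == 1, bitsNeeded(15) == 4, bitsNeeded(100) == 7.
static int bitsNeeded(long maxCount) {
    return 64 - Long.numberOfLeadingZeros(maxCount);
}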

Test data

We extracted the maxima for the 640M unique links in a test-shard from our net archive and grouped them by the number of bits needed for expressing those maxima. The long tail distribution is evident. The theoretically optimal worst-case representation of the counters is the collective number of bits needed (the white squares seen above). For the sample below that is 138MB.

Bits #terms
1 425,799,733
2 85,835,129
3 52,695,663
4 33,153,759
5 18,864,935
6 10,245,205
7 5,691,412
8 3,223,077
9 1,981,279
10 1,240,879
11 714,595
12 429,129
13 225,416
14 114,271
15 45,521
16 12,966
17 4,005
18 1,764
19 805
20 789
21 123
22 77
23 1

The contestants

Stock Solr: Counter-representation with atomic numeric (y goes to 32, only bit 1-16 shown)

Stock Solr faceting uses an int[] to hold the counts. It is simple and fast. In the visualization above, the white squares are bits holding counter values, while the red area represents bits that are always 0. There is a lot of red with this counter structure. The sample counters takes up 2442MB with Stock Solr faceting. Pro: speed, con: space.

Sparse Faceting: Counter-representation with PackedInts

Sparse Faceting has the option of using PackedInts for counters. It lowers the overhead ceiling, but there is still a lot of wasted space. With Sparse Faceting, 1755MB is needed for the sample counters. Pro: speed, con: space + needs to know overall maximum.

Dual-plane: Counter-representation in low-bits with references to high-bit

Holding low counts in one structure and switching high-counts to another structure requires 1 extra bit/value for signalling which counters has a high value (see Long tail packed counters for faceting for details). Even with the extra bit, this gives an overall size reduction. Dual-plane uses 1221MB for the sample counters. Pro: speed, con: needs histogram of maxima for initialization.

N-plane: Horizontal slicing of counters

Extending Dual-plane to an arbitrary number of planes and replacing the explicit pointers with counting logic, N-plane worst-case representation takes up just 2x the size of the raw bits. In the illustration above, the blue boxes are bits representing overflow up to the next plane. The overflow bits are counted to calculate the offset in the higher planes.
To avoid horrible performance, a rank structure is needed, which adds a bit of overhead. Likewise, it might be advisable to limit the number of planes, to avoid too many random memory requests. The most promising space/performance setup for the sample data takes up 341MB with N-plane, with the smallest & slowest taking up 275MB. Pro: space, con: speed + needs all counter maxima for initialization.
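To illustrate the counting logic (a simplification, not the actual implementation): assume the overflow bits of a plane are kept in a long[]; the offset of a counter in the next plane is then the number of set overflow bits before its position, and the rank structure simply caches these counts at intervals so only a small remainder has to be counted per lookup.

// Simplified illustration of the N-plane offset calculation: count the set
// overflow bits before 'position'. A rank structure would cache these counts
// at fixed intervals instead of recounting from the start every time.
static int offsetInNextPlane(long[] overflowBits, int position) {
    int offset = 0;
    for (int word = 0; word < position >>> 6; word++) {
        offset += Long.bitCount(overflowBits[word]);           // full words before the position
    }
    long mask = (1L << (position & 63)) - 1;                   // bits below the position in its own word
    offset += Long.bitCount(overflowBits[position >>> 6] & mask);
    return offset;
}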

Bonus: All the logistic parts of the structure (the blue squares and the rank-cache) are static and can thus be shared between counters. If there is more than one concurrent faceted search at a time, each extra counting structure only holds the theoretically smallest worst case of bits, which for this sample is 138MB.

Testing

The micro-benchmark creates 640M sample maxima, randomly generated to follow the bit-histogram from the links-field in our sample shard, and initializes the different counter structures based on this. A list of ordinals is created by random selection, checked to ensure that each unique ordinal only occurs a number of times that is within the corresponding counter’s maximum. Caveat: Random updates are not optimal, as the distribution of updates is not random for real searches: It should follow the maxima-distribution. The updates are applied to each counter implementation and performance is measured as the median of 9 test-runs.
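As an illustration of how such maxima can be generated (a sketch only; it assumes the histogram is available as an array where entry b holds the number of terms needing b+1 bits, which is not necessarily how the real test code works):

import java.util.Random;

// Sketch: generate one maximum per term so that the bit-histogram is honoured.
// termsPerBit[b] = number of terms whose maximum needs exactly b+1 bits.
// The result should be shuffled afterwards, as real term maxima are not
// grouped by size (see the "original order" note earlier).
static int[] generateMaxima(long[] termsPerBit, Random random) {
    long total = 0;
    for (long terms : termsPerBit) {
        total += terms;
    }
    int[] maxima = new int[(int) total];
    int pos = 0;
    for (int bit = 0; bit < termsPerBit.length; bit++) {
        int low = 1 << bit;                 // smallest value needing bit+1 bits (1, 2, 4, ...)
        int high = (1 << (bit + 1)) - 1;    // largest value needing bit+1 bits (1, 3, 7, ...)
        for (long i = 0; i < termsPerBit[bit]; i++) {
            maxima[pos++] = low + random.nextInt(high - low + 1);
        }
    }
    return maxima;
}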

N-plane is special as it can be configured in various ways. In the table below, N-plane(#4, 1/100) means that there are 4 planes and overflow-bits are cached for every 100 bits.

Median updates/ms of 640M counters with max bit 23, for different amounts of updates
Implementation MB 1M upd 5M upd 10M upd 50M upd 100M upd 500M upd
N-plane(#2, 1/1K) 718 20853 13340 13059 15752 12590 4000
N-plane(#4, 1/1K) 311 20887 13339 13058 15749 12554 3969
N-plane(#6, 1/1K) 275 13375 20129 19492 11187 9482 4451
N-plane(#2, 1/100) 745 20938 13548 13460 19146 17420 7674
N-plane(#4, 1/100) 341 20929 13526 13447 18874 17035 7787
N-plane(#6, 1/100) 306 20444 13317 13222 18948 17435 7212
N-plane(#2, 1/20) 865 14027 18900 18773 13261 12380 10512
N-plane(#4, 1/20) 473 13086 20533 20386 12388 11605 11554
N-plane(#6, 1/20) 440 13423 21050 20909 12769 12003 11936
Dual-plane 1221 18734 29952 29940 18797 18794 29911
PackedInts.COMPACT 1755 13518 15324 15346 13562 13565 15296
PackedInts.FAST 2442 17995 18955 19109 19123 19142 19233

Observations

  • Performance of N-plane gets worse with the number of updates, which is to be expected: As the counts get higher, it needs to visit more planes.
  • Dual-plane has suspicious values for 50M and 100M updates, which mirror suspicious values for COMPACT at the same amounts of updates. As COMPACT should have the same updates/ms performance independent of the number of updates, this indicates a testing problem.
  • PackedInts.FAST is a lot faster than PackedInts.COMPACT. This might be due to chance: The number of significant bits is 23, which aligns very badly with the 64 bit longs used by the COMPACT implementation. Had the number of significant bits been 21 or 24, things might have looked different.

Conclusion

The two new players Dual-plane and N-plane are both very interesting for high-cardinality faceting: Dual-plane as a direct replacement of PackedInts.COMPACT, N-plane as a very low-memory alternative when speed is not of the essence.

The drawback of Dual-plane is that it needs a histogram of maxima. Fortunately this is not costly: It is about the same work as is needed by PackedInts.COMPACT, which requires the overall maximum.

The drawback of N-plane is a lot higher, as it needs the full set of maxima for initialization. Performance-wise this is not much of a change from the PackedInts.COMPACT maximum calculation, but it does require temporary memory in the order of 4*term_count bytes, which in this case is 2.5GB.

Net archive indexing, round 2

March 9, 2015

Using our experience from our initial net archive search setup, Thomas Egense and I have been tweaking options and adding patches to the fine webarchive-discovery from UKWA for some weeks. We will be re-starting indexing Real Soon Now. So what have we learned?

  • Stored text takes up a huge part of the index: Nearly half of the total index size. The biggest sinner is not surprisingly the content field, but we need that for highlighting and potentially text extraction from search results. As we have discovered that we can avoid storing DocValued fields, at the price of increased document retrieval time, we have turned off storing for several fields.
  • DocValue everything! Or at least a lot more than we did initially. Enabling DocValues for a field and getting low-overhead faceting turned out to be a lot disk-space-cheaper than we thought. As every other feature request from the researchers seems to be “We would also like to facet on field X”, our new strategy should make them at least half happy.
  • DocValues are required for some fields. Due to internal limits on facet.method=fc without DocValues, it is simply not possible to do faceting if the number of references gets high.
  • Faceting on outgoing links is highly valuable. Being able to facet on links makes it possible to generate real-time graphs for interconnected websites. Links with host- or domain granularity are easily handled and there is no doubt that those should be enabled. Based on positive experimental results with document-granularity links faceting (see section below), we will also be enabling that.
  • The addition of performance instrumentation made it a lot easier for us to prioritize features. We simply do not have time for everything we can think of and some specific features were very heavy.
  • Face recognition (just finding the location of faces in images, not identifying the persons) was an interesting feature, but with a so-so success rate. Turning it on for all images would triple our indexing time and we have little need for sampling in this area, so we will not be doing it at all for this iteration.
  • Most prominent colour extraction was only somewhat heavy, but unfortunately the resulting colour turned out to vary a great deal depending on adjustment of extraction parameters. This might be useful if a top-X of prominent colours were extracted, but for now we have turned off this feature.
  • Language detection is valuable, but processing time is non-trivial and rises linearly with the number of languages to check. We lowered the number of detected languages from 20 to 10, pruning the more obscure (relative to Danish) languages.
  • Meta-data about harvesting turned out to be important for the researchers. We will be indexing the ID of the harvest-job used for collecting the data, the institution responsible and some specific sub-job-ID.
  • Disabling of image-analysis features and optimization of part of the code-base means faster indexing. Our previous speed was 7-8 days/shard, while the new one is 3-4 days/shard. As we have also doubled our indexing hardware capacity, we expect to do a full re-build of the existing index in 2 months and to catch up to the present within 6 months.
  • Our overall indexing workflow, with dedicated builders creating independent shards of a fixed size, worked very well for us. Besides some minor tweaks, we will not be changing this.
  • We have been happy with Solr 4.8. Solr 5 is just out, but as re-indexing is very costly for us, we do not feel comfortable with a switch at this time. We will do the conservative thing and stick to the old Solr 4-series, which currently means Solr 4.10.4.

Document-level links faceting

The biggest new feature will be document-level links. This is basically all links present on all web pages at full detail. For a single test shard with 217M documents / 906GB, there were 7 billion references to 640M unique links, the most popular link being used 2.4M times. Doing a full faceted search on *:* was understandably heavy at around 4 minutes, while ad hoc testing of “standard” searches resulted in response times varying from 50 ms to 3500 ms. Scaling up to 25 shards/machine, it will be 175 billion references to 16 billion values. It will be interesting to see the accumulated response time.

We expect this feature to be used to generate visual graphs of interconnected resources, which can be navigated in real-time. Or at least you-have-to-run-to-get-coffee-time. For the curious, here is the histogram for links in the test-shard:

References #terms
1 425,799,733
2 85,835,129
4 52,695,663
8 33,153,759
16 18,864,935
32 10,245,205
64 5,691,412
128 3,223,077
256 1,981,279
512 1,240,879
1,024 714,595
2,048 429,129
4,096 225,416
8,192 114,271
16,384 45,521
32,768 12,966
65,536 4,005
131,072 1,764
262,144 805
524,288 789
1,048,576 123
2,097,152 77
4,194,304 1

 

Long tail packed counters for faceting

February 26, 2015

Our home brew Sparse Faceting for Solr is all about counting: When calculating a traditional String facet such as

Author
- H.C.Andersen (120)
- Brothers Grimm (90)
- Arabian Nights (7)
- Carlo Collodi (5)
- Ronald Reagan (2)

the core principle is to have a counter for each potential term (author name in this example) and update that counter by 1 for each document with that author. There are different ways of handling such counters.

Level 0: int[]

Stock Solr uses an int[] to keep track of the term counts, meaning that each unique term takes up 32 bits or 4 bytes of memory for counting. Normally that is not an issue, but with 6 billion terms (divided between 25 shards) in our Net Archive Search, this means a whopping 24GB of memory for each concurrent search request.

Level 1: PackedInts

Sparse Faceting tries to be clever about this. An int can count from 0 to 2 billion, but if the maximum number of documents for any term is 3000, there will be a lot of wasted space. 2^12 = 4096, so in the case of maxCount=3000, we only need 12 bits/term to keep track of it all. Currently this is handled by using Lucene’s PackedInts to hold the counters. With the 6 billion terms, this means 9GB of memory. Quite an improvement on the 24GB from before.
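A minimal sketch of the idea, assuming Lucene 4.x and the illustrative numbers above (the class name is made up for the example):

import org.apache.lucene.util.packed.PackedInts;

// Sketch of PackedInts counters: with maxCount = 3000, each of the (here 200M)
// counters needs PackedInts.bitsRequired(3000) = 12 bits instead of the 32 bits of an int[].
public class PackedCounters {
    private final PackedInts.Mutable counters;

    public PackedCounters(int uniqueTerms, long maxCount) {
        counters = PackedInts.getMutable(
                uniqueTerms, PackedInts.bitsRequired(maxCount), PackedInts.COMPACT);
    }

    public void inc(int ordinal) {
        counters.set(ordinal, counters.get(ordinal) + 1);
    }
}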

Level 2: Long tail PackedInts with fallback

Packing counters has a problem: The size of all the counters is dictated by the maxCount. Just a single highly used term can nullify the gains: If all documents share one common term, the size of the individual counters will be log2(docCount) bits. With a few hundred million documents, that puts the size at 27-29 bits/term, very close to the int[] representation’s 32 bits.

Looking at the Author-example at the top of this page, it seems clear that the counts for the authors are not very equal: The top-2 author has counts a lot higher than the bottom-3. This is called a long tail and it is a very common pattern. This means that the overall maxCount for the terms is likely to be a lot higher than the maxCount for the vast majority of the terms.

While I was loudly lamenting all the wasted bits, Mads Villadsen came by and solved the problem: What if we keep track of the terms with high maxCount in one structure and the ones with a lower maxCount in another structure? Easy enough to do with a lot of processing overhead, but tricky to do efficiently. Fortunately Mads also solved that (my role as primary bit-fiddler is in serious jeopardy). The numbers in the following explanation are just illustrative and should not be seen as the final numbers.

The counter structure

We have 200M unique terms in a shard. The terms are long tail-distributed, with the most common ones having maxCount in the thousands and the vast majority with maxCount below 100.

We locate the top-128 terms and see that their maxCounts range from 2921 down to 87. We create an int[128] to keep track of their counts and call it head.

head
Bit          31 30 29 …  2  1  0
Term h_0      0  0  0 …  0  0  0
Term h_1      0  0  0 …  0  0  0
…
Term h_126    0  0  0 …  0  0  0
Term h_127    0  0  0 …  0  0  0

The maxCount for the terms below the 128 largest ones is 85. 2^7=128, so we need 7 bits to hold each of those. We allocate a PackedInt structure with 200M entries of 7+1 = 8 bits and call it tail.

tail
Bit                7* 6 5 4 3 2 1 0
Term 0              0 0 0 0 0 0 0 0
Term 1              0 0 0 0 0 0 0 0
…
Term 199,999,998    0 0 0 0 0 0 0 0
Term 199,999,999    0 0 0 0 0 0 0 0

The tail has an entry for all terms, including those in head. For each of the large terms in head, we locate its position in tail. At that tail-counter, we set the value to the term’s index in the head counter structure and set the highest bit to 1.

Let’s say that head entry term h_0 is located at position 1 in tail, h_126 is located at position 199,999,998 and h_127 is located at position 199,999,999. After marking the head entries in the tail structure, it would look like this:

tail with marked heads
Bit                7* 6 5 4 3 2 1 0
Term 0              0 0 0 0 0 0 0 0
Term 1              1 0 0 0 0 0 0 0
…
Term 199,999,998    1 1 1 1 1 1 1 0
Term 199,999,999    1 1 1 1 1 1 1 1

Hanging on so far? Good.

Incrementing the counters

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if (count & 128 == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
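In Java-like code the four steps could look like this (a sketch assuming Lucene 4.x PackedInts for tail and a plain int[] for head; the marking of the head entries is assumed to have been done up front as described above):

import org.apache.lucene.util.packed.PackedInts;

// Sketch of the increment steps: tail holds 7 count-bits plus a flag bit per term,
// head holds full counters for the 128 terms with the highest maxima.
public class DualPlaneCounters {
    private final PackedInts.Mutable tail; // 8 bits per term: 7 count bits + flag bit
    private final int[] head = new int[128];

    public DualPlaneCounters(int uniqueTerms) {
        tail = PackedInts.getMutable(uniqueTerms, 8, PackedInts.COMPACT);
    }

    public void inc(int ordinal) {
        long count = tail.get(ordinal);      // 1. read the tail entry
        if ((count & 128) == 128) {          // 2. flag bit set?
            head[(int) (count & 127)]++;     // 3. yes: the low 7 bits point into head
        } else {
            tail.set(ordinal, count + 1);    // 4. no: plain increment in tail
        }
    }
}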

Pros and cons

In this example, the counters take up 6 billion * 8 bits + 25 * 128 * 32 bits = 5.7GB. The performance overhead, compared to the PackedInts version, is tiny: Whenever a head bit is encountered, there will be an extra read to get the old head value before writing the value+1. As head will statistically be heavily used, it is likely to be in Level 2/3 cache.

This is just an example, but it should be quite realistic, as approximate values from the URL field in our Net Archive Index have been used. Nevertheless, it must be stressed that the memory gains from long tail PackedInts are highly dictated by the shape of the long tail curve.

Afterthought

It is possible to avoid the extra bit in tail by treating the large terms as any other term, until their tail-counters reaches maximum (127 in the example above). When a counter’s max has been reached, the head-counter can then be located using a lookup mechanism, such as a small HashMap or maybe just a linear scan through a short array with the ordinals and counts for the large terms. This would reduce the memory requirements to approximately 6 billion * 7 bits = 5.3GB. Whether this memory/speed trade-off is better or worse is hard to guess and depends on result set size.

Implementation afterthought

The long tail PackedInts could implement the PackedInts-interface itself, making it usable elsewhere. Its constructor could take another PackedInts filled with maxCounts or a histogram with maxbit requirements.

Heureka update 2015-02-27

There is no need to mark the head terms in the tail structure up front. All the entries in tail act as standard counters until the highest bit is set. At that moment the bits used for counting switch to being a pointer into the next available entry in the head counters. The update workflow thus becomes

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if (count & 128 == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
  5. If the incremented counter reaches bit 7, transfer the count to head and change the counter-bits to pointer-bits:
    if ((count+1) == 128) { head.set(headpos, count+1); tail.set(ordinal, 128 | headpos++); }

Pros and cons

The new logic means that initialization and resetting of the structure is simply a matter of filling it with 0. Update performance will be on par with the current PackedInts implementation for all counters whose value is within the cutoff. After that, the penalty of an extra read is paid, but only for the overflowing values.

The memory overhead is unchanged from the long tail PackedInts implementation and still suffers from the extra bit used for signalling count vs. pointer.

Real numbers 2015-02-28

The store-pointers-as-values approach has the limitation that there can only be as many head counters as the maxCount for tail. Running the numbers on the URL-field for three of the shards in our net archive index resulted in bad news: The tip of the long tail shape was not very pointy and it is only possible to shave 8% off the counter size. Far less than the estimated 30%. The Packed64 in the table below is the current structure used by sparse faceting.

Shard 1 URL: 228M unique terms, Packed64 size: 371MB
tail BPV required memory saved head size
11 342MB 29MB / 8% 106
12 371MB 0MB / 0% 6

However, we are in the process of experimenting with faceting on links, which has a markedly more pointy long tail shape. From a nearly fully built test shard we have:

Links in an 8/9 built shard: 519M unique terms, Packed64 size: 1427MB
tail BPV required memory saved head size
15 1038MB 389MB / 27% 14132
16 1103MB 324MB / 23% 5936
17 1168MB 260MB / 18% 3129
18 1233MB 195MB / 14% 1374
19 1298MB 130MB / 9% 909
20 1362MB 65MB / 5% 369
21 1427MB 0MB / 0% 58

For links, the space saving was 27% or 389MB for the nearly-finished shard. To zoom out a bit: Doing faceting on links for our full corpus with stock Solr would take 50GB. Standard sparse faceting would use 35GB and long tail would need 25GB.

Due to sparse faceting, response time for small result sets is expected to be a few seconds for the links-facet. Larger result sets, not to mention the dreaded *:* query, would take several minutes, with worst-case (qualified guess) around 10 minutes.

Three-level long tail 2015-02-28

Previously:

  • pointer-bit: Letting the values in tail switch to pointers when they reach maximum has the benefit of very little performance overhead, with the downside of taking up an extra bit and limiting the size of head.
  • lookup-signal: Letting the values in tail signal “find the counter in head” when they reach maximum, has the downside that a sparse lookup-mechanism, such as a HashMap, is needed for head, making lookups comparatively slow.

New idea: Mix the two techniques. Use the pointer-bit principle until there is no more room in head. head-counters beyond that point all get the same pointer (all value bits set) in tail and their position in head is determined by a sparse lookup-mechanism ord2head.

This means that

  • All low-value counters will be in tail (very fast).
  • The most used high-value counters will be in head and will be referenced directly from tail (fast).
  • The less used high-value counters will be in head and will require a sparse lookup in ord2head (slow).

Extending the pseudo-code from before:

value = tail.get(ordinal)
if (value == 255) { // indirect pointer signal
  head.inc(ord2head.get(ordinal))
} else if ((value & 128) == 128) { // pointer-bit set
  head.inc(value & 127)
} else { // term-count = value
  value++;
  if (value != 128) { // tail-value ok
    tail.set(ordinal, value)
  } else { // tail-value overflow
    head.set(headpos, value)
    if (headpos < 127) { // direct pointer
      tail.set(ordinal, 128 | headpos++)
    } else { // indirect pointer
      tail.set(ordinal, 255)
      ord2head.put(ordinal, headpos++)
    }
  }
}

Exhaustive ID filters in Solr

February 4, 2015

Analysis of the material in our Net Archive index is challenging for many reasons. One of them is agreeing on what we’re looking at: It is rarely the full amount of data; our researchers want to define subsets. Subsets can be sliced in different ways: At (W)ARC level, as manual selection of specific URLs harvested at specific times, as a graph of links etc. They can also be specified with Solr. Using Solr, there are two common and technically distinct ways of specifying subsets:

  • Subset definition by simple boolean expressions: “All material from the domain example.com harvested in 2011” or “All PDF files that mention ‘disneyland after dark’ and have a link to domain:d-a-d.dk”.
    Such subsets are extremely easy to work with in Solr, as they are just filters: They are very fast and work fully with faceting, sorting and grouping.
  • Subset definition by weighting: “One instance of all URLs, de-duplicated by taking the one as close to June 1, 2011 as possible”.
    For some functionality, such as extracting simple top-X results, this can be handled by grouping in Solr. Unfortunately this scales poorly with faceting, groups within groups is not a Solr feature and some functionality, such as counting the number of distinct groups (the result set size) is not possible to do reliably with SolrCloud.

So what can we do about the pesky subset-by-weighting? The idea is to add support for exhaustive ID filters, where exhaustive means potentially all documents at single-document granularity level. With such a filter we could define arbitrary subsets on an existing index, not even constrained by Solr’s own powers of expression. It looks a bit daunting with billions of documents, but given some constraints it seems doable.

Creating the document list

The first problem is to determine which documents should be in the filter. If the documents are defined outside of Solr, this could just be a list of [URL, timestamp] or similar parameters uniquely identifying the documents. If the documents are to be aggregated from the SolrCloud, some manual work might be needed.

Let’s see how to do it for “One instance of all URLs, de-duplicated by taking the one as close to June 1, 2011 as possible”. In a single-shard setup it might be possible to do it with grouping on URL (1 entry/group) and sorting by distance. As stated before, this scales badly or not at all with SolrCloud and things like faceting or paging through result sets. But the principle itself is simple:

  1. For each shard, extract all the documents as tuples of [URL, date, ID], sorted by URL. This is very easily done with deep paging. If further restrictions, such as “only documents from 2010-2012”, are needed, this is done by applying a filter before extraction.
  2. Process the resulting lists individually and extend the tuples with shard-ID to [URL, date, ID, shardID].
  3. Merge-sort the list of tuples, primarily by URL, secondarily by closeness to 2011-06-01.
  4. Prune the aggregated list by only keeping the first entry for each URL.
  5. Split the aggregated list into shard-specific lists, using the shard-ID part of the tuples, keeping only the IDs.

The astute reader will have noticed that steps 1-5 can be performed in a streaming manner: The only limit to the size of the resulting ID-lists is storage space.
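As a sketch of step 1, Solr's cursorMark deep paging (available from Solr 4.7) can stream the tuples out of a shard through SolrJ. The collection URL, field names and the year filter below are illustrative, not our actual schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

// Sketch of step 1: stream [URL, date, ID] tuples out of a single shard,
// sorted by URL, using cursorMark deep paging.
public class TupleExporter {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/shard1");
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("crawl_year:[2010 TO 2012]");   // optional restriction
        query.setFields("url", "crawl_date", "id");
        query.setRows(10000);
        query.setSort("url", SolrQuery.ORDER.asc);
        query.addSort("id", SolrQuery.ORDER.asc);            // uniqueKey tie-breaker, required by cursorMark
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url") + "\t"
                        + doc.getFieldValue("crawl_date") + "\t" + doc.getFieldValue("id"));
            }
            String nextCursor = response.getNextCursorMark();
            if (cursor.equals(nextCursor)) {
                break;                                        // no more documents
            }
            cursor = nextCursor;
        }
    }
}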

Named filters

At this point we have a list of entries uniquely defining documents, for each shard. We need a Solr component that can

  1. Take a user-assigned filter-ID and a URL to the list of document-definitions.
  2. Create an empty docset/bitmap, the same kind a standard filter uses.
  3. For each entry in the list of document-definitions, resolve the internal docID for the document and set that bit in the bitmap.
  4. Store the resulting bitmap for later usage, just like a standard filter, with the user-assigned filter-ID as key.
  5. When the user issues queries specifying filter-IDs, the docID bitmaps are fetched and applied as standard Solr filters, with the same minimal processing overhead.
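A sketch of steps 2-3 for the ID-based variant, assuming Lucene 4.x and an indexed, single-valued id field (the real component would of course live inside Solr and hook into its filter handling):

import java.util.List;

import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

// Sketch of steps 2-3: resolve each ID to its internal docID(s) and set the
// corresponding bits. The resulting FixedBitSet is what would be stored under
// the user-assigned filter-ID in step 4.
public class NamedFilterBuilder {
    public static FixedBitSet buildFilter(IndexReader reader, List<String> ids) throws Exception {
        FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        Terms terms = MultiFields.getTerms(reader, "id"); // assumes "id" is the uniqueKey field
        TermsEnum termsEnum = terms.iterator(null);
        DocsEnum docs = null;
        for (String id : ids) {
            if (!termsEnum.seekExact(new BytesRef(id))) {
                continue; // unknown ID, skip
            }
            docs = termsEnum.docs(MultiFields.getLiveDocs(reader), docs, DocsEnum.FLAG_NONE);
            int doc;
            while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                bits.set(doc);
            }
        }
        return bits;
    }
}

If the ID list is pre-sorted, the same loop doubles as the sequential traversal of the ID-field structure mentioned in the optimization notes below.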

There is a lot of potential for optimization here:

  • It would be possible for each line in the list of document-definitions to be viewed as a full-blown Solr query, seeing the whole list as a giant boolean OR query. Lines could be processed in batches by enclosing each line in the batch in parentheses and interleaving with OR, before processing it as a plain query.
  • Treating each line as a query allows the filter to be specified any way the user wants: While the granularity of the list goes down to document level, it also goes up to the full index.
  • It would be possible to see each line as a term that must match a single field (normally the ID field). Again requests could be batched, this time using the very efficient TermsQuery.
  • For ID-based lists, pre-sorting the list would open up for very efficient processing: The Solr query mechanism could be bypassed and the internal structure for the ID-field could be traversed sequentially.

Once again this can be streamed (except for the sorting idea), with no limit on the size of the list of document-definitions. Theoretically this could be tied to the generation of the corpus, avoiding temporary storage of the lists. But that would be a later project.

What we get

  • The ability for researchers to specify shared subsets of any size at document granularity, without changing the indexes.
  • Besides relevance ranking, using such filters would be equivalent to building a whole new index from the defined subset.
  • High initial cost of defining and creating filters, but extremely low subsequent overhead of using them.

Caveats

Creating the document lists would likely be very heavy, but as the premise is to create subsets of the corpus, this would normally be done once per subset defined by the researchers.

Creating the filter itself would likely be less heavy, but the processing of hundreds of millions of tiny queries, even with batching, is measured in minutes or hours. Due to the nature of filters, this step must be repeated if the index changes, so exhaustive filters would only be feasible with a rarely-updated index.

Some of Solr’s built-in functionality, such as the distance to a given date for the example case, would have to be duplicated in the external merge-sorter.

Prior work

Constraining searches by large lists of IDs is not a new idea.

  • Defining the IDs at query time is not feasible at net archive scale; even with bit packing a request could potentially be measured in gigabytes.
  • Treating the task as a security filter problem holds promises, at least for the filtering part.
  • Adding a token to the documents in a subset would work, but it makes experimentation very heavy and (of course) requires the indexes to be updated.
  • The Terms Filter Lookup in Elasticsearch seems quite like the persistent filter described above. As does SOLR-1715, which unfortunately is talk-only at this point.
  • Update 2015-02-05: The example of generating the document list seems to fit extremely well with Heliosearch’s Streaming Aggregation, which will hopefully be part of Solr with SOLR-7082.

British Library and IIPCTech15

February 2, 2015

For a change of pace: A not too technical tale of my recent visit to England.

The people behind IIPC Technical Training Workshop – London 2015 had invited yours truly as a speaker and participant in the technical training. IIPC stands for International Internet Preservation Consortium and I was to talk about using Solr for indexing and searching preserved Internet resources. That sounded interesting and Statsbiblioteket encourages interinstitutional collaboration, so the invitation was gladly accepted. Some time passed and British Library asked if I might consider arriving a few days early and visit their IT development department? Well played, BL, well played.

I kid. For those not in the know, British Library made the core software we use for our Net Archive indexing project and we are very thankful for that. Unfortunately they do have some performance problems. Spending a few days, primarily talking about how to get their setup to work better, was just reciprocal altruism working. Besides, it turned out to be a learning experience for both sides.

It is the little things, like the large buses

At British Library, Boston Spa

The current net archive oriented Solr setups at British Library are using SolrCloud with live indexes on machines with spinning drives (aka harddisks) and a – relative to index size – low amount of RAM. At Statsbiblioteket, our experience tells us that such setups generally have very poor performance. Gil Hoggarth and I discussed Solr performance at length and he was tenacious in exploring every option available. Andy Jackson partook in most of the debates. Log file inspections and previous measurements from the Statsbiblioteket setups seemed to sway them in favour of different base hardware, or to be specific: Solid State Drives. The open question is how much such a switch would help, or whether it would be a better investment to increase the amount of free memory for caching.

  • A comparative analysis of performance with spinning drives vs. SSDs for multi-TB Solr indexes on machines with low memory would help other institutions tremendously, when planning and designing indexing solutions for net archives.
  • A comparative analysis of performance with different amounts of free memory for caching, as a fraction of index size, for both spinning drives and SSDs, would be beneficial on a broader level; this would give an idea of how to optimize bang-for-the-buck.

Illuminate the road ahead

Logistically the indexes at British Library are quite different from the index at Statsbiblioteket: They follow the standard Solr recommendation and treat all shards as a single index, both for indexing and search. At Statsbiblioteket, shards are built separately and only treated as a whole index at search time. The live indexes at British Library have some downsides, namely re-indexing challenges, distributed indexing logistics overhead and higher hardware requirements. They also have positive features, primarily homogeneous shards and the ability to update individual documents. The updating of individual documents is very useful for tracking meta-data for resources that are harvested at different times, but have unchanged content. Tracking of such content, also called duplicate handling, is a problem we have not yet considered in depth at Statsbiblioteket. One of the challenges of switching to static indexes is thus:

  • When a resource is harvested multiple times without the content changing, it should be indexed in such a way that all retrieval dates can be extracted and such that the latest (and/or the earliest?) harvest date can be used for sorting, grouping and/or faceting.

One discussed solution is to add a document for each harvest date and use Solr’s grouping and faceting features to deliver the required results. The details are a bit fluffy as the requirements are not strictly defined.

At the IIPC Technical Training Workshop, London 2015

The three pillars of the workshop were harvesting, presentation and discovery, with the prevalent tools being Heritrix, Wayback and Solr. I am a newbie in two thirds of this world, so my outsider thoughts will focus on discovery. Day one was filled with presentations, with my Scaling Net Archive Indexing and Search as the last one. Days two and three were hands-on with a lot of discussions.

As opposed to the web archive specific tools Heritrix and Wayback, Solr is a general purpose search engine: There is not yet a firmly established way of using Solr to index and search net archive material, although the work from UKWA is a very promising candidate. Judging by the questions asked at the workshop, large scale full-text search is relatively new in the net archive world and as such the community lacks collective experience.

Two large problems of indexing net archive material are analysis and scaling. As stated, UKWA has the analysis part well in hand. Scaling is another matter: Net archives typically contain billions of documents, many of them with a non-trivial amount of indexable data (webpages, PDFs, DOCs etc). Search responses ideally involve grouping or faceting, which requires markedly more resources than simple search. Fortunately, at least from a resource viewpoint, most countries do not allow harvested material to be made available to the general public: The number of users and thus concurrent requests tend to be very low.

General recommendations for performant Solr systems tend to be geared towards small indexes or high throughput, minimizing the latency and maximizing the number of requests that can be processed by each instance. Down to Earth, the bottleneck tends to be random reads from the underlying storage, easily remedied by adding copious amounts of RAM for caching. While the advice arguably scales to net archive indexes in the multi-TB range, the cost of terabytes of RAM, as well as the number of machines needed to hold them, is often prohibitive. Bearing in mind that the typical user group on a net archive consists of very few people, the part about maximizing the number of supported requests is overkill. With net archives as outliers in the Solr world, there is very little existing shared experience to provide general recommendations.

  • As hardware cost is a large fraction of the overall cost of doing net archive search, in-depth descriptions of setups are very valuable to the community.

All different, yet the same

Measurements from British Library as well as Statsbiblioteket show that faceting on high cardinality fields is a resource hog when using SolrCloud. This is problematic for exploratory use of the index. While it can be mitigated with more hardware or software optimization, switching to heuristic counting holds promises of very large speed ups.

  • The performance benefits and the cost in precision of approximate search results should be investigated further. This area is not well-explored in Solr and mostly relies on custom implementations.

On the flipside of fast exploratory access is the extraction of large result sets for further analysis. SolrCloud does not scale for certain operations, such as deep paging within facets and counting of unique groups. Other operations, such as percentiles in the AnalyticsComponent, are not currently possible at all. As the alternative to using the index tends to be very heavy Hadoop processing of the raw corpus, this is an area worth investing in.

  • The limits of result set extractions should be expanded and alternative strategies, such as heuristic approximation and per-shard processing with external aggregation, should be attempted.

On a personal note

Visiting British Library and attending the IIPC workshop was a blast. Being embedded in tech talk with intelligent people for 5 days was exhausting and very fulfilling. Thank you all for the hospitality and for pushing back when my claims sounded outrageous.

Finding the bottleneck

January 6, 2015

While our Net Archive search performs satisfactorily, we would like to determine how well-balanced the machine is. To recap, it has 16 cores (32 with Hyper Threading), 256GB RAM and 25 Solr shards @ 900GB. When running, it uses about 150GB for Solr itself, leaving 100GB of memory for disk cache.

Test searches are for 1-3 random words from a Danish dictionary, with faceting on 1 very large (billions of unique values, billions of references), 2 large fields (millions of unique values, billions of references) and 3 smaller fields (thousands of unique values, billions of references). Unless otherwise noted, searches were issued one request at a time.

Scaling cores

Under Linux it is quite easy to control which cores a process utilizes, by using the command taskset. We used that to scale the number of available cores by doing the following for different core counts:

  1. Shut down all Solr instances
  2. Clear the disk cache
  3. Start up all Solr instances, limited to specific cores
  4. Run the standard performance test

In the chart below, ht means that there is the stated number of cores + their Hyper Threaded counterpart. In other words, 8ht means 8 physical cores but 16 virtual ones.

Performance with different number of cores

Scaling number of cores and use of Hyper Threading

Observations:

  • Hyper Threading does provide a substantial speed boost.
  • The differences between 8ht cores and 16 or 16ht cores are not very big.

Conclusion: For standard single searches, which is the design scenario, 16 cores seems to be overkill. More complex queries would likely raise the need for CPU though.

Scaling shards

Changing the number of shards on the SolrCloud setup was simulated by restricting queries to run on specific shards, using the argument shards. This was not the best test as it measured the combined effect of the shard-limitation and the percentage of the index held in disk cache; e.g. limiting the query to shard 1 & 2 meant that about 50GB of memory would be used for disk cache per shard, while limiting to shard 1, 2, 3, & 4 meant only 25GB of disk cache per shard.

Scaling with shard count

Scaling number of active shards

Note: These tests were done on performance degraded drives, so the actual response times are too high. The relative difference should be representative enough.

Observations:

  • Performance for 1-8 shards is remarkably similar.
  • Going from 8 to 16 shards is 100% more data at half performance.
  • Going from 16 to 24 shards is only 50% more data, but also halves performance.

Conclusion: Raising the number of shards further on an otherwise unchanged machine would likely degrade performance fast. A new machine seems like the best way to increase capacity, the less guaranteed alternative being more RAM.

Scaling disk cache

A Java program was used to reserve part of the free memory, by allocating a given amount of memory as long[] and randomly changing the content. This effectively controlled the amount of memory available for disk cache for the Solr instances. The Solr instances were restarted and the disk cache cleared between each test.
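The program itself is trivial; a minimal sketch of the idea (the block size and touch pattern are arbitrary choices, not necessarily the exact program used):

import java.util.Random;

// Sketch: allocate the requested number of gigabytes as long[] blocks and keep
// touching random positions, so the pages stay resident and correspondingly
// less memory is left for the operating system's disk cache.
public class MemoryEater {
    public static void main(String[] args) throws InterruptedException {
        int gigabytes = Integer.parseInt(args[0]);       // run with -Xmx a bit above this amount
        final int LONGS_PER_GB = 1024 * 1024 * 1024 / 8;
        long[][] blocks = new long[gigabytes][];
        for (int i = 0; i < gigabytes; i++) {
            blocks[i] = new long[LONGS_PER_GB];
        }
        Random random = new Random();
        while (true) {
            for (int i = 0; i < 100_000; i++) {          // touch a burst of random positions
                long[] block = blocks[random.nextInt(gigabytes)];
                block[random.nextInt(block.length)] = random.nextLong();
            }
            Thread.sleep(100);                           // keep the CPU impact on the test machine low
        }
    }
}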

Scaling memory

Scaling free memory for disk cache

Observations:

  • A maximum running time of 10 minutes was far too little for this test, leaving very few measuring points for 54GB, 27GB and 7GB disk cache.
  • Performance degrades exponentially when the amount of disk cache drops below 100GB.

Conclusion: While 110GB (0.51% of the index size) memory for disk cache delivers performance well within our requirements, it seems that we cannot use much of the free memory for other purposes. It would be interesting to see how much performance would increase with even more free memory, for example by temporarily reducing the number of shards.

Scaling concurrent requests

Due to limited access, we only need acceptable performance for one search at a time. Due to the high cardinality (~6 billion unique values) URL-field, the memory requirements for a facet call are approximately 10GB, severely limiting the maximum number of concurrent requests. Nevertheless, it is interesting to see how much performance changes when the number of concurrent requests rises. To avoid reducing the disk cache, we only tested with 1 and 2 concurrent requests.

Scaling concurrent requests

Scaling concurrent request with heavy faceting

Observations (over multiple runs; only one run in shown in the graph):

  • For searches with small to medium result sets (aka “normal” searches), performance for 2 concurrent requests was nearly twice as bad as for 1 request.
  • For searches with large result sets, performance for 2 concurrent requests was more than twice as bad as for 1 request. This is surprising, as a slightly better than linear performance drop was expected.

Conclusion: Further tests seem to be in order due to the surprisingly bad scaling. One possible explanation is that memory speed is the bottleneck. Limiting the number of concurrent requests to 1 or 2 and queuing further requests is the best option for maintaining a stable system, due to the memory overhead of faceting.

Scaling requirements

Admittedly, the whole facet-on-URL-thing might not be essential for the user experience. If we avoid faceting on that field and only facet on the more sane fields, such as host, domain and 3 smaller fields, we can turn up the number of concurrent requests without negative impact on disk cache.

Scaling concurrency

Scaling concurrent requests with moderate faceting

Observations:

  • Mean performance with 1 concurrent request is 10 times better, compared to the full faceting scenario.
  • From 1-4 threads, latency increases (bad) and throughput improves (good).
  • From 4-32 threads, latency increases (bad) but throughput does not improve (double bad).

Conclusion: As throughput does not improve for more than 4 concurrent threads, using a limit of 4 seems beneficial. However, as we are planning to add faceting on links_domains and links_hosts, as well as grouping on URL, the measured performance is not fully representative of future use of search in the Net Archive.

Samsung 840 EVO degradation

December 28, 2014

Rumour has it that our 25 lovely 1TB Samsung 840 EVO drives in our Net Archive search machine do not perform well when data are left untouched for months. Rumour in this case being solid confirmation with a firmware fix from Samsung. In our setup, index shards are written once, then left untouched for months or even years. Exactly the circumstances that trigger the performance degradation.

Measurements, please!

Our 25 shards were built over half a year, giving us a unique opportunity to measure drives in different states of decay. The first experiment was very simple: Just read all the data from the drive sequentially by issuing cat index/* > /dev/null and plot the measured time spent, with the age of the files on the x axis. That shows the impact on bulk read speed. The second experiment was to issue Solr searches to each shard in isolation, testing search speed one drive at a time. That shows the impact on small random reads.

Samsung 840 EVO performance degradation

For search as well as bulk read, lower numbers are better. The raw numbers for 7 drives follow:

Months 25% search median search 75% search 95% search mean search Bulk read hours
0.9 36 50 69 269 112 1.11
2.0 99 144 196 486 203 3.51
3.0 133 198 269 590 281 6.06
4.0 141 234 324 670 330 8.70
5.1 133 183 244 520 295 5.85
6.0 106 158 211 433 204 5.23
7.0 105 227 333 703 338 10.49

Inspecting the graph, it seems that search performance quickly gets worse until the data are about 4 months old. After that it stabilizes. Bulk reading on the other hand continue to worsen during all 7 months, but that has little relevance for search.

The Net Archive search uses SolrCloud for querying the 25 shards simultaneously and merging the result. We only had 24 shards at the previous measurement 6 weeks ago, but the results should still be comparable. Keep in mind that our goal is to have median response times below 2 seconds for all standard searches; searches matching the full corpus and similar are allowed to take longer.

Performance for full search

The distinct hill is present both for the old and the new measurements: See Even sparse faceting is limited for details. But the hill has grown for the latest measurements; response times have nearly doubled for the slowest searches. How come it got that much worse during just 6 weeks?

Theory: In a distributed setup, the speed is dictated by the slowest shard. As the data gets older on the un-patched Samsung drives, the chances of having slow reads rise. Although the median response time for search on a shard with 3 month old data is about the same as for one with 7 month old data, the chance of a very poorly performing search rises. As the whole collection of drives got 6 weeks older, the chances of not having poor performance from at least one of the drives during a SolrCloud search fell.

Note how our overall median response time actually got better with the latest measurement, although the mean (average) got markedly worse. This is due to the random distribution of result set sizes. The chart paints a much clearer picture.

Well, fix it then!

The good news is that there is a fix from Samsung. The bad news is that we cannot upgrade the drives using the controller on the server. Someone has to go through the process of removing them from the machine and perform the necessary steps on a workstation. We plan on doing this in January and besides the hassle and the downtime, we foresee no problems with it.

However, as the drive bug is for old data, a rewrite of all the 25*900GB files should freshen the charges and temporarily bring them back to speed. Mads Villadsen suggested using dd if=somefile of=somefile conv=notrunc, so let’s try that. For science!

*crunching*

It took nearly 11 hours to process drive 1, which had the oldest data. That fits well with the old measurement of bulk speed for that drive, which was 10½ hours for 900GB. After that, bulk speed increased to 1 hour for 900GB. Reviving the 24 other drives was done in parallel with a mean speed of 17MB/s, presumably limited by the controller. Bulk read speeds for the revived drives were 1 hour for 900GB, except for drive 3 which took 1 hour and 17 minutes. Let’s file that under pixies.

Repeating the individual shard search performance test from before, we get the following results:

Samsung 840 EVO performance reviving
Note that the x-axis is now drive number instead of data age. As can be seen, the drives are remarkably similar in performance. Comparing to the old test, they are at the same speed as the drive with 1 month old data, indicating that the degradation sets in after more than 1 month and not immediately. The raw numbers for the same 7 drives as listed in the first table are:

Months 25% search median search 75% search 95% search mean search Bulk read hours
0.3 34 52 69 329 106 1.00
0.3 41 55 69 256 104 1.00
0.3 50 63 85 330 131 1.00
0.3 39 58 77 301 108 1.00
0.3 40 57 74 314 106 1.01
0.3 37 50 66 254 96 1.01
0.3 24 33 51 344 98 1.00

Running the full distributed search test and plotting the results together with the 1½ month old measurements as well as the latest measurements with the degraded drives gives us the following.

Samsung 840 EVO performance reviving

Performance is back to the same level as 1½ months ago, but how come it is not better than that? A quick inspection of the machine revealed that 2 backup jobs had started and were running during the last test; it is unknown how heavy their impact is on the drives, so the test will be re-run when the backups have finished.

Conclusion

The performance degradation of non-upgraded Samsung 840 EVO drives is very real and the degradation is serious after a couple of months. Should you own such drives, it is highly advisable to apply the fixes from Samsung.

Changing field type in Lucene/Solr

December 15, 2014

The problem

We have 25 shards of 900GB / 250M documents. It took us 25 * 8 days = half a year to build them. Three fields did not have DocValues enabled when we build the shards:

  • crawl_date (TrieDateField): Unknown number of unique values, 256M values.
  • links_domains (multi value Strings): 3M unique values, 675M references.
  • links_hosts (multi value Strings): 6M unique values, 841M references.

We need DocValues on those fields for faceting. Not just because of speed and memory, but because Solr is technically unable to do the faceting without it, at least on the links_domains & links_hosts fields: The internal structures for field cache faceting do not allow for the number of references we have in our index.

The attempted solution

Faced with the daunting task of re-indexing all shards, Hoss at Stump the Chump got the challenge of avoiding doing so. He suggested building a custom Lucene FilterReader with on-the-fly conversion, then using that to perform a full index conversion. Heureka, DVEnabler was born.

DVEnabler takes an index and a list of which fields to adjust, then writes a corrected index. It is still very much Here Be Dragons and requires the user to be explicit about how the conversion should be performed. Sadly the Lucene index format does not contain the required information for a more automatic conversion (see SOLR-6005 for a status on that). Nevertheless it seems to have reached its first usable incarnation.

We tried converting one of our shards with DVEnabler. The good news is that it seemed to work: Our fields were converted to DocValues, we could perform efficient faceting and casual inspection indicated they had the right values. Proper test pending. The bad news is that the conversion took 2 days! For comparison, a non-converting plain optimize took just 8 hours.

Performance breakdown

Our initial shard building is extremely CPU-heavy: 8 days with 24 cores running 40 Tika-processes at 90%+ CPU utilization. The 8 real time days is 192 CPU core days. Solr merge/optimize is single-threaded, so the conversion to DocValues takes 2 CPU core days, or just 1/100 of the CPU resources needed for full indexing.

At the current time it is not realistic to make the conversion multi-threaded, to take advantage of the 24 cores. But it does mean that we can either perform multiple conversions in parallel or use the machine for building new shards while converting the old ones. Due to limited local storage, we can run 2 conversions in parallel, while moving unconverted & converted indexes to and from the machine. This gives us an effective conversion speed of 1 shard / 1 day.

SB IT Preservation at ApacheCon Europe 2014 in Budapest

November 19, 2014

Ok, actually only two of us are here. It would be great to have the whole department at the conference, then we could cover more tracks and start discussing what we will be using next week ;-)


The first keynote was mostly an introduction to The Apache Software Foundation along with some key numbers. The second keynote (in direct extension of the first) was an interview with best selling author Hugh Howey, who self-published ‘Wool’ in 2011. A very inspiring interview! Maybe I could be an author too – with a little help from you? One of the things he talked about was how he thinks

“… the future looks more and more like the past”

in the sense that storytelling in the past was collaborative storytelling around the camp fire. Today open source software projects are collaborative, and maybe authors should try it too? Hugh Howey’s book has grown with help from fans and fan fiction.

The coffee breaks and lunches have been great! And the cake has been plentiful!

Cake

Time to celebrate the Apache Software Foundation’s 15th birthday!

More cake!

Did someone say that Hungary was known for cakes?

And yes, there has also been lots and lots of interesting presentations of lots and lots of interesting Apache tools. Where to start? There is one that I want to start using on Monday: Apache Tez. The presentation was by Hitesh Shah from Hortonworks and the slides are available online.

There are quite a few that I want to look into a bit more and experiment with, such as Spark and Cascading, and I think my colleague can add a few more. There are some that we will tell our colleagues at home about, and hope that they have time to experiment… And now I’ll go and hear about Quadrupling your Elephants!

Note: most of the slides are online. Just look at http://events.linuxfoundation.org/events/apachecon-europe/program/slides.

