Changing field type in Lucene/Solr

December 15, 2014

The problem

We have 25 shards of 900GB / 250M documents each. It took us 25 * 8 days = 200 days, roughly half a year, to build them. Three fields did not have DocValues enabled when we built the shards:

  • crawl_date (TrieDateField): Unknown number of unique values, 256M values.
  • links_domains (multi value Strings): 3M unique values, 675M references.
  • links_hosts (multi value Strings): 6M unique values, 841M references.

We need DocValues on those fields for faceting. Not just because of speed and memory, but because Solr is technically unable to do faceting without them, at least on the links_domains & links_hosts fields: the internal structures for field cache faceting do not allow for the number of references we have in our index.

The attempted solution

Faced with the daunting task of re-indexing all shards, we gave Hoss the challenge of avoiding it at Stump the Chump. He suggested building a custom Lucene FilterReader with on-the-fly conversion, then using that to perform a full index conversion. Eureka: DVEnabler was born.

DVEnabler takes an index and a list of which fields to adjust, then writes a corrected index. It is still very much Here Be Dragons territory and requires the user to be explicit about how the conversion should be performed. Sadly the Lucene index format does not contain the required information for a more automatic conversion (see SOLR-6005 for a status on that). Nevertheless it seems to have reached its first usable incarnation.

We tried converting one of our shards with DVEnabler. The good news is that it seemed to work: our fields were converted to DocValues, we could perform efficient faceting, and casual inspection indicated they had the right values. A proper test is still pending. The bad news is that the conversion took 2 days! For comparison, a non-converting plain optimize took just 8 hours.

Performance breakdown

Our initial shard building is extremely CPU-heavy: 8 days with 24 cores running 40 Tika-processes at 90%+ CPU utilization. Those 8 days of wall-clock time amount to 192 CPU core days. Solr merge/optimize is single-threaded, so the conversion to DocValues takes 2 CPU core days, or just about 1/100 of the CPU resources needed for full indexing.

At the current time it is not realistic to make the conversion multi-threaded to take advantage of the 24 cores. But it does mean that we can either perform multiple conversions in parallel or use the machine for building new shards while converting the old ones. Due to limited local storage, we can run 2 conversions in parallel, while moving unconverted & converted indexes to and from the machine. This gives us an effective conversion speed of 1 shard / 1 day.

SB IT Preservation at ApacheCon Europe 2014 in Budapest

November 19, 2014

Ok, actually only two of us are here. It would be great to have the whole department at the conference; then we could cover more tracks and start discussing what we will be using next week ;-)


The first keynote was mostly an introduction to The Apache Software Foundation along with some key numbers. The second keynote (in direct extension of the first) was an interview with best-selling author Hugh Howey, who self-published ‘Wool’ in 2011. A very inspiring interview! Maybe I could be an author too – with a little help from you? One of the things he talked about was how he thinks

“… the future looks more and more like the past”

in the sense that storytelling in the past was collaborative storytelling around the camp fire. Today open source software projects are collaborative, and maybe authors should try it too? Hugh Howey’s book has grown with help from fans and fan fiction.

The coffee breaks and lunches have been great! And the cake has been plentiful!

Cake

So the Apache Software Foundation's 15th birthday is getting celebrated after all!

More cake!

Did someone say that Hungary was known for cakes?

And yes, there have also been lots and lots of interesting presentations of lots and lots of interesting Apache tools. Where to start? There is one that I want to start using on Monday: Apache Tez. The presentation was by Hitesh Shah from Hortonworks and the slides are available online.

There are quite a few that I want to look into a bit more and experiment with, such as Spark and Cascading, and I think my colleague can add a few more. There are some that we will tell our colleagues at home about, and hope that they have time to experiment… And now I’ll go and hear about Quadrupling your Elephants!

Note: most of the slides are online. Just look at http://events.linuxfoundation.org/events/apachecon-europe/program/slides.

Sudden Solr performance drop

October 31, 2014

There we were, minding other assignments and keeping a quarter of an eye on our web archive indexer and searcher. The indexer finished its 900GB Solr shard number 22 and the searcher was updated to a total of 19TB / 6 billion documents. With a bit more than 100GB free for disk cache (or about 1/2 percent of total index size), things were relatively unchanged, compared to ~120GB free a few shards ago. We expected no problems. As part of the index update, an empty Solr was created as entry-point, with a maximum of 3 concurrent connections, to guard against excessive memory use.

But something was off. User-issued searches seemed slower. Quite a lot slower for some of them. Time for a routine performance test and comparison with old measurements.

256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index

As the graph shows very clearly, response times rose sharply with the number of hits in a search in our 19TB index. At first glance that seems natural, but as the article Ten times faster explains, this should be a bell curve, not an ever-rising hill. The bell curve can be seen for the old 12TB index. Besides, those new response times were horrible.

Investigating the logs showed that most of the time was spent resolving facet terms for fine-counting. There are hundreds of those for the larger searches and the log said it took 70ms for each, neatly explaining total response times of 10 or 20 seconds. Again, this would not have been surprising if we were not used to much better numbers. See Even sparse faceting is limited for details.

A Systems guy turned off swap, then cleared the disk cache, as disk cache clearing has helped us before in similar puzzling situations. That did not help this time: Even non-faceted searches had outliers above 10 seconds, which is 10 times worse than with the 12TB index.

Due to unrelated circumstances, we then raised the number of concurrent connections for the entry-point-Solr from 3 to 50 and restarted all Solr instances.

256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index, post Solr-restart

Welcome back great performance! You were sorely missed. The spread as well as the average for the 19TB index is larger than for its 12TB counterpart, but not drastically so.

So what went wrong?

  • Did the limiting of concurrent searches at the entry-Solr introduce a half-deadlock? That seems unlikely as the low-level logs showed the unusually high 70ms/term lookup time, which is measured without contact to other Solrs.
  • Did the Solr-restart clean up OS-memory somehow, leading to better overall memory performance and/or disk caching?
  • Were the Solrs somehow locked in a state with bad performance? Maybe a lot of garbage collection? Their Xmx is 8GB, which has been fine since the beginning: As each shard runs in a dedicated tomcat, the new shards should not influence the memory requirements of the Solrs handling the old ones.

We don’t know what went wrong and which action fixed it. If performance starts slipping again, we’ll focus on trying one thing at a time.

Why did we think clearing the disk cache might help?

It is normally advisable to use Huge Pages when running a large Solr server. Whenever a program requests memory from the operating system, this is done as pages. If the page size is small and the system has a lot of memory, there will be a lot of bookkeeping. It makes sense to use larger pages and have less bookkeeping.

Our indexing machine has 256GB of RAM, a single 32GB Solr instance and constantly starts new Tika processes. Each Tika process takes up to 1GB of RAM and runs for an average of 3 minutes. 40 of these are running at all times, so at least 10GB of fresh memory is requested from the operating system each minute.

We observed that the indexing speed of the machine fell markedly after some time, down to 1/4th of the initial speed. We also observed that most of the processing time was spent in kernel space (the %sy in a Linux top). Systems theorized that we had a case of OS memory fragmentation due to the huge pages. They tried flushing the disk cache (echo 3 >/proc/sys/vm/drop_caches) to reset part of the memory, and performance was restored.

A temporary fix of clearing the disk cache worked fine for the indexer, but the lasting solution for us was to disable the use of huge pages on that server.

The searcher got the same no-huge-pages treatment, which might have been a mistake. Contrary to the indexer, the searcher rarely allocates new memory and as such looks like an obvious candidate for using huge pages. Maybe our performance problems stemmed from too much bookkeeping of pages? Not fragmentation as such, but simply the size of the structures? But why would it help to free most of the memory and re-allocate it? Does size and complexity of the page-tracking structures increase with use, rather than being constant? Seems like we need to level up in Linux memory management.

Note: I strongly advise against using repeated disk cache flushing as a solution. It is symptom curing and introduces erratic search performance. But it can be very useful as a poking stick when troubleshooting.

On the subject of performance…

The astute reader will have noticed that the performance-difference is strange at the 10³ mark. This is because the top of the bell curve moves to the right as the number of shards increases. See Even sparse faceting is limited for details.

In order to make the performance comparison apples-to-apples, the no_cache numbers were used. Between the 12TB and the 19TB mark, sparse facet caching was added, providing a slight speed-up to distributed faceting. Let’s add that to the chart:

256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index, post Solr-restart

Although the index size was increased by 50%, sparse facet caching kept performance at the same level or better. It seems that our initial half-dismissal of the effectiveness of sparse facet caching was not fair. Now we just need to come up with similar software improvements each month and we will never need to buy new hardware.

Do try this at home

If you want to try this on your own index, simply use the sparse solr.war from GitHub.

What is high cardinality anyway?

October 2, 2014

An attempt to explain sparse faceting and when to use it in not-too-technical terms. Sparse faceting in Solr is all about speeding up faceting on high-cardinality fields for small result sets. That’s a clear statement, right? Of course not. What is high, what is small and what does cardinality mean? Dmitry Kan has spent a lot of time testing sparse faceting with his high-cardinality field, without getting the promised performance increase. Besides unearthing a couple of bugs with sparse faceting, his work made it clear that there is a need for better documentation. Independent testing for the win!

What is faceting?

When we say faceting in this context, it means performing a search and getting a list of terms for a given field. The terms are ordered by their count, which is the number of times they are referenced by the documents in the search result. A classic example is a list of authors:

  The search for "fairy tale" gave 15 results «hits».

  Author «field»
   - H.C. Andersen «term» (12 «count»)
   - Brothers Grimm «term» (5 «count»)
   - Lewis Carroll «term» (3 «count»)

Note how the counts sum up to more than the number of documents: a document can have more than one reference to terms in the facet field. It can also have 0 references, all depending on the concrete index. In this case, there are either more terms than are shown or some of the documents have more than one author. There are other forms of faceting, but they will not be discussed here.
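
For reference, a facet request like the example above could be issued from SolrJ roughly as follows. This is a minimal sketch against SolrJ 4.x; the endpoint, collection and field name are assumptions, not taken from any concrete setup.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AuthorFacetExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr endpoint and field name - adjust to the index at hand
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("\"fairy tale\"");
    query.setRows(0);              // only the facet is of interest here, not the documents
    query.setFacet(true);
    query.addFacetField("author"); // the facet field from the example above
    query.setFacetLimit(3);        // Top-3 authors
    QueryResponse response = solr.query(query);
    for (FacetField.Count count : response.getFacetField("author").getValues()) {
      System.out.println(count.getName() + " (" + count.getCount() + ")");
    }
    solr.shutdown();
  }
}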

Under the hood

At the abstract level, faceting in Solr is quite simple:

  1. A list of counters is initialized. It has one counter for each unique term in the facet field in the full corpus.
  2. All documents in the result set are iterated. For each document, a list of its references to terms is fetched.
    1. The references are iterated and for each one, the counter corresponding to its term is increased by 1.
  3. The counters are iterated and the Top-X terms are located.
  4. The actual Strings for the located terms are resolved from the index.

Sparse faceting improves on standard Solr in two ways:

  • Standard Solr allocates a new list of counters in step 1 for each call, while sparse re-uses old lists.
  • Standard Solr iterates all the counters in step 3, while sparse only iterates the ones that were updated in step 2.
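
A toy sketch of the counting step for a single shard illustrates the difference. This is plain Java with made-up ordinals and hits, not the actual Solr or sparse implementation.

import java.util.ArrayList;
import java.util.List;

public class SparseCountingSketch {
  public static void main(String[] args) {
    int uniqueTerms = 1_000_000;                          // step 1: size of the counter list
    int[][] ordinalsPerDoc = { {3}, {3, 42}, {999_999} }; // term references for each hit (step 2)
    int[] hits = {0, 1, 2};                               // the documents in the result set

    int[] counters = new int[uniqueTerms];                // step 1: one counter per unique term
    List<Integer> touched = new ArrayList<>();            // sparse extra: which counters were updated
    for (int doc : hits) {                                // step 2: iterate the result set
      for (int ord : ordinalsPerDoc[doc]) {               // step 2.1: iterate the references
        if (counters[ord]++ == 0) {
          touched.add(ord);
        }
      }
    }

    // Step 3, standard Solr: scan all 1,000,000 counters to find the Top-X.
    // Step 3, sparse: scan only the counters that were actually updated.
    for (int ord : touched) {
      System.out.println("ordinal " + ord + " -> count " + counters[ord]);
    }
  }
}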

Distributed search is different

Distributed faceting in Solr adds a few steps:

  • All shards are issued the same request by a coordinating Solr. They perform steps 1-4 above and return the results to the coordinator.
  • The coordinator merges the shard-responses into one structure and extracts the Top-X terms from that.
  • For each Top-X term, its exact count is requested from the shards that did not deliver it as part of the first step.

Standard Solr handles each exact-count separately by performing a mini-search for the term in the field. Sparse reuses the filled counters from step 2 (or repeats step 1-2 if the counter has been flushed from the cache) and simply locates the counters corresponding to the terms. Depending on the number of terms, sparse is much faster (think 5-10x) than standard Solr for this task. See Ten times faster for details.

What is cardinality?

Down to earth, cardinality just means how many there are of something. But what thing? The possibilities for faceting are many: Documents, fields, references and terms. To make matters worse, references and terms can be counted for the full corpus as well as just the search result.

  • Performance of standard Solr faceting is linear to the number of unique terms in the corpus in step 1 & 3 and linear to the number of references in the search result in step 2.
  • Performance of sparse faceting is (nearly) independent of the number of unique terms in the corpus and linear to the number of references in the search result in step 2 & 3.

Both standard Solr and sparse treat each field individually, so they both scale linearly with the number of facet fields. The documents returned by the base search are represented in a sparse structure themselves (independent of sparse faceting) and scale with result set size. While it does take time to iterate over these documents, this is normally dwarfed by the other processing steps. Ignoring the devils in the details: standard Solr facet performance scales with the full corpus size as well as the result size, while sparse faceting scales just with the result size.

Examples please!

  • For faceting on URL in the Danish Web Archive, cardinality is very high for documents (5bn), references (5bn) and terms (5bn) in the corpus. The overhead of performing a standard Solr faceting call is huge (hundreds of milliseconds), due to the high number of terms in the corpus. As the typical search results are quite a lot smaller than the full corpus, sparse faceting is very fast.
  • For faceting on host in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. However, the number of  terms (1.3m) is more modest. The overhead of performing a standard Solr faceting call is quite small (a few milliseconds), due to the modest number of terms; the time used in step 2, which is linear to the references, is often much higher than the overhead. Sparse faceting is still faster in most cases, but only by a few milliseconds. Not much if the total response time is hundreds of milliseconds.
  • For faceting on content_type_norm in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. It is extremely small for the number of unique terms, which is 10. The overhead of performing a standard Solr faceting call is practically zero; the time used in step 2, which is linear to the references, is often much higher than the overhead. Sparse faceting is never faster than Solr for this and as a consequence falls back to standard counting, making it perform at the same speed.
  • For faceting on author at the library index at Statsbiblioteket, cardinality is high for documents (15m), references (40m) and terms (9m) in the corpus. The overhead of performing a standard Solr faceting call is noticeable (tens of milliseconds), due to the 9m terms in the corpus. The typical search result is well below 8% of the full corpus, and sparse faceting is markedly faster than standard Solr. See Small Scale Sparse Faceting for details.

Sparse facet caching

September 19, 2014

As explained in Ten times faster, distributed faceting in standard Solr is two-phase:

  1. Each shard performs standard faceting and returns the top limit*1.5+10 terms. The merger calculates the top limit terms. Standard faceting is a two-step process:
    1. For each term in each hit, update the counter for that term.
    2. Extract the top limit*1.5+10 terms by running through all the counters with a priority queue.
  2. Each shard returns the number of occurrences of each term in the top limit terms, calculated by the merger from phase 1. This is done by performing a mini-search for each term, which takes quite a long time. See Even sparse faceting is limited for details.
    1. Addendum: If the number for a term was returned by a given shard in phase 1, that shard is not asked for that term again.
    2. Addendum: If the shard returned a count of 0 for any term as part of phase 1, that means it has delivered all possible counts to the merger. That shard will not be asked again.
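
A toy sketch of this merge-and-refine logic, with made-up shard responses; it is not Solr's actual merging code, which is considerably smarter.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetMergeSketch {
  public static void main(String[] args) {
    int limit = 2;
    // Phase 1 responses: term -> count for each shard's own top terms.
    Map<String, Map<String, Integer>> phase1 = new HashMap<>();
    phase1.put("shard1", Map.of("dr.dk", 10, "eb.dk", 7, "tv2.dk", 0)); // a 0 count: shard1 is exhausted
    phase1.put("shard2", Map.of("dr.dk", 4, "pol.dk", 6, "tv2.dk", 3));

    // The merger sums the counts per term and picks the provisional top-'limit' terms.
    Map<String, Integer> merged = new HashMap<>();
    phase1.values().forEach(shard -> shard.forEach((term, count) -> merged.merge(term, count, Integer::sum)));
    List<String> top = new ArrayList<>(merged.keySet());
    top.sort((a, b) -> merged.get(b) - merged.get(a));
    top = top.subList(0, limit); // here: dr.dk (14) and eb.dk (7)

    // Phase 2: ask each shard only for top terms it did not already report (addendum 1)
    // and skip shards that returned a 0 count, as they have delivered everything (addendum 2).
    for (Map.Entry<String, Map<String, Integer>> shard : phase1.entrySet()) {
      if (shard.getValue().containsValue(0)) {
        continue;
      }
      List<String> missing = new ArrayList<>(top);
      missing.removeAll(shard.getValue().keySet());
      System.out.println(shard.getKey() + " is asked to fine-count " + missing);
    }
  }
}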

Sparse speedup

Sparse faceting speeds up phase 1 step 2 by only visiting the updated counters. It also speeds up phase 2 by repeating phase 1 step 1, then extracting the counts directly for the wanted terms. Although it sounds heavy to repeat phase 1 step 1, the total time for phase 2 for sparse faceting is a lot lower than standard Solr. But why repeat phase 1 step 1 at all?

Caching

Today, caching of the counters from phase 1 step 1 was added to Solr sparse faceting. Caching is tricky business to get just right, especially since the sparse cache must contain a mix of empty counters (to avoid re-allocation of large structures on the Java heap) as well as filled structures (from phase 1, intended for phase 2). But theoretically, it is simple: When phase 1 step 1 is finished, the counter structure is kept and re-used in phase 2. So time for testing:

15TB index / 5B docs / 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed queries

Note that there are no measurements of standard Solr faceting in the graph. See the Ten times faster article for that. What we have here are 4 different types of search:

  • no_facet: Plain searches without faceting, just to establish the baseline.
  • skip: Only phase 1 sparse faceting. This means inaccurate counts for the returned terms, but as can be seen, the overhead is very low for most searches.
  • cache: Sparse faceting with caching, as described above.
  • nocache: Sparse faceting without caching.

Observations

For 1-1000 hits, nocache is actually a bit faster than cache. The peculiar thing about this hit-range is that chances are high that all shards return all possible counts (phase 2 addendum 2), so phase 2 is skipped for a lot of searches. When phase 2 is skipped, this means wasted caching of a filled counter structure, which needs to be either cleaned for re-use or discarded if the cache is getting too big. This means a bit of overhead.

For more than 1000 hits, cache wins over nocache. Filter through the graph noise by focusing on the medians. As the difference between cache and nocache is that the base faceting time is skipped with cache, the difference of their medians should be about the same as the difference of the medians for no_facet and skip. Are they? Sorta-kinda. This should be repeated with a larger sample.

Conclusion

Caching with distributed faceting means a small performance hit in some cases and a larger performance gain in others. Nothing Earth-shattering, and as it works best when there is more memory allocated for caching, it is not clear in general whether it is best to use it or not. Download a Solr sparse WAR from GitHub and try for yourself.

Even sparse faceting is limited

September 11, 2014

Recently, Andy Jackson from UK Web Archive discovered a ginormous Pit Of Pain with Solr distributed faceting, where some response times reached 10 minutes. The culprit is facet.limit=100 (the number of returned values for each facet is 100), as the secondary fine-counting of facet terms triggers a mini-search for each term that has to be checked. With the 9 facets UK Web Archive uses, that’s 9*100 searches in the worst-case. Andy has done a great write-up on their setup and his experiments: Historical UKWA Solr Query Performance Benchmarking.

Pit Of Pain by Andy Jackson, UK Web Archive

The shape of the pit can be explained by the probability of the need for fine-counts: When there are fewer than 1K hits, chances are that all shards have delivered all matching terms with count > 0 and thus need not be queried again (clever merging). When there are more than 1M hits, chances are that the top-100 terms in each facet are nearly the same for all shards, so that only a few of the terms need fine-counting. Between those two numbers, chances are that a lot of the terms are not present in all initial shard results and thus require fine-counting.

While the indexes at Statsbiblioteket and UK Web Archive are quite comparable (12TB vs. 16TB, built with nearly the same analysis chain), the setups differ with regard to hardware as well as facet setup. Still, it would be interesting to see if we can reproduce the Pit Of Pain™ with standard Solr faceting on our 6 facet fields and facet.limit=100.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 100

Sorta, kinda? We do not have the low response times below 100 hits, and 10 minutes of testing only gave 63 searches, but with the right squinting of eyes, the Pit Of Pain (visualized as a hill to trick the enemy) is visible from ~1K to 1M hits. As for the high response times below 100 hits, they are due to a bad programming decision on my side – expect yet another blog post. As for the pit itself, let’s see how it changes when the limit goes down.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 100, 50 & 5

Getting a little crowded with all those dots, so here’s a quartile plot instead.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 100, 50 & 5

Again, please ignore results below 100 hits. I will fix it! Promise! But other than that, it seems pretty straightforward: high limits have a severe performance penalty, which seems to be more or less linear in the limit requested (hand waving here).

The burning question is of course how it looks with sparse faceting. Technically, distributed sparse faceting avoids the mini-searches in the fine-counting phase, but still requires each term to be looked up in order to resolve its ordinal (it is used as index in the internal sparse faceting counter structure). Such a lookup does take time, something like 0.5ms on average on our current setup, so sparse faceting is not immune to large facet limits. Let’s keep the y-axis-max of 20 seconds for comparison with standard Solr.

12TB index / 4.2B docs / 256GB RAM, sparse faceting on 6 fields, facet limit 100, 50 & 5

There does appear to be a pit too! Switching to quartiles and zooming in:

12TB index / 4.2B docs / 256GB RAM, sparse faceting on 6 fields, facet limit 100, 50 & 5

This could use another round of tests, but it seems that the pit is present from 10K to 1M hits, fairly analogous to Solr fc faceting. The performance penalty of high limits also matches, just an order of magnitude lower. With a worst case of 6*100 fine-counts (at ~10^5 hits) on each shard and an average lookup time of ½ms, having a mean for the total response time around 1000ms seems reasonable. Everything checks out and we are happy.

Update 20140912

The limit for each test was increased to 1 hour or 1000 searches, whichever came first, and the tests were repeated with facet.limits of 1K, 10K and 100K. The party stopped early with an OutOfMemoryError for 10K, and since raising the JVM heap size would skew all previous results, what we got is what we have.

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 1000

Quite similar to the Solr fc faceting test with facet.limit=100 at the beginning of this post, but with the Pit Of Pain moved a bit to the right and a worst case of 3 minutes. Together with the other tested limits, shown as quartiles, we have

12TB index / 4.2B docs / 256GB RAM, Solr fc faceting on 6 fields, facet limit 1000, 100, 50 & 5

Looking at the Pit Of Pain in isolation, we have the median response times in milliseconds:

facet.limit  10^4 hits  10^5 hits  10^6 hits  10^7 hits
       1000      24559      70061     141660      95792
        100       9498      16615      12876      11582
         50       9569       9057       7668       6892
          5       2469       2337       2249       2168

Without cooking the numbers too much, we can see that the worst increase when switching from limit 50 to 100 is for 10^5 hits: 9057ms -> 16615ms or 1.83 times, with the expected increase being 2 (50 -> 100). Likewise the worst increase from limit 100 to 1000 is for 10^6 hits: 12876ms -> 141660ms or 11.0 times, with the expected increase being 10 (100 -> 1000). In other words: worst-case median response times (if such a thing makes sense) for distributed fc faceting with Solr scale linearly with facet.limit.

Repeating with sparse faceting and skipping right to the quartile plot (note that the y-axis dropped by a factor 10):

12TB index / 4.2B docs / 256GB RAM, sparse faceting on 6 fields, facet limit 1000, 100, 50 & 5

Looking at the Pit Of Pain in isolation, we have the median response times in milliseconds:

facet.limit  10^4 hits  10^5 hits  10^6 hits  10^7 hits
       1000        512       2397       3311       2189
        100        609        960        698        939
         50        571        635        395        654
          5        447        215        248        588

The worst increase when switching from limit 50 to 100 is for 10^6 hits: 395ms -> 698ms or 1.76 times, with the expected increase being 2. Likewise the worst increase from limit 100 to 1000 is also for 10^6 hits: 698ms -> 3311ms or 4.7 times, with the expected increase being 10. In other words: worst-case median response times for distributed sparse faceting appear to scale better than linearly with facet.limit.

Re-thinking this, it becomes apparent that there are multiple parts to facet fine-counting: a base overhead and an overhead for each term. Assuming the base overhead is the same (the number of hits is), we can solve with the 10^6-hit medians for facet.limit=100 and facet.limit=1000: (6000 - 600) terms * x = 3311ms - 698ms gives a per-term overhead of x ≈ 0.48ms for sparse and a base overhead of 698ms - 600 * 0.48ms ≈ 408ms (remember we have 6 facets, so facet.limit=1000 means a worst case of fine-counting 6000 terms). If that holds, setting facet.limit=10K would give a worst-case median response time of around 30 seconds.

A few Highlights from the Digital Libraries #DL2014 conference in London

September 10, 2014

This post is written at the DL2014 conference in London in September, where it’s warm and sunny :)

My visit started at the Digital Preservation Sustainability on the EU Policy Level workshop on Monday. This workshop had an interesting panel debate and brought together a number of EU projects in the digital preservation area. Jette Junge will probably blog about this workshop on the opf blog following up on this post.

Tuesday morning at the main conference started with a Keynote by Prof. Dieter Fellner from the Fraunhofer Institute, who talked about 3D-Digitization of Cultural Heritage. He told us that 90% of museum artifacts are hidden in archives, and he proposed using 3D-digitization for preservation and restoration, and maybe replication? He talked of his vision of 3D Cultural Heritage Artifacts as First Class Citizens in Digital Libraries, and he showed us an example of Manual 3D Digitization of “paper” from Easter Island. He also showed us some cool 3D scanner gadgets and asked the (rhetorical) question

Is this just implementation or is it research?

And he presented a number of research challenges:

  • Faster processing, Improved Workflows, Cost Reduction
  • Improve methods and standards for Geometry, Texture, Material Properties
  • Automated Markup and ….

There are already some projects working on speeding up the process, such as DOME, ORCAM and CultLab3D.de (industrial), and on serving the 3D models to the user via the 3D internet and X3DOM. There is the challenge of 3D Color Printing. And there are a number of Open Issues such as

  • Digital Rights
  • Signatures
  • Certification
  • Formats
  • Long Term Preservation

I see a bunch of interesting Research (and implementation) Challenges in this. I just find it difficult to put it into a State and University Library context ;-)

Next I went to (the important coffee break and next to) Session 1 on Preservation Strategies. This session had four interesting presentations. Note to self: acid.matkelly.com might be worth checking out (from the last presentation in this session Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript by Mat Kelly)!

At lunch time I went to the TPDL Steering Committee meeting (back stage in the dressing room for the women’s choir). It was a long meeting, but a good one. I’m confident that the TPDL conference is on the right path and in the right hands – at least for the immediate future.

In the afternoon I went to Session 4 on Building Systems. Also four interesting presentations. The second one was ours, Bridging the Gap Between Real World Repositories and Scalable Preservation Environments, presented by Per Møldrup-Dalum. It came with cool crowd-pleasing slides, and the elephant in the library was a success again!

Then there was the Minute Madness: Poster and Demonstration Session Preview, with lots of cool presentations and, it seemed, more people on stage than off. And then the actual Poster and Demonstration Reception in the Garden Room.

The Garden

The Room

I think I’ll publish this now. I may write about today and tomorrow some time. Or I may tell some of you about it offline. Especially the State and University Library web guys. You should be here. Lots of presentations involving Solr and crowd sourcing. Will you join us next year?

 

Small scale sparse faceting

September 9, 2014

While sparse faceting has a profound effect on response times in our web archive, we are a bit doubtful about the number of multi-billion document Solr indexes out there. Luckily we also have our core index at Statsbiblioteket, which should be a bit more representative of your everyday Solr installation: single-shard, 50GB, 14M documents. The bulk of the traffic is user-issued queries, which involve spellcheck, edismax qf & pf on 30+ fields and faceting on 8 fields. In this context, the faceting is of course the focus.

Of the 8 facet fields, 6 are low-cardinality and 2 are high-cardinality. Sparse was very recently enabled for the 2 high-cardinality ones, namely subject (4M unique values, 51M instances (note to self: 51M!? How did it get so high?)) and author (9M unique values, 40M instances).

To get representative measurements, the logged response times were extracted for the hours 07-22; there’s maintenance going on at night and it skews the numbers. Only user-entered searches with faceting were considered. To compare before- and after sparse-enabling, the data for this Tuesday and last Tuesday were used.

50GB / 14M docs, logged QTimes from production, without (20140902) and with (20140909) sparse faceting

The performance improvement is palpable, with response times being halved compared to non-sparse faceting. Reading the logs closely, the time spent on faceting the high-cardinality fields is now in the single-digit milliseconds for nearly all queries. We’ll have to do some tests to see what stops the total response time from getting down to that level. I am guessing spellcheck.

As always, sparse faceting is readily available for the adventurous at SOLR-5894.

Update 20140911

To verify that last Tuesday was not a lucky shot, here are the numbers for the last 4 Wednesdays. Note that the number of queries/day is fairly low for the first two weeks. This is due to semester start. Also note that the searches with 10^8 hits (basically the full document set) were removed, as those were all due to the same query being repeated by a dashboard tool.

50GB / 14M docs, logged timing from production. Only 20140909 is with sparse faceting

Ten times faster

August 26, 2014

One week ago I complained about Solr’s two-phase distributed faceting being slow in the second phase – ten times slower than the first phase. The culprit was the fine-counting of top-X terms, with each term-count being done as an intersection between regular hits and a special search for the term in question.

Let’s have a quick look at the numbers from last week (note that the scales are logarithmic):

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

Imprecise facet counting aka cheating

The simple way to get fast distributed faceting for high cardinality fields is to skip the second phase and accept that the term counts for faceting might be a bit off, where “a bit” is highly dependent on corpus & query. An extremely quick ad-hoc test with our corpus suggested around 10% deviation from the correct counts. The skipping requires just 3 lines of code, strategically placed.

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Apologies for the colors not matching between the charts. The median for plain Solr faceting is 1608 ms. For imprecise facet counting, using sparse faceting for the first phase, it is 168 ms. Quite a speed-up! Read Whale hunting with Solr for an explanation of the weird response times below 100 hits.

Since we are already cheating and getting imprecise counts, we might as well limit the maximum count for each term. In our 12 shards, the maximum count for a single URL in any shard is a little below 5000, with the long tail dropping very quickly. Most counts are below 10 in a single shard. With a maximum count of 5000, we need 13 bits to hold each counter, meaning 3.6 billion terms * 13 bits/term ~= 5.44 GB for all counter structures, or about 0.45 GB / shard / thread. If we lower this max count to 255 / shard, so that a single counter fits in a byte, we get faster faceting and reduce the memory overhead to 3.6 GB total or 300 MB / shard / thread.
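
A toy sketch of the capped-counter idea in plain Java (not the actual implementation): one byte per unique term in a shard, saturating at 255 instead of overflowing, which is where the 300 MB / shard / thread estimate comes from.

public class CappedCounterSketch {
  public static void main(String[] args) {
    int uniqueTerms = 1_000;                   // stand-in; roughly 300,000,000 in a real shard
    byte[] counters = new byte[uniqueTerms];   // 1 byte per term instead of 13 bits

    int ordinal = 42;                          // made-up term ordinal from a hit
    if (counters[ordinal] != (byte) 0xFF) {    // saturate at 255 rather than wrap around
      counters[ordinal]++;
    }
    int count = counters[ordinal] & 0xFF;      // read back as unsigned
    System.out.println("count for ordinal " + ordinal + ": " + count);
  }
}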

Alas, some of us think that all this cheating is getting out of hand…

Once more, with feeling!

It was possible to speed up the first phase of Solr faceting by doing sparse counting, so let’s try that again: for the second phase, we do a near-complete repetition of the first phase, so the counts for all terms in the facet field are calculated. However, instead of calculating and extracting the top-X terms, only the counts for the requested terms are extracted from the counting structure. Extracting a count from the structure requires resolving the ordinal of the term in question. This does take some time, but the hope was that it would not give too much overhead. So did it help?

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

This is getting a bit chaotic and it is hard to see all the cyan dots hiding between the green ones. Trusty old percentile plot to the rescue:

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting with sparse term lookup

Now we can see! With the second phase of faceting being nearly as fast as first phase, total faceting time for small result sets is looking quite good. If we lump all the measurements for each method together, we get

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Note how the median for skip_secondary is a lot lower than for the previous test run – it seems that someone else was using the machine at that time. The outputs have been verified by random inspection: it really only takes a few hundred milliseconds to facet on small result sets out of more than 3 billion documents on a single machine. Just as it should, one might add. It remains to be tested if smaller setups benefit just as much.

We’re not done yet

The implementation is not ideal as the exact same work – counting of all term occurrences – is very often done twice. It would be a lot better to cache it. When releasing a counter structure to the pool of structures, it could be released with a tag stating if it would probably be used again (first phase of distributed faceting) or if it would probably not be needed any more (after the second phase). Guesstimating, this should shave 30-40% of the total time for faceting with sparse term lookup.
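
A toy sketch of such a pool with a reuse hint; this is an illustration of the idea, not the actual sparse implementation.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CounterPoolSketch {
  private final int uniqueTerms;
  private final ArrayDeque<int[]> cleaned = new ArrayDeque<>(); // empty counters, ready for phase 1
  private final Map<String, int[]> filled = new HashMap<>();    // phase 1 results, awaiting phase 2

  public CounterPoolSketch(int uniqueTerms) {
    this.uniqueTerms = uniqueTerms;
  }

  /** Phase 1 or 2: get a counter structure, preferring one already filled for this query. */
  public synchronized int[] acquire(String queryKey) {
    int[] counters = filled.remove(queryKey);
    if (counters != null) {
      return counters;                                          // phase 2 reuses the phase 1 counts
    }
    counters = cleaned.poll();
    return counters != null ? counters : new int[uniqueTerms];
  }

  /** Release with a hint: keep the filled counters if a second phase is probably still coming. */
  public synchronized void release(String queryKey, int[] counters, boolean probablyNeededAgain) {
    if (probablyNeededAgain) {
      filled.put(queryKey, counters);
    } else {
      Arrays.fill(counters, 0); // the real code would only need to clear the touched entries
      cleaned.push(counters);
    }
  }
}

A real cache would of course also have to bound its size and clean up filled structures that are never picked up again.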

Should anyone want to try sparse faceting for themselves, then visit https://github.com/tokee/lucene-solr and check out branch lucene_solr_4_8_sparse or lucene_solr_4_9_sparse. You will need an existing Solr index for proper testing. Refer to the file SparseKeys.java for options. The defaults work fine if the parameters facet.sparse=true and facet.sparse.termlookup=true are given, the requested facet field has over 10,000 unique values and facet.method=fc. To disable the second phase completely, add the parameter facet.sparse.skiprefinements=true. Proper documentation pending.
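
As an illustration, the parameters above could be added to a request from SolrJ like this (a sketch; the endpoint and the url field name are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SparseFacetRequest {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("random words");
    query.setFacet(true);
    query.addFacetField("url");                 // assumed high-cardinality field (>10,000 unique values)
    query.setFacetLimit(20);
    query.set("facet.method", "fc");            // required for the sparse code path
    query.set("facet.sparse", true);            // enable sparse counting
    query.set("facet.sparse.termlookup", true); // sparse fine-counting in the second phase
    // query.set("facet.sparse.skiprefinements", true); // optionally skip phase 2 (imprecise counts)
    QueryResponse response = solr.query(query);
    System.out.println("QTime: " + response.getQTime() + " ms");
    solr.shutdown();
  }
}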

If you want to see this in Solr, visit SOLR-5894 and vote for it.

Update 20140826 22:26

To verify things, the experiment was repeated with 10 minute running time for each faceting method (as opposed to 3 minutes). This did not affect the conclusion, but might add a little bit of information.

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

The first thing to note is how response time rises nearly linearly with result set size when the result set is 5M or more. It would be interesting to investigate what happens around the 300M (8% of the corpus size) mark, which is the limit chosen for sparse faceting for this setup.

The second observation is that all methods except stock Solr (the blue dots) seem to have two clusters, one below 100ms and one above. As the no_facet method is single phase (note to self: double check if this is true), this cannot be explained by the second phase being skipped. Maybe there is some caching effect? The queries should be as good as unique, so it is not just because of simple request caching.

For an alternative illustration, here’s the same data as above but without the logarithmic y-scale:

256GB RAM, 1 thread, 12 shards, 10TB, random words, faceting on URL, numbers from full distributed faceting

Ten times slower

August 15, 2014

I jumped the gun on our current web index status with our whale hunting safari, but it turns out that there are other fish to kill (sorry, whales are not fish and I’ll stop with the nautical metaphors immediately). This time we’re talking faceting on billions of documents.

At Statsbiblioteket we are building a SolrCloud web index of our harvested material. 9 months ago that was 372 TB or 10 billion entities. It has probably passed 400 TB by now. Our insane plan was to index it, put the indexes on a single machine and to do faceted searches. Because it made sense and maybe a little because it is pure awesome to handle that amount of information from a single box. Read about our plans in the post Webscale in Danish.

It has been about three months since we last checked how things were with our searchable web archive at Statsbiblioteket. Back then it contained 4 shards for a total of 3.6 TB or 1.2 billion documents, on our single 16 core machine with 256 GB of RAM. Simple searches were fast and faceting on URL (nearly unique for every document) equally so, even when we lowered the amount of RAM to 80 GB, which meant that only 1% of the total index data could be cached in RAM. The two graphs below illustrate the state back then.

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, Solr & sparse faceting on URL

80GB RAM, 1 thread, 4 shards, 3.6TB, random words, sparse faceting on URL

Sparse faceting is markedly better than stock Solr under our test; performance is very satisfactory with most response times well below 1 second. It is important to note that the distributed faceting calls are executed with a single thread in each shard for this setup. This means that only 4 CPUs were fully utilized at a time during a single faceted search.

Yesterday we reached 12 shards for a total of 10.7 TB of index data in 3.6 billion documents. Turning off the artificial RAM limitation left us with 140 GB of RAM for disk cache, or 1.3% of the full index size. So more RAM for caching than we had with 4 shards and still plenty of CPU power, as the machine has 16 cores (*2 if you count HyperThreading). Of course the merging gets a little heavier, but not too much. In an ideal world this would mean that the performance would be unchanged, right?

No, we did not think so either. So how bad is it?

 

256GB RAM, 1 thread, 12 shards, 10TB, random words, Solr & sparse faceting on URL

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL

Ouch. 2½ second median? That’s bad! But wait a minute… Doesn’t that percentile plot look like a whale? And how come this is so much slower than our 4 shard setup, and how come it is faster to facet on 100,000 hits than on 10,000? Time to check the logs.

Distributed Solr faceting is two-phase. First phase is standard faceting (find the top-X facet terms for a given search). The merger then collects the results, sums them and extracts the collective top-X terms. Second phase is to ask the shards for the counts for those terms, to get the correct count as the terms might not have been returned from all shards in the first phase. The merger is smarter than that, but the principle holds.

It seems logical that the second phase would be faster than the first: it just has to calculate counts for a limited number of specified terms instead of performing full facet processing. Let’s go back to the logs and plot the response times for the first and second phase separately. Note that these numbers are from the 12 Solr logs at shard-request level and not at the full distributed call level.

256GB RAM, 1 thread, 12 shards, 10TB, random words, sparse faceting on URL, phase 1 and 2 separately, numbers from the individual shard requests

There goes that logical conclusion. The second phase takes more than 10 times as long as the first phase! What is happening? We need to dig deeper and look at the surprisingly simple code for the second phase:

private NamedList getListedTermCounts(String field, String termList, DocSet base) throws IOException {
  FieldType ft = searcher.getSchema().getFieldType(field);
  List<String> terms = StrUtils.splitSmart(termList, ",", true);
  NamedList<Integer> res = new NamedList<Integer>();
  for (String term : terms) {
    String internal = ft.toInternal(term);
    // One mini-search per requested term: intersect the result of the original
    // query (base) with the documents matching field:term and count the overlap.
    int count = searcher.numDocs(new TermQuery(new Term(field, internal)), base);
    res.add(term, count);
  }
  return res;
}

We have the result of the query in the bitset base; with shards of 300 million documents, that is a rather large bitset. For each term in the second phase facet request, a simple search is performed for facet_field:specific_term. This results in another bitset. The number of intersecting bits in these two sets is the count for the term.

The problem here is that we are doing intersections of very large bitsets. Potentially they can be represented by compact hashmaps or other structures, but the log tells us that this still takes quite a lot of time for a corpus of this size. Time that grows as the number of shards grows.

Guessing time: If at least one of the bitsets is a bitmap with 1 bit per document in the index, that takes up about 40 MB of heap, which is accessed when doing the intersection. If there are 20 terms in the request (quite likely as we ask for top-20 on URL), this is done 20 times. So at least 800MB of memory is accessed. With 12 shards doing faceting in parallel, this is just shy of 10 GB of memory. It is hard to measure memory wait time, but it seems a likely culprit that we are simply waiting for main memory.

The shape of the phase 2 plot is easily explainable: with 10 to 10,000 hits, all shards must provide counts for nearly the full set of terms. When we get above 100,000, the chances of any shard having already delivered the count for part of the top-X terms rise; when a count has already been delivered, it will not be requested again in the second phase, so the workload gets smaller.

So what to do about that? The easiest thing is to skip the second phase completely. That would give us the great response times instantly, at the cost of precision: the counts for the individual terms in the facet might be a bit off. But maybe there is a way to get the fast speed and the precise counts? The full facet counting out in the shards in the first phase was quite a lot faster, so if we do that again (or cache the old result), we have counts for all terms in the facet. For each specified term, we could get its ordinal (for example by binary search, meaning 36 lookups in our concrete ordinal-term map) and with that, we could get the count directly. Ordinal-term lookups are somewhat heavy, as the index data on storage needs to be accessed, so it remains to be seen if this way of handling the second phase is faster than the standard one. Time to code again.
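
A toy sketch of that proposed second-phase handling, using plain Java with stand-in data; the real version would resolve ordinals against Lucene's term dictionary rather than an in-memory array.

import java.util.Arrays;

public class OrdinalFineCountSketch {
  public static void main(String[] args) {
    // Stand-ins: the facet field's terms in sorted order and the counters
    // filled during the first faceting phase.
    String[] sortedTerms = {"http://a.dk/", "http://b.dk/", "http://c.dk/"};
    int[] counters = {12, 0, 7};

    // Fine-count the terms requested by the merger without any mini-searches.
    for (String term : new String[]{"http://c.dk/", "http://a.dk/"}) {
      int ordinal = Arrays.binarySearch(sortedTerms, term); // the "36 lookups" mentioned above
      int count = ordinal >= 0 ? counters[ordinal] : 0;
      System.out.println(term + " -> " + count);
    }
  }
}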

