Which type bug?

A light tale of bug hunting an Out Of Memory problem with SolrCloud.

The setup and the problem

At the Royal Danish Library we provide full text search for the Danish Netarchive. The heavy lifting is done in a single collection SolrCloud made up of 107 shards (for a total of 94TB / 32 billion documents). All queries are issued to a Solr instance with an empty shard, with the sole responsibility of aggregating responses from the real shards.

One of the frontends is SolrWayback, which is a JavaScript application backed by a middle layer acting as an advanced proxy; issuing searches, rewriting HTML, doing streaming exports and so.

The problem this time was that the aggregating Solr node occasionally crashed with an Out Of Memory error, where occasionally means that it sometimes took months to crash, sometimes days.

Clues and analysis

  • Access to the Netarchive Search is strictly controlled, so there were no chance of denial of service or fimilar foul play.
  • Log analysis showed modest activity (a maximum of 9 concurrent searches) around the time of the latest crash.
  • The queries themselves were run-of-the-mill, but the crashing queries themselves were not logged, as Solr only logs the query when is has been completed, not when it starts.
  • The Garbage Collection logs showed that everything was a-ok, right up til the time when everything exploded in progressively longer collections, culminating in a 29 second stop-the-World and no heap space left.
Heap graph with stop-the-world GC as red triangles, courtesy of gceasy.io

Should be simple to pinpoint, right? And (plot twist) for once it was! Of course we chased the usual red herrings, but ultimately “dissect the logs around the problematic time slots” won the day.

Pop quiz: What is wrong with the log entries below? (meticulously unearthed from too many nearly-but-not-fully-similar entries and with timestamps adjusted to match graph-timezone).

1) 2020-06-04 19:29:06.285 INFO (qtp1908316405-618188) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=100&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2015"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=53532331 status=0 QTime=2181
2) 2020-06-04 19:33:32.418 INFO (qtp1908316405-619134) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=10020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2015"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=53527106 status=0 QTime=6958
3) 2020-06-05 20:33:26.204 INFO (qtp1908316405-639768) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=10020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2017"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=3785666 status=0 QTime=3650
4) 2020-06-05 20:34:36.078 INFO (qtp1908316405-641342) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=1002020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2017"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=3781705 status=0 QTime=39489
5) 2020-06-05 20:43:25.303 INFO (qtp1908316405-639769) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=1002020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2018"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=15355247 status=0 QTime=166414

If your answer was “Hey, what’s up with start!?” then you are now officially a Big Search Analyst. Your badge will arrive shortly.

For those not catching it (that included me for a long time):

  1. A search is issued with a query for facebook material from 2015 with the parameters start=100&rows=20 (corresponding to page 6 in a UI which shows 20 results/page). Response time is 2 seconds.
  2. The same query is repeated, this time with start=10020&rows=20. If the intent was to go to page 7 in the UI, we would expect start=120&rows=20. Response time is 7 seconds.
  3. The query is changed to facebook material from 2017, still with start=10020&rows=20. Seems like someone’s URL hacking. Response time is 3½ seconds.
  4. Same query as in #4, but now with start=1002020&rows=20. Response time jumps to 39 seconds.
  5. The query is changed to facebook material from 2018, with the previous start=1002020&rows=20 intact. Response time jumps to 166 seconds.

Locating the error

Time to inspect the code responsible for the paging:

if (this.start + 20 < this.totalHits) {
    this.start = this.start + 20;

Seems innocent enough and when we tested by pressing “Next” a few times in SolrWayback, it did indeed behave exemplary: start=0, start=20, start=40 and so on.

Looking further down we encounter this nugget:

/* Method used on creation, reload and route change to get query parameters */
    this.myQuery = this.$route.query.query;
    this.start= this.$route.query.start;
    this.filters = this.$route.query.filter;

A quick appliance of console.log(typeof this.start) in the right place tells us that when the UI page is reloaded, which happens when the URL is changed by hand, the type of this.start becomes a string!

Loosely typed languages is a taste not acquired by your humble author.

Back to the code for skipping to the next page: this.start = this.start + 20;
If this.start is 100 to begin with and if it is a string, we suddenly have "100" + 20, which JavaScript handles by casting the number 20 to the string 20: "100" + "20" = "10020". That translates to page 502 instead of page 2, which of course is not what the user wants, but how does it become a memory problem?

SolrCloud internals and the smoking gun

The SolrCloud for Netarchive Search is a distributed one (remember the 107 shards?), so when 20 documents starting at position 10020 are needed, the master must request start=0&rows=10040 document representations from each shard, sort them and deliver documents 10020-10039. For our setup that means holding up to 10040*107 = 1 million document representations in memory.

The master node has one job and this it it, so it handles the load. Yes, it bumps heap requirements temporarily with a gigabyte or two, but that’s okay. It still delivers the result in 7 seconds.

So what happens when the user presses Next again? Yes, "10020" + 20 = "1002020". That’s a factor 100 right there, as we move 2 decimal places. And master has -Xmx=8g… Fortunately the logged request only matched 15 million documents, so the master Solr got by with a 4GB bump to the heap (the first spike in the graph) at that time.

Knowing what to look for (start=xxxx, where xxxx is at least 4 digits), it is simple to find the last relevant log entry before the crash:
grep "start=[1-9][0-9][0-9][0-9]" solr.log.1

2020-06-08 08:29:43.898 INFO (qtp1908316405-709360) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=twitter.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=4020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"twitter.com"+AND+content_type_norm:"html"+AND+crawl_year:"2015"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=53598240 status=0 QTime=3835

Here we have start=4020 and 54 million hits. The aggregating Solr died 10 minutes later. $10 says that the request that crashed the master Solr was for the same query, but with start=402020.

As 402020 document representations * 107 shards equals 43 million document representations, the master JVM might have survived with -Xmx=12g. If not for the huge amount of tiny objects overloading the garbage collector.

Fixes and take aways

Easy fix of course: Cast this.start in the JavaScript layer to integer and enforce an upper limit for start & rows in the middle layer for good measure.

For next time we’ve learned to

  • Closely examine slow queries (Captain Obvious says hello)
  • Keep GC-logs a few restarts back (we only had the current and the previous one to start with)
  • Plot the GC pauses to see if there are spikes that did not crash the JVM (trivial to do with gceasy.io), then inspect the request logs around the same time as the spikes

Posted in eskildsen, Solr | Tagged , | Leave a comment

Touching encouraged (an ongoing story)

A recurring theme at KB Labs is to show a lot of pixels. By chance we got our hands on a 4K touch-sensitive display, capable of showing a non-trivial amount of said pixels on a non-trivial surface area. Our cunning plan is to

  • Adapt some of the labs products to work on the display
  • Put the display somewhere in the public area of the library
  • Watch people swoon when they delve into beautiful cultural heritage data

This post is intended to be a diary of sorts, journaling what we learn.

Coincidental activation (2019-10-10)

We have talked about experimenting with interactive large displays for years. With emphasis on talked.

It took someone with youthful initiative to actually do something: Max Odsbjerg Pedersen discovered a usable & unused display and promptly sent us a video showing him using a labs product on the display. 4 days later he brokered a loan agreement and 10 days later we verified that no one questions two people removing a large display, as long as it is transported in a cardboard box.

Adding heavy box moving to résumé

Fair warning (2019-10-23)

The software development department has a – not entirely undeserved – reputation of being loose cannons that tend to muck about in ways that unexpectedly affects other departments.

To atone for blunders past and primarily because it really is the most constructive practice, representatives of the Cultural Heritage and the Communications departments were duly informed about the initiated process and invited to participate in discussion hereof. In other words: We met them at lunch as usual and said “Hey, we’ve got this nifty idea …”, to which they answered “Sounds good!”.

What have we got? (2019-10-24)

The display is a 55″ Samsung Flp. Its internal software seems focused on providing a digital flip-over with some connectivity possibilities? It does not have a build-in web browser, but connecting it to a Windows 10 machine is exceedingly easy. We will just have to duct tape a laptop to its back or something to that effect.

It was a pleasant surprise to discover that the excellent tool OpenSeadragon works perfectly out of the box with multi touch on desktop browsers: Tap, double tap, drag, pinch & spread. Well, as long as you are not a lazy developer that still use a pre-2017 version of OpenSeadragon where pinching is wonky *cough*.

Adapt, by which we mean “remove stuff” (2019-10-25)

Three KB Labs products, which would benefit from a large display, were selected:

At the core they are all web pages using OpenSeadragon and as such, adaptation mostly meant removing features and interface elements. A simple navigational area was added to switch between the products and the PoC Mark I alpha was born: Best viewed on a 55″ tablet or larger.

Secure, by which we mean “fail” (2019-10-25)

Since the whole thing is intended for public display & interaction, we want to make sure the users stay on the designated pages.

A developer navigating one of the designated pages

Pressing F11 switches to full screen with no navigational bar in Chrome and the end users does not have access to the keyboard, so problem solved? Our boss Bjarne Andersen passed by, stopped and played with the presentations. It took him 2 minutes to somehow activate right-click and presto, the box was cracked. Thanks boss!

Well, Chrome has a designated kiosk mode: Simply add -kiosk as an argument and all is taken care off. At least until co-worker Kim Christensen discovers that there is a handy swipe-from-a-vertical-edge gesture that opens the Windows menu and other elements. Cracked again. Thanks Kim!

Disabling swipe gestures did not seem possible without admin rights, which we do not have on the current computer. There seems to be a Windows kiosk mode that also requires admin rights. Oh well, maybe Monday? Weekend calls.

Broken Windows and tweaks (2019-10-28)

Colleague Thomas Egense brought a private windows laptop to work (no worries, we only connect those to the eduroam network).

  • It would not connect to the large display. Reboot.
  • It did connect to the display in 4K, but not to WiFi. Reboot.
  • It did connect to WiFi, but would no longer connect to the display. Reboot.
  • Same. Reboot.
  • Same. Give up.
  • Actively avoid defenestrating the laptop. Drink coffee.

At least it went a little better when colleague Gry Vindelev Elstrøm stopped by. She suggested adding some sort of map overlay, so that the users would not get lost in the big collages? And of course OpenSeadragon delivers again: 90 seconds and a reload later the wish were granted:

OpenSeadragon with Navigator overlay

Gry’s other wish: To have visual-similarity spatial grouping of the maps collection is both valid and entirely possible to fulfill. Buuut it is not a 90 second job and the touch screen project is a side project, so that idea was filed under when We Find The Time.

And then they were two (2019-10-29)

Heroic display digger Max Odsbjerg Pedersen phoned in and said he had found a twin display lying around. He’ll put it up somewhere at AU Library, Nobel Park, mirroring the display we’re working with at the Reoyal Danish Library, Aarhus. Thank you, Max. You do realize we’re at the early Proof on Concept stage, right?

Go ahead is a given (2019-10-30)

Gitte Bomholt Christensen deals with the public space at the library. She visited to take a look at the project. Her first question was if we should put the display on a movable stand or if a wall mount were better. We’ll take that as a “Yes, we’ll go forward with this experiment”.

Soon they will be five (2019-10-31)

Early in the day miniboss Katrine Hofmann Gasser asked for requirements for 3 extra touch displays. Later in the day, miniboss Katrine Hofmann Gasser had ordered 3 extra touch displays.

Damn, people! What happened to the concept of testing a minimum viable product followed by iterative improvement?

The hunt for 4K (2019-11-05)

The afternoon was not free (they never are), but at least it was not bogged down with other stuff. So what about upping the resolution from HD to 4K? How hard can that be? Yeah, 4 trips to Operations and 3 different computers produced the new knowledge that passive DisplayPort → HDMI cables have trouble delivering the goods. Native HDMI 1.4 handles 4k though: Admittedly at 30Hz only, but that works well enough when the interface reflects finger movements directly. The only situation where the 30Hz is noticeable is when the user pans by flinging.

Gridified tSNE (2019-11-06)

Running image collections through a trained network and positioning them according to their visual similarity is one of those “the future is now”-things. One favourite is Pix-Plot which produces an interactive 3D-visualization. But the touch screen is meant for large images and Pix-Plot is not made to display those. Plotting directly to 2D does not solve the problem:

A bit hard to enjoy all the images when they cover each other

A marriage between Pix-Plot and the existing zoomable grid-based layout was proposed. Some hacking later with the tools ml4a & RasterFairy and… Yeah, kinda-sorta? As can be seen on the screenshot below, there are definitely areas of similar images, but there are also multiple areas that looks like they should be collapsed into one. Something’s off, but that will have to be solved later.

There’s definitely some grouping there. And groups that looks suspiciously similar
There are no image duplicates – we checked!

Frontpage material (2019-11-14)

Thomas Egense wanted something else on the large touch screen, so he extracted all frontpages from the Danish newspaper Berlingske Tidende, available from Mediestream (of course he cheated and took an internal shortcut). It is quite an old newspaper, so “all” means 68,367 pages.

Rendering 68K fairly-high-res images is no technical problem, but as our scans are greyscale the result was somewhat bland and extremely cumbersome to navigate with intention.

A sea of grey

Thankfully newspaper frontpages do possess one singular reliable piece of metadata: The date of the paper. Adding an ugly date picker was easy enough and presto! Intuitive navigation of the primary navigational axis for the material.

Proper tSNE (2019-11-25)

A breakthrough discovery was made today: If you clumsily swap the x and y axis for the coordinates, but keep the calculated width and height, when you plot gridified data, the result is … less good. Corollary: If you un-swap said axes the result looks much better! As demonstrated by these before and after images:

Sorry about doing this on a not-fully-public-yet dataset (the awesome “anskuelsesbilleder” at AU Library, Emdrup): We can only show the scaled-down versions of the properly gridified tSNE layout, but they should convey the gist.

Maybe machines can label our stuff? (2019-11-27)

Since machine learning was great at positioning images according to visual similarity (or rather a mix of visual similarity and content similarity), maybe use it to automatically label our material? Well, not so much with the collection of anskuelsesbilleder: The network (imagenet) is great for labelling photographies but poor for drawings: “Binder”, “Web site”, “Envelope” and “Jigsaw puzzle” were recurring and absolutely wrong guesses. Again, sorry for not being allowed to show the images. Hopefully later, when the rights has been cleared, ok?

Ideas aplenty (2019-11-27)

Karen Williams was the nearest innocent bystander to show the latest experiment with the large touch screen and she upped the ante, asking for drive-by crowd-sourcing capabilities and visualization of sound. So much untapped potential in this!

Organisations gotta organise (2019-11-28)

One does not simply walk down a put a touch screen on the wall. It is hard to have patience with a new toy in hand, but it is understandable that a mix of people from different departments must participate on something that involves display of cultural heritage data in the public areas of the library. Unfortunately it will be nearly 2 weeks before said mix of people are able to meet and determine how to go about the project. Deep breath. Inner peace. Tempus fugit.

Pong detour (2019-12-06)

Annual christmas party time! And Jesper Lauridsen did not miss the oportunity that a big touch screen presented. He whipped up a multi-ball Pong game where the balls were the faces of the people at the party. Will you be hitting your colleague with a bat or let said colleague fall into oblivion? Great success and nobody spilled beer on the touch table! And no, sorry, not allowed to show it due to the face thing. Privacy and all.

Last details finished in the Real Hardware department by leet hacker Jesper

Proper public tSNE (2019-12-09)

The image classification → tSNE → RasterFairy → juxta chain is our new golden hammer, and the next collection to hits were our Maps & Atlases collection. Given that the network was never trained explicitly for the minute differences in maps, it went surprisingly well. And this time we’re allowed to show the result:

Don’t just sit there! Try it yourself

Secure, by which we mean “nearly succeed” (2019-12-10)

There was a meeting with The Right People and it took all of 3½ minute to decide that yes, the large tablet should definitely be displayed in the public areas. Then it took 10 minutes to hammer out the details: The plan is to mount in on wheels and move it around to see where it works best. Progress!

The afternoon was spend trying to make the big screen less easy to hack. It is driven by an Ubuntu 19.10 box using Google Chrome as the browser. As discovered earlier, Chrome has a “kiosk” mode, which disables right click, the address bar and more. Easy!

The real problem was Ubuntu itself: It has tablet support, meaning clever swipe gestures that activates program selection, unmaximizes windows, shows an on screen keyboard and other goodies. Goodies meaning baddies, when trying to build a display kiosk! Most of the solution was to use the Disable Gestures extension (and reboot to get the full disablement), but the on screen keyboard (activated by swiping in from the bottom of the screen) is apparently hard baked into the system (Block Caribou did not help us). We might have to uninstall it completely.

To be continued

Posted in eskildsen, Visualization | Leave a comment

DocValues jump tables in Lucene/Solr 8

Lucene/Solr 8 is about to be released. Among a lot of other things is brings LUCENE-8585, written by your truly with a heap of help from Adrien Grand. LUCENE-8585 introduces jump-tables for DocValues, is all about performance and brings speed-ups ranging from worse than baseline to 1000x, extremely dependent on index and access pattern.

This is a follow-up post to Faster DocValues in Lucene/Solr 7+. The previous post contains an in-depth technical explanation of the DocValues mechanisms, while this post focuses on the final implementation.


Whenever the content of a field is to be used for grouping, faceting, sorting, stats or streaming in Solr (or Elasticsearch or Lucene, where applicable), it is advisable to store it using DocValues. It is also used for document retrieval, depending on setup.

DocValues in Lucene 7: Linked lists

Lucene 7 shifted the API for DocValues from random access to sequential. This meant smaller storage footprint and cleaner code, but also caused the worst case single value lookup to scale linear with document count: Getting the value for a DocValued field from the last document in a segment required a visit to all other value blocks.

The linear access time was not a problem for small indexes or requests for values for a lot of documents, where most blocks needs to be visited anyway. Thus the downside of the change was largely unnoticeable or at least unnoticed. For some setups with larger indexes, it was very noticeable and for some of them it was also noticed. For our netarchive search setup, where each segments has 300M documents, there was a severe performance regression: 5-10x for common interactive use.

Text book optimization: Jump-tables

The Lucene 7 DocValues structure behaves as a linked list of data-nodes, with the specializations that it is build sequentially and that it is never updated after the build has finished. This makes it possible to collect the node offsets in the underlying index data during build and to store an array of these offsets along with the index data.

With the node offsets quickly accessible, worst-case access time for a DocValue entry becomes independent of document count. Of course, there is a lot more to this: See the previously mentioned Faster DocValues in Lucene/Solr 7+ for details.

One interesting detail for jump-tables is that they can be build both as a cache on first access (see LUCENE-8374) and baked into the index-data (see LUCENE-8585). I much preferred having both options available in Lucene, to get instant speed up with existing indexes and technically superior implementation for future indexes. Alas, only LUCENE-8585 was deemed acceptable.

Best case test case

Our netarchive search contains 89 Solr collections, each holding 300M documents in 900GB of index data. Each collection is 1 shard, merged down to 1 segment and never updated. Most fields are DocValues and they are heavily used for faceting, grouping, statistics, streaming exports and document retrieval. The impact of LUCENE-8585 should be significant.

In netarchive search, all collections are searched together using an alias. For the tests below only a single collection was used for practical reasons. There are three contenders:

  1. Unmodified Solr 7 collection, using Solr 8.0.0 RC1. Codename Solr 7.
    In this setup, jump-tables are not active as Solr 8.0.0 RC1, which includes LUCENE-8585, only supports index-time jump-tables. This is the same as Solr 7 behaviour.
  2. Solr 7 collection upgraded to Solr 8, using Solr 8.0.0 RC1. Codename Solr 8r1.
    In this setup, jump-tables are active and baked into the index data. This is the expected future behaviour when Solr 8 is released.
  3. Solr 7 collection, using Lucene/Solr at git commit point 05d728f57a28b9ab83208eda9e98c3b6a51830fc. Codename Solr 7 L8374.
    During LUCENE-8374 (search time jump tables) development, the implementation was committed to master. This was later reverted, but the checkout allow us to see what the performance would have been if this path had been chosen.

Test hardware is a puny 4-core i5 desktop with 16GB of RAM, a 6TB 7200RPM drive and a 1TB SSD. About 9GB of RAM free for disk cache. Due to time constraints only the streaming export test has been done on the spinning drive, the rest is SSD only.

Streaming exports

  • Premise: Solr’s export function is used by us to extract selected fields from the collection, typically to deliver a CSV-file with URLs, MIME types, file sizes etc for a corpus defined by a given filter. It requires DocValues to work.
  • DV-Problem: The current implementation of streaming export in Solr does not retrieve the field values in document order, making the access pattern extremely random. This is absolute worst case for sequential DocValues. Note that SOLR-13013 will hopefully remedy this at some point.

The test performs a streaming export of 4 fields for 52,653 documents in the 300M index. The same export is done 4 times, to measure the impact of caching.

curl '<solr>/export?q=text:hestevogn&sort=id+desc&
run 1
run 2
run 3
Solr 7 spin1705129713521314
Solr 8r1 spin834321
Solr 7 L8374 spin935111
Solr 7 SSD1276125812621262
Solr 8r1 SSD16121
Solr 7 L8374 SSD151

Observation: Both Solr 8r1 and Solr 7 L8374 vastly outperforms Solr 7. On a spinning drive there is a multi-minute penalty for run 1 after which the cache has been warmed. This is a well known phenomenon.


  • Premise: Faceting is used everywhere and it is a hard recommendation to use DocValues for the requested fields.
  • DV-Problem: Filling the counters used when faceting is done in document order, which works well with sequential access as long as the jumps aren’t too long: Small result sets are relatively heavier penalized than large result sets.
Simple term-based searches with top-20 faceting on 6 fields of varying type and cardinality: domain, crawl_year, public_suffix, content_type_norm, status_code and host.

Reading the graphs: All charts in this blog post follows the same recipe:

  • X-axis is hit count (aka result set size), y-axis is response time (lower is better)
  • Hit counts are bucketed by order of magnitude and for each magnitude, boxes are shown for the three contenders: Blue boxes are Solr 7, pink are Solr 8r1 and green are Solr 7 L8374
  • The bottom of a box is the 25 percentile, the top is the 75 percentile. The black line in the middle is the median. Minimum response time for the bucket is the bottom spike, while the top spike is 95 percentile
  • Maximum response times are not shown as they tend to jitter severely due to garbage collection

Observation: Modest gains from jump-tables with both Solr 8rc1 and Solr 7 L8374.
Surprisingly the gains scale with hit count, which should be investigated further.


  • Premise: Grouping is used in netarchive search to collapse multiple harvests of the same URL. As with faceting, using DocValues for grouping fields are highly recommended.
  • DV-Problem: As with faceting, group values are retrieved in document order and follows the same performance/scale logic.
Simple term-based searches with grouping on the high-cardinality (~250M unique values) field url_norm.

Observations: Modest gains from jump-tables, similar to faceting.


  • Premise: Sorting is a basic functionality.
  • DV-Problem: As with faceting and grouping, the values used for sorting are retrieved in document order and follows the same performance/scale logic.

This tests performs simple term-based searches with sorting on the high-cardinality field content_length.

Observations: Modest gains from jump-tables.
Contrary to faceting and grouping, performance for high hit counts are the same for all 3 setups, which fits with the theoretical model.
Positively surprising is that the theoretical overhead of the jump-tables does not show for higher hit counts.

Document retrieval

  • Premise: Content intended for later retrieval can either be stored explicitly or as docValues. Doing both means extra storage, but also means that everything is retrieved from the same stored (and compressed) blocks, minimizing random access to the data. For the netarchive search at the Royal Danish Library we don’t double-store field data and nearly all of the 70 retrievable fields are docValues.
  • DV-Problem: Getting a search result is a multi-step process. Early on, the top-X documents matching the query are calculated and their IDs are collected. After that the IDs are used for retrieving document representations. If this is done from DocValues, it means random access linear to the number of documents requested.
Simple term-based relevance ranked searches for the top-20 matching documents with 9 core fields: id, source_file_s, url_norm, host, domain, content_type_served, content_length, crawl_date and content_language.

Observations: Solid performance improvement with jump-tables.

Production request

  • Premise: The different functionalities are usually requested in combination. At netarchive search a typical request uses grouping, faceting, cardinality counting and top-20 document retrieval.
  • DV-Problem: Combining functionality often means that separate parts of the index data are accessed. This can cause cache thrashing if there is not enough free memory for disk cache. With sequential DocValues, all intermediate blocks needs to be visited, increasing the need for disk cache. Jump-tables lowers the number of storage requests and are thus less reliant on cache size.
Simple term-based relevance ranked searches for the top-20 matching documents, doing grouping, faceting, cardinality and document retrieval as described in the tests above.

Observations: Solid performance improvement with jump tables.
As with the previous analysis of search-time jump tables, utilizing multiple DocValues-using functionality has a cocktail effect where the combined impact is larger than the sum of the parts. This might be due to disk cache thrashing.

Overall observations & conclusions

  • The effect of jump tables, both with Solr 8.0.0 RC1 and LUCENE-8374, is fairly limited; except for export and document retrieval, where the gains are solid.
  • The two different implementations of jump tables performs very similar. Do remember that these tests does not involve index updates at all: As LUCENE-8374 is search-time, it does have a startup penalty when indexes are updated.

For a the large segment index tested above, the positive impact of jump tables is clear. Furthermore there is no significant slow down for higher hit counts with faceting/grouping/statistics, where the jump tables has no positive impact.

Before running these tests, it was my suspicion that the search-time jump tables in LUCENE-8374 would perform better than the baked-in version. This showed not to be the case. As such, my idea of combining the approaches by creating in-memory copies of some of the on-storage jump tables has been shelved.


Performance testing is never complete, it just stops. Some interesting thing to explore could be

  • Spinning drives
  • Concurrent requests
  • Raw search speed with rows=0
  • Smaller corpus
  • Variations of rows, facet.limit and group.limit
  • Kibana and similar data-visualization tools
Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr, Uncategorized | 7 Comments

Faster DocValues in Lucene/Solr 7+

This is a fairly technical post explaining LUCENE-8374 and its implications on Lucene, Solr and (qualified guess) Elasticsearch search and retrieval speed. It is primarily relevant for people with indexes of 100M+ documents.


We have a Solr setup for Netarchive Search at the Royal Danish Library. Below are response times grouped by the magnitude of the hitCount with and without the Lucene patch.


Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

As can be seen, the median response time with the patch is about half that of vanilla Solr. The 95% percentile shows that the outliers has also been markedly reduced.

Long explanation follows as to what the patch does and why indexes with less than 100M documents are not likely to see the same performance boost.

Lucene/Solr (birds eye)

Lucene is a framework for building search engines. Solr is a search engine build using Lucene. Lucene, and thereby Solr, is known as an inverted index, referring to the termsdocuments structure that ensures fast searches in large amounts of material.

As with most things, the truth is a little more complicated. Fast searches are not enough: Quite obviously it also helps to deliver a rich document representation as part of the search. More advanced features are grouping, faceting, statistics, mass exports etc. All of these have in common that they at some point needs to map documentsterms.

Lucene indexes are logically made up of segments containing documents made up of fields containing terms (or numbers/booleans/raws…). Fields can be

  • indexed for searching, which means termsdocuments lookup
  • stored for document retrieval
  • docValues for documentsterms lookup

stored and docValues representations can both be used for building a document representation as part of common search. stored cannot be used for grouping, faceting and similar purposes. The two strengths of stored are

  • Compression, which is most effective for “large” content.
  • Locality, meaning that all the terms for stored fields for a given document are stored together, making is low-cost to retrieve the content for multiple fields.

Whenever grouping, faceting etc. needs the documentsterms mapping, it can either be resolved from docValues, which are build for this exact purpose, or by un-inverting the indexed terms. Un-inversion costs time & memory, so the strong recommendation is to enable docValues for grouping, faceting etc.

DocValues in Lucene/Solr 7+ (technical)

So the mission is to provide a documentsterms (and numbers/booleans/etc) lookup mechanism. In Lucene/Solr 4, 5 & 6 this mechanism had a random access API, meaning that terms could be requested for documents in no particular order. The implementation presented some challenges and from Lucene/Solr 7 this was changed to an iterator API (see LUCENE-7407), meaning that terms must be resolved in increasing document ID order. If the terms are needed for a document with a lower ID that previously requested, a new iterator must be created and the iteration starts from the beginning.

Most of the code for this is available in Lucene70DocValuesProducer and IndexedDISI. Digging into it, the gains from the iterative approach becomes apparent: Besides a very clean implementation with lower risk of errors, the representation is very compact and requires very little heap to access. Indeed, the heap requirement for the search nodes in Netarchive Search at the Royal Danish Library was nearly halved when upgrading from Solr 4 to Solr 7. The compact representation is primarily the work of Adrian Grand in LUCENE-7489 and LUCENE-7589.

When reading the wall of text below, it helps to mentally view the structures as linked lists: To get to a certain point in the list, all the entries between the current entry and the destination entry needs to be visited.

DocValues sparseness and packing

It is often the case that not all documents contains terms for a given field. When this is case, the field is called sparse.

A trivial representation for mapping documentsterms for a field with 0 or 1 long values per document would be an array of long[#documents_in_segment], but this takes up 8 bytes/document, whether the document has a value defined or not.

LUCENE-7489 optimizes sparse values by using indirection: First step is to determine whether a document has a value or not. If it has a value, an index into a value-structure is derived. The second step is to retrieve the value from the value-structure. IndedDISI takes care of the first step:

For each DocValues field, documents are grouped in blocks of 65536 documents. Each block starts with meta-data stating the block-ID and the number of documents in the block that has a value for the field. There are 4 types of blocks:

  • EMPTY: 0 documents in the block has a term.
  • SPARSE: 1-4095 documents in the block has a term.
  • DENSE: 4096-65535 documents in the block has a term.
  • ALL: 65536 documents in the block has a term.

Step 1.1: Block skipping

To determine if a document has a value and what the index of the value is, the following pseudo-code is used:

while (blockIndex < docID/65536) {
  valueIndex += block.documents_with_values_count
  block = seekToBlock(block.nextBlockOffset)
if (!block.hasValue(docID%65536)) {  // No value for docID
valueIndex += block.valueIndex(docID%65536)

Unfortunately it does not scale with index size: At the Netarchive at the Royal Danish Library, we use segments with 300M values (not a common use case), which means that 4,500 blocks must be iterated in the worst case.

Introducing an indexValue cache solves this and the code becomes

valueIndex = valueCache[docID/65536]
block = seekToBlock(offsetCache[docID/65536])
if (!block.hasValue(docID%65536) {  // No value for docID
valueIndex += block.valueIndex(docID%65536)

The while-loop has been removed and getting to the needed block is constant-time.

Step 1.2: Block internals

Determining the value index inside of the block is trivial for EMPTY and ALL blocks. SPARSE is a list of the documentIDs with values that is simply iterated (this could be a binary search). This leaves DENSE, which is the interesting one.

DENSE blocks contains a bit for each of its 65536 documents, represented as a bitmap = long[1024]. Getting the value index is a matter of counting the set bits up to the wanted document ID:

inBlockID = docID%65536
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

This is not as bad as it seems as counting bits in a long is a single processor instruction on modern CPUs. Still, doing 1024 of anything to get a value is a bit much and this worst-case is valid for even small indexes.

This is solved by introducing another cache: rank = char[256] (a char is 16 bytes):

inBlockID = docID%65536
valueIndex = rank[inBlockID/8]
inBlockIndex = inBlockID/8*8
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

Worst-case it reduced to a rank-cache lookup and summing of the bits from 8 longs.

Now that step 1: Value existence and value index has been taken care of, the value itself needs to be resolved.

Step 2: Whole numbers representation

There are different types of values Lucene/Solr: Strings, whole numbers, floating point numbers, booleans and binaries. On top of that a field can be single- or multi-valued. Most of these values are represented in a way that provides direct lookup in Lucene/Solr 7, but whole numbers are special.

In Java whole numbers are represented in a fixed amount of bytes, depending on type: 1 byte for byte, 2 bytes for short or char, 4 bytes for integer and 8 bytes for long. This is often wasteful: The sequence [0, 3, 2, 1] could be represented using only 2 bits/value. The sequence [0, 30000000, 20000000, 10000000] could also be represented using only 2 bits/value if it is known that the greatest common divisor is 10⁷. The list of tricks goes on.

For whole numbers, Lucene/Solr uses both the smallest amount of bits required by PackedInts for a given sequence as well as greatest common divisor and constant offset. These compression techniques works poorly both for very short sequences and for very long ones; LUCENE-7589 splits whole numbers into sequences of 16384 numbers.

Getting the value for a given index is a matter of locating the right block and extracting the value from that block:

while (longBlockIndex < valueIndex/16386) {
  longBlock = seekToLongBlock(longBlock.nextBlockOffset)
value = longBlock.getValue(valueIndex%16386)

This uses the same principle as for value existence and the penalty for iteration is also there: In our 300M documents/segment index, we have 2 numeric fields where most values are present. They have 28,000 blocks each, which must be all be visited in the worst case.

The optimization is the same as for value existence: Introduce a jump table.

longBlock = seekToLongBlock(longJumps[valueIndex/16384))
value = longBlock.getValue(valueIndex%16386)

Value retrieval becomes constant time.

Theoretical observations

  • With a pure iterative approach, performance goes down when segment size goes up and the amount of data to retrieve goes up slower than index size. The performance slowdown only happens after a certain point! As long as the gap between the docIDs is small enough to be within the current or the subsequent data chunk, pure iteration is fine.
    Consequently, the requests that involves lots of monotonically increasing docID lookups (faceting, sorting & grouping for large result sets) fits the iterative API well as they needs data from most data blocks.
    Requests that involves fewer monotonically increasing docID lookups (export & document retrieval for all requests, faceting, sorting & grouping for small result sets) fits poorly as they result in iteration over data blocks that do not provide any other information than a link to the next data block.
  • As all the structures are storage-backed, iterating all data blocks – even when it is just to get a pointer to the next block – means a read request. This is problematic, unless there is plenty of RAM for caching: Besides the direct read-time impact, the docValues structures will hog the disk cache.

With this in mind, it makes sense to check the patch itself for performance regressions with requests for a lot of values as well as test with the disk cache fully warmed and containing the structures that are used. Alas, this has to go on the to-do for now.


Hardware & index

Testing was done against our production Netarchive Search. It consists of 84 collections, accessed as a single collection using Solr’s alias mechanism. Each collection is roughly 300M documents / 900GB of index data optimized to 1 segment, each segment on a separate SSD. Each machine has 384GB of RAM with about 220GB free for disk cache. There are 4 machines, each serving 25 collections (except the last one that only serves 9 at the moment). This means that ~1% of total index size is disk cached.


  • Queries were constructed by extracting terms of varying use from the index and permutating them for simple 1-4 term queries
  • All tests were against the full production index, issued at times when it was not heavily used
  • Queries were issued single-threaded, with no repeat queries
  • All test setups were executed 3 times, with a new set of queries each time
  • The order of patch vs. sans-patch tests was always patch first, to ensure that any difference in patch favour was not due to better disk caching

How to read the charts

All charts are plotted with number of hits on the x-axis and time on the y-axis. The x-axis is logarithmic with the number of hits bucketed by magnitude: First bucket holds all measurements with 1-9 hits, second bucket holds those with 10-99 hits, the third holds those with 100-999 hits and so forth.

The response times are displayed as box plots where

  • Upper whisker is the 95% percentile
  • Top of the box is 75% percentile
  • Black bar is 50% percentile (the median)
  • Bottom of the box is 25% percentile
  • Lower whisker is minimum measured time

Each bucket holds 4 boxes

  • Test run 2, patch enabled
  • Test run 2, vanilla Solr
  • Test run 3, patch enabled
  • Test run 3, vanilla Solr

Test run 1 is discarded to avoid jitter from cache warming. Ideally the boxes from run 3 should be the same as for run 2. However, as the queries are always new and unique, an amount of variation is to be expected.

Important note 1: The Y-axis max-value changes between some of the charts.

Document retrieval

There seems to be some disagreement as to whether the docValues mechanism should ever be used to populate documents, as opposed to using stored. This blog post will only note that docValues are indeed used for this purpose at the Royal Danish Library and let it be up to the reader to seek more information on the matter.

There are about 70 fields in total in Netarchive Search, with the vast majority being docValued String fields. There are 6 numeric DocValued fields.


Retrieval of top-20 documents with all field values

Observation: Response times for patched (blue & green) are markedly lower than vanilla (ping & orange). The difference is fairly independent of hit count, which matches well with the premise that the result set size is constant at 20 documents.


Grouping on the String field url_norm field is used in Netarchive Search to avoid seeing too many duplicates. To remove the pronounced difference caused by document retrieval, only the single field url_norm is requested for only 1 group with 1 document.


Grouping on url_norm

Observation: The medians for patched vs. vanilla are about the same, with a slight edge to patched. The outliers (the top T of the boxes) are higher for vanilla.


Faceting is done for 6 fields of varying cardinality. As with grouping, the effect of document retrieval is sought minimized.


Faceting on fields domain, crawl_year, public_suffix, content_type_norm, status_code, host

Observation: Patched is an improvement over vanilla up to 10M+ hits.


In this test, sorting is done descending on content_length, to locate the largest documents in the index. As with grouping, the effect of document retrieval is sought minimized.


Sorting on content_length

Observation: Patched is a slight improvement over vanilla.


In order to provide an approximate hitCount with grouping, the cardinality of the url_norm field is requested. As with grouping, the effect of document retrieval is sought minimized.

HyperLogLog cardinality on url_norm

HyperLogLog cardinality on url_norm

Observation: Too much jitter to say if patch helps here.

Numeric statistics

Statistics (min, max, average…) on content_length is a common use case in Netarchive Search. As with grouping, the effect of document retrieval is sought minimized.


Numeric statistics on content_length

Observation: Patched is a slight improvement over vanilla.

Cocktail effect, sans document

Combining faceting, grouping, stats and sorting while still minimizing the effect of document retrieval.


Faceting on 6 fields, grouping on url_norm, stats on content_length and sorting on content_length

Observation: Patched is a clear improvement over vanilla.

Production request combination

The SolrWayback front end for Netarchive Search commonly use document retrieval for top-10 results, grouping, cardinality and faceting. This is the same chart as the teaser at the top, with the addition of test run 2.


Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

Observation: Patched is a pronounced improvement over vanilla.

The combination of multiple docValues using request parameters is interesting as the effect of the patch on the whole seems greater than the sum of the individual parts. This could be explained by cache/IO saturation when using vanilla Solr. Whether the cause, this shows that it is important to try and simulate real-world workflows as close as possible.

Overall observations

  • For most of the performance tests, the effect of the LUCENE-8374 patch vs. vanilla is pronounced, but limited in magnitude
  • Besides lowing the median, there seems to be a tendency for the patch to reduce outliers, notably for  grouping
  • For document retrieval, the patch improved performance significantly. Separate experiments shows that export gets a similar speed boost
  • For all the single-feature tests, the active parts of the index data are so small that they are probably cached. Coupled with the limited improvement that the patch gives for these tests, it indicates that the patch will in general have little effect on systems where the full index is is disk cache
  • The performance gains with the “Production request combination” aka the standard requests from our researchers, are very high

Future testing

  • Potential regression for large hit counts
  • Max response times (not just percentile 95)
  • Concurrent requests
  • IO-load during tests
  • Smaller corpus
  • Export/streaming
  • Disk cache influence

Want to try?

There is a patch for Solr trunk at LUCENE-8374 and it needs third party validation from people with large indexes. I’ll port it to any Solr 7.x-version requested and if building Solr is a problem, I can also compile it and put it somewhere.

Hopefully it will be part of Solr at some point.

Update 20181003: Patch overhead and status

Currently the patch is search-time only. Technically is could also be index-time by modifying the codec.

For a single index in the Netarchive Search setup, the patch adds 13 seconds to first search-time and 31MB of heap out of 8GB allocated for the whole Solr. The 13 seconds is in the same ballpark (this is hard to measure) as a single unwarmed search with top-1000 document retrieval.

The patch is ported to Solr 7.3.0 and used in production at the Royal Danish Library. It is a debug-patch, meaning that the individual optimizations can be enabled selectively for easy performance comparison.

See the LUCENE-8374 JIRA-issue for details.

Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr | 1 Comment

Prebuild Big Data Word2Vec dictionaries

                   Prebuild and trained Word2Vec dictionaries ready for use

Two different prebuild big data Word2Vec dictionaries has been added to LOAR (Library Open Access Repository) for download. These dictionaries are build from the text of 55,000 e-books from Project Gutenberg and 32.000.000 Danish newspaper pages.

35.000 of the Gutenberg e-books are English, but over 50 different languages are present in the dictionaries. Even though they are different languages the Word2Vec algorithm did a good job of separating the different languages so it is almost like 50 different Word2Vec dictionaries.

The text from the danish newspapers is not public available so you would not be able to build this dictionary yourself. A total of 300Gb of raw text went into building the dictionary, so it is probably the largest Word2Vec dictionary build on a Danish corpus. Since the danish newspapers suffer from low quality OCR, many of words in the dictionary are misspellings. Using this dictionary it was possible to fix many of the OCR errors due the nature of the Word2Vec algorithm, since a given word appears in similar contexts despite its misspellings and is identified by its context. (see https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/)

Download and more information about the Word2Vec dictionaries:



Online demo of the two corpora: Word2Vec demo







Posted in Uncategorized | Leave a comment

SolrWayback software bundle has been released

The SolrWayback software bundle can be used to search and playback archived webpages in Warc format. It is an out of the box solution with index workflow, Solr and Tomcat webserver and a free text search interface with playback functionality. Just add your Warc to a folder and start the index job.

The search interface has additional features besides freetext search. This includes:

  • Image search similar to google images
  • Search by uploading a file. (image/pdf etc.) See if the resource has been harvested and from where.
  • Link graph showing links (ingoing/outgoing) for domains using the D3 javascript framework.
  • Raw download of any harvested resource from the binary Arc/Warc file.
  • Export a search resultset to a Warc-file. Streaming download, no limit of size of resultset.
  • An optional built in SOCKS proxy can be used to view historical webpages without browser leaking resources from the live web.

See the GitHub page for screenshots of SolrWayback and scroll down to the install guide try it out.

Link: SolrWayback





Posted in Uncategorized | Leave a comment

Visualising Netarchive Harvests


An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a recent frontend visualisation widget we have made.

From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days to enable us to visualise the data. Futhermore the data is given an activity level from 0-4.

The high-level overview seen below is the year-month graph. Each cell is coloured based on the activity level relative to the number of harvests in the most active month. For now we use a linear calculation so gray means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100% of the most active month. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.



We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.



In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011, for this blog post).



There are both advantages and disadvantages to using the browser-based data calculation. One of the main advantages is a very portable frontend application. It can be used with any backend application that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. Drawbacks to this approach is, of course, the scalability. Currently the application processes 25.000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and it is included in a single script tag, including styles.

Ideas for further development.

We would like to expand this to also include both:

  1. multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time, e.g. the URL http://pol.dk, http://www.pol.dk and https://politiken.dk could be used for the danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.
Posted in Blogging, Solr, Web | Tagged , | Leave a comment

SolrWayback Machine

Another ‘google innovation week’ at work has produced the SolrWayback Machine. It works similar to the Internet Archive: Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (Warc files).  The Danish Internet Archive has over 20billion harvested web objects and takes 1 Petabyte of storage.

The SolrWayback engine require you have indexed the Warc files using the Warc-indexer tool from British Library. (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer).

It is quite fast and comes with some additional features as well:

  •  Image search similar to google images
  •  Link graphs showing  links (ingoing/outgoing) for domains using the D3 javascript framework.
  •  Raw download of any harvested resource from the binary Arc/Warc file.

Unfortunately  the collection is not available for the public so I can not show you the demo. But here is a few pictures from the SolrWayback machine.



SolrWayback at GitHub: https://github.com/netarchivesuite/solrwayback/

Posted in Uncategorized | 1 Comment

juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives:

Combining those two ideas seemed like a logical next step and juxta was born: A fairly small bash-script for creating million-scale collages of images, with no special server side.  There’s a small (just 1000 images) demo at SBLabs.

Presentation principle

The goal is to provide a seamless transition from the full collection to individual items, making it possible to compare nearby items with each other and locate interesting ones. Contextual metadata should be provided for general information and provenance.

Concretely, the user is presented with all images at once and can zoom in to individual images in full size. Beyond a given threshold, metadata are show for the image currently under the cursor, or finger if a mobile device is used. An image description is displayed just below the focused image, to avoid disturbing the view. A link to the source of the image is provided on top.


Overview of historical maps


Meta-data for a specific map

Technical notes, mostly on scaling

On the display side, OpenSeadragon takes care of the nice zooming. When the user moves the focus, a tiny bit of JavaScript spatial math resolves image identity and visual boundaries.

OpenSeadragon uses pyramid tiles for display and supports the Deep Zoom protocol can be implemented using only static files. The image to display is made up of tiles of (typically) 256×256 pixels. When the view is fully zoomed, only the tiles within the viewport are requested. When the user zooms out, the tiles from the level above are used. The level above is half the width and half the height and is thus represented by ¼ the amount of tiles. And so forth.

Generating tiles is heavy

A direct way of creating the tiles is

  1. Create one large image of the full collage (ImageMagick’s montage is good for this)
  2. Generate tiles for the image
  3. Scale the image down to 50%×50%
  4. If the image is larger than 1×1 pixel then goto 2

Unfortunately this does not scale particularly well. Depending on size and tools, it can take up terabytes of temporary disk space to create the full collage image.

By introducing a size constraint, juxta removes this step: All individual source images are scaled & padded to have the exact same size. The width and height of the images are exact multiples of 256. Then the tiles can be created by

  1. For each individual source image, scale, pad and split the image directly into tiles
  2. Create the tiles at the level above individually by joining the corresponding 4 tiles below and scale to 50%×50% size
  3. If there are more than 1 tile or that tile is larger than 1×1 pixel then goto 2

As the tiles are generated directly from either source images or other tiles, there is no temporary storage overhead. As each source image and each tile are processed individually, it is simple to do parallel processing.

Metadata takes up space too

Displaying image-specific metadata is simple when there are just a few thousand images: Use an in-memory array of Strings to hold the metadata and fetch it directly from there. But when the number of images goes into the millions, this quickly becomes unwieldy.

juxta groups the images spatially in buckets of 50×50 images. The metadata for all the images in a bucket are stored in the same file. When the user moved the focus to a new image, the relevant bucket is fetched from the server and the metadata are extracted. A bucket cache is used to minimize repeat calls.

Most file systems don’t like to hold a lot of files in the same folder

While the limits differ, common file systems such as ext, hfs & ntfs all experience performance degradation with high numbers of files in the same folder.

The Deep Zoom protocol in conjunction with file-based tiles means that the amount of files at the deepest zoom level is linear to the number of source images. If there are 1 million source images, with full-zoom size 512×512 pixels (2×2 tiles), the number of files in a single folder will be 2*2*1M = 4 million. Far beyond the comfort-zone fo the mentioned file systems (see the juxta readme for tests of performance degradation).

juxta mitigates this by bucketing tiles in sub-folders. This ensures linear scaling of build time at least up to 5-10 million images. 100 million+ images would likely deteriorate build performance markedly, but at that point we are also entering “is there enough free inodes on the file system?” territory.

Unfortunately the bucketing of the tile files is not in the Deep Zoom standard. With OpenSeadragon, it is very easy to change the mapping, but it might be more difficult for other Deep Zoom-expecting tools.

Some numbers

Using a fairly modern i5 desktop and 3 threads, generating a collage of 280 5MPixel images, scaled down to 1024×768 pixels (4×3 tiles) took 88 seconds or about 3 images/second. Repeating the experiment with a down-scale to 256×256 pixels (smallest possible size) raised the speed to about 7½ image/second.

juxta comes with a scale-testing script that generates sample images that are close (but not equal) to the wanted size and repeats them for the collage. With this near-ideal match, processing speed was 5½ images/second for 4×3 tiles and 33 images/second for 1×1 tiles.

The scale-test script has been used up to 5 million images, with processing time practically linear to the number of images. At 33 images/second that is 42 hours.

Posted in Uncategorized | Leave a comment

Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for Digital Humanities in the Nordic Countries 2nd Conference, http://dhn2017.eu/

In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] uses OCR (optical character recognition) from scanned pages but often the software converting the scanned images to text makes reading errors. As a result the search engine will miss matching words due to OCR error. Since many of our newspapers are old and the scans/microfilms is also low quality, the resulting OCR constitutes a substantial problem. In addition, the OCR converter performs poorly with old font types such as fraktur.

One way to find OCR errors is by using the unsupervised Word2Vec[3] learning algorithm. This algorithm identifies words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words synonyms, conjugations, declensions etc. In the case of a corpus with OCR errors the Word2Vec algorithm will find the misspellings of a given word either from bad OCR or in some cases journalists. A given word appears in similar contexts despite its misspellings and is identified by its context. For this to work the Word2Vec algorithm requires a huge corpus and for the newspapers we had 140GB of raw text.

Given the words returned by Word2Vec we use a Danish dictionary to remove the same word in different grammatical forms. The remaining words are filtered by a similarity measure using an extended version of Levenshtein distance taking the length of the word and an idempotent normalization taking frequent one and two character OCR errors into account.

Example: Let’s say you use the Word2Vec to find words for banana and it returns: hanana, bananas, apple, orange. Remove bananas using the (English) dictionary since this is not an OCR error. For the three remaining words only hanana is close to banana and it is thus the only misspelling of banana found in this example. The Word2Vec algorithm does not know how a words is spelled/misspelled, it only uses the semantic and syntactic context.

This method is not an automatic OCR error corrector and cannot output the corrected OCR. But when searching it will appear as if you are searching in an OCR corrected text corpus. Single word searches on the full corpus gives an increase from 3% to 20% in the number of results returned. Preliminary tests on the full corpus shows only relative few false positives among the additional results returned, thus increasing recall substantially without a decline in precision.

The advantage of this approach is a quick win with minimum impact on a search engine [2] based on low quality OCR. The algorithm generates a text file with synonyms that can be used by the search engine. Not only single words but also phrase search with highlighting works out of the box. An OCR correction demo[4] using Word2Vec on the Danish newspaper corpus is available on the Labs[5] pages of The State And University Library, Denmark.

[1] Mediestream, The Danish digitized newspaper archive.

[2] SOLR or Elasticsearch etc.

[3] Mikolov et al., Efficient Estimation of Word Representations in Vector Space

[4] OCR error detection demo (change word parameter in URL)

[5] Labs for State And University Library, Denmark


Posted in Uncategorized | 2 Comments