Sparse facet counting without the downsides

April 4, 2014

The SOLR-5894 issue “Speed up high-cardinality facets with sparse counters” is close to being functionally complete (facet.method=fcs and facet.sort=index still pending). This post explains the different tricks used in the implementation and their impact on performance.

Baseline

Most of the different Solr faceting methods (fc & fcs; with and without doc-values; single- and multi-value) use the same overall principle for counting tag occurrences in facets:

  1. Allocate one or more counters of total size #unique_tags
  2. Fill the counters by iterating a hit queue (normally a bitmap) and getting corresponding counter indexes from a mapper
  3. Extract top-x tags with highest count by iterating all counters

There are three problems with this three-step process: allocation of a (potentially large) structure in memory, iteration of a bitmap with #total_documents entries and iteration of a counter structure with #unique_tags entries. Ideally there would be no allocation, iteration of just the IDs of the matched documents and iteration of just the tag counters that were updated. Sparse facet counting solves 2 out of the 3 problems.

Sparse

In this context sparse is seen as performance-enhancing, not space-reducing. SOLR-5894 solves the extraction time problem by keeping track of which counters are updated. With this information, the extraction process no longer needs to visit all counters.  A detailed explanation can be found at fast-faceting-with-high-cardinality-and-small-result-set. However, there are some peculiarities to sparse tracking that must be considered.

Processing overhead

Naive sparse faceting

The black line is Solr field faceting on a multi-valued field (3 values/document), the red line is the sparse implementation on the same field. When the result set is small, sparse processing time is markedly lower than standard, but when the result set is > 10% of all documents, it becomes slower. When the result set reaches 50%, sparse takes twice as long as standard.

This makes sense when one considers that both updating and extracting a single counter carry more processing overhead for sparse: as the number of hits rises, the accumulated overhead adds up.

Maximum sparse size

Okay, so tracking does not make much sense past a given point. Besides, having a tracker the size of the counters themselves (100% overhead) seems a bit wasteful. Fixing the tracker size to the cross-over-point is the way to go. We choose 8% here. Thanks to the beauty of the tracking mechanism, exceeding the tracker capacity does not invalidate the collected results, it just means a logical switch to non-track-mode.

8% tracker

No doubt about where the sparse counter switches to non-sparse mode. Note how the distance from Solr standard (black line) to sparse with tracker-overflow (red line past 8%) is near-constant: Up until 8% there is an overhead for updating the tracker. When the tracker has overflowed that overhead disappears for the rest of the counter updates, but the cycles used for tracking up to that point are wasted.

Selection strategy

So memory overhead was reduced to 8% and performance got better for the very high hit counts, but it was still quite a bit worse than standard Solr. If only we could foresee whether the sparse tracker would overflow or not.

We cannot determine with certainty whether the tracker will be blown or not (at least not in the general case), but we can guess. Under the assumption that the references from documents to tags are fairly uniformly distributed, we can use the hit count (which we know when we start the facet calculation) to guess whether the number of updated tag counters will exceed the tracker capacity.
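As a rough sketch in Java (with illustrative names; the actual SOLR-5894 heuristic may differ), such a guess could look like this:

/**
 * Guesses whether sparse tracking will pay off for a search, assuming that
 * references from documents to tags are fairly uniformly distributed.
 * Illustrative sketch only, not the actual SOLR-5894 code.
 */
public final class SparseGuesser {
  private final long totalDocs;        // documents in the index
  private final long totalReferences;  // total doc->tag references (e.g. 3 per document)
  private final long trackerCapacity;  // e.g. 8% of #unique_tags

  public SparseGuesser(long totalDocs, long totalReferences, long trackerCapacity) {
    this.totalDocs = totalDocs;
    this.totalReferences = totalReferences;
    this.trackerCapacity = trackerCapacity;
  }

  /** True if the expected number of touched counters fits within the tracker. */
  public boolean useSparse(long hitCount) {
    // With uniformly distributed references, the hit count scales directly to an
    // expected number of counter updates (an upper bound on the touched counters).
    double expectedUpdates = (double) hitCount / totalDocs * totalReferences;
    return expectedUpdates <= trackerCapacity;
  }
}

If the guess says no, the faceting call simply falls back to standard non-sparse counting from the start, so the wasted tracking cycles described above are never spent.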

Sparse bad guesses for cut-off

The chart demonstrates how bad guessing of the result size affects performance. The conservative guess (red line) means that many of the faceting calls are done by falling back to standard Solr and that the sparse speed-up is wasted. The optimistic guess (cyan line) has a higher risk of failed sparse-attempts, which means bad performance. In this example, the bad guess was around 10%. Still, even with such hiccups, the overall positive effect of using sparse counting is clear.

Good guessing

The best cut-off point for sparse/non-sparse attempt depends on the corpus and the searches, as well as the willingness to risk increased response times. Properly tuned and with a corpus without extreme outliers (such as a single very popular document referencing 10% of all tags), the result will be very satisfying.

Sparse good guess

For the low price of 8% memory overhead we get much better performance for small result sets and no penalty for larger result sets (under the assumption of correct guessing).

Counter allocation

A little instrumentation makes it clear that it is by no means free to allocate a new counter structure with each facet call and throw it away after use. In the example above, 5M*3*4 byte = 60MB are used for a counter. With a 2GB heap and an otherwise non-tweaked execution of Solr’s start.jar, the average time used to allocate the 60MB was 13ms!

An alternative strategy is to keep a pool of counters and re-use them. This means that counters must be cleared after use, but this is markedly faster than allocating new ones. Furthermore this can be done by a background thread, so that the client can get the response immediately after the extraction phase. Enabling this, the picture gets even better.
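A rough sketch of such a pool in Java (assuming plain int[] counters; the actual SOLR-5894 code is more elaborate, but the principle of reusing counters and clearing them off the response path is the same):

import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Pool of counter arrays, cleared in the background after use (sketch only). */
public final class CounterPool {
  private final BlockingQueue<int[]> ready;
  private final ExecutorService cleaner = Executors.newSingleThreadExecutor();

  public CounterPool(int poolSize, int counterSize) {
    ready = new ArrayBlockingQueue<>(poolSize);
    for (int i = 0; i < poolSize; i++) {
      ready.add(new int[counterSize]); // allocated once, reused for every facet call
    }
  }

  /** Blocks until a cleared counter is available. */
  public int[] acquire() throws InterruptedException {
    return ready.take();
  }

  /** Hands the counter back; clearing happens on a background thread. */
  public void release(final int[] counters) {
    cleaner.execute(new Runnable() {
      public void run() {
        Arrays.fill(counters, 0);
        ready.offer(counters);
      }
    });
  }
}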

Sparse everything

For very small result sets there is virtually no performance penalty for faceting.

Sparse facet counting on a web archive

March 20, 2014

This post is a follow-up to Sparse facet counting on a real index. Where the last post explored using a sparse counter for faceting on author in Statsbiblioteket’s index of library material, this post will focus on faceting on url for our test index of archived web pages.

The index and the url field

The test index is 1TB (or 959 GB to be precise) with 420 million documents. The documents are made from harvested web pages and linked files of all types. Tika was used for the extraction of meta data. All documents have exactly one URL. Some documents are harvested multiple times, so there are only 315 million unique URLs instead of 420 million. The most popular URL occurs 3303 times. The index is optimized down to a single segment, but this has very little impact on tag collection speed.

Sparse counting

As described in Fast faceting with high cardinality and small result set, adding an overhead of 1/40 to the counting structure allows for faster counting with small result sets and graceful fallback for larger result sets. Using 32 bit integers as counters, this means 4*315 MByte for the counters themselves + 7 MByte for the sparse tracker. Including fluff, this turns out to be 1228 MByte for our test corpus.

Sparse packed counting

But wait! We know that there are at most 3303 occurrences of any single tag, so why use 32 bit integers as counters? 2^12 = 4096: We only need 12 bits for each. PackedInts to the rescue and we only need 506 MByte for a full counter structure. Less than half of the non-packed version.
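A sketch of what that could look like with Lucene’s PackedInts (Lucene 4.x API; the exact wiring in the patch may differ):

import org.apache.lucene.util.packed.PackedInts;

/** Allocates a packed counter structure for 315M URLs with 12 bit counters (sketch). */
public class PackedCounters {
  public static void main(String[] args) {
    int uniqueURLs = 315000000;
    int bitsPerValue = PackedInts.bitsRequired(3303);  // 12 bits covers the maximum count
    PackedInts.Mutable counters =
        PackedInts.getMutable(uniqueURLs, bitsPerValue, PackedInts.COMPACT);

    // Counting works just like with an int[], only through get/set
    int tagID = 42; // illustrative
    counters.set(tagID, counters.get(tagID) + 1);
    System.out.println("Bits per counter: " + bitsPerValue);
  }
}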

Testing methodology

As this test is about the difference between non-sparse and sparse facet counting, the test was executed twice and only the results from the second run were collected. This ensured that the I/O cache was warmed so that the impact of the term-matching searches and document retrieval was minimized. As we have no logged queries, random words from the Danish iSpell dictionary were used. Only the results with hits were collected. Queries were issued sequentially with a 10ms wait between each. Times are full response times, measured through JMeter. In general, each query rarely returned more than 100 hits (which shows that we need better queries than random words).

Results

Exposed faceting on 1TB web archive

The very low spread is due to the searcher being fully warmed for the used queries, which also explains the 5 ms response times for non-faceted queries. Plain exposed faceting takes half a second, because all 315 million counters are iterated for each extraction. On the other hand, the small result sets mean that sparse and packed only need to process a few hundred entries. Notice that the performance of packed is practically identical to sparse.

Should anyone wonder: Unwarmed non-faceted searches for one or two random words varied between 10 and 1000 ms for this index.

Hold your horses

This all looks fantastic, but do remember that the hits/corpus_size ratio is extreme here. Our previous test with our library index is a better indicator of what can be expected and that “only” showed double the speed when using sparse.

Sparse facet counting on a real index

March 18, 2014

It was time for a little (nearly) real-world testing of a sparse facet counter for Lucene/Solr (see Fast faceting with high cardinality and small result set for details). The first results are both very promising and somewhat puzzling.

The corpus was a copy of our production 50GB, 11M document index with library resources. The queries were taken randomly from the daily query log. Faceting was limited to just the author field, which has 7M unique values. The searcher was warmed up with hundreds of queries before testing. The tester ran with 2 threads against a 4 core machine and 500 searches were performed for each implementation.

Solr vs. exposed vs. sparse

In the graph, solr is standard Solr field faceting, exposed is our home brew (SOLR-2412) and sparse is our experimental home brew with sparse counting for small result sets. The red horizontal lines represent quartiles, with the max being replaced by the 95th percentile for better graph scale. The horizontal black lines are medians.

The promising part is that the sparse counter has a much better median (16ms) than both solr (32ms) and exposed (29ms). Looking at the returned results, it seems clear that the vast majority of the queries only hit a fairly small part of the index, which benefits the sparse implementation. As they are real world queries, this is good news.

The performance of Solr field faceting and exposed is fairly similar, which is not very surprising as they work quite alike in this scenario. What is puzzling is that the maximum response time for both exposed and sparse is higher than solr’s. The slowest response times not shown are 458ms for “have” with solr, 780ms for “to be or not to be” with exposed and 570ms for “new perspectives on native north america” with sparse. More testing is needed to determine if these are fluke results or if there is a general problem with worse outliers for exposed and sparse.

Update 20140320

Randomizing queries makes for great experimentation but poor comparisons of implementations. Fixing the order and number of queries tested (29086, should anyone wonder) resulted in measurements without the strange outliers. The measurements were done in the order exposed, sparse, packed, solr and nofacet. Maximum response times were a little above 2 seconds for all the facet calls and in all cases caused by the query ‘a’.

Sparse faceting, fixed query order and count

Update 20140321

Introducing the JIRA issue SOLR-5894, with an accompanying patch for Solr 4.6.1. The patch only handles field cache based faceting on multi-valued fields right now, but that limitation was mainly to keep things simple. The sparse code is non-invasive and fits well within the way Solr performs field faceting. A quick experiment with 1852 fully warmed queries gave this:

author_7M_tags_1852_logged_queries_warmed

Update 20140324

Whoops. Forgot to include the baseline no-facets numbers. This changes the picture quite a bit. With a baseline median of 12 ms, the sparse faceting overhead is only (24 ms – 12 ms) = 12 ms and the non-sparse overhead is (36 ms – 12 ms) = 24 ms, which suspiciously (I triple checked the numbers) makes the sparse faceting overhead exactly half of non-sparse.

author_7M_tags_1852_logged_queries_warmed_with_base

Fast faceting with high cardinality and small result set

March 17, 2014

This is a follow-up to the idea presented  more than a year ago at http://sbdevel.wordpress.com/2013/01/23/large-facet-corpus-small-result-set/. It can be read independently of the old post.

The premise is simple: We have a Lucene/Solr index and we want to do some faceting. The facet field has high cardinality, which is a fancy way of saying that it has a lot of unique values. “A lot” is tens or hundreds of millions.

Old style counting

Ignoring all the little detail devils, getting the top 10 tags in a facet (sorting by count) is normally done like this:

// Init
counts = [#unique_tags]
// Search & collect
for each docID in hits {
  for each tagID in docID {
    counts[tagID]++
  }
}
// Extract
topTen = []
for tagID 0 ... #unique_tags {
  if (counts[tagID] > min(topTen)) {
   remove min(topTen) from topTen
   insert (tagID, counts[tagID]) in topTen
  }
}
// Result
create result from topTen
// Cleanup
for tagID 0 ... #unique_tags {
  counts[tagID] = 0
}

The init-part and the cleanup-part differ between implementations. Solr lets the JVM handle it by allocating new structures in the init-phase and dereferencing them in the cleanup-phase so that the garbage collector takes them. Our home brew SOLR-2412 caches the counters, which requires a cleanup after each run but has very stable memory and GC impact.

Notice how the extraction-phase and the cleanup-phase iterate all the tagIDs for the whole corpus, even if the result set itself is tiny? That is not very efficient. If we knew the number of unique tags for the result set in advance, we could select between e.g. the big counter array and a small hash set for keeping track of the tags, but we do not have that luxury.

Track the counters

With Using Uninitialized Memory for Fun and Profit in mind, we create a new counter that is efficient for small result sets and with little speed penalty for large result sets. It is not too complicated:

TagCounterSparse {
  counts = [#unique_tags]
  updates = [#unique_tags / fraction]
  updatePointer = 0

  collect(tagID) {
    if (counts[tagID]++ == 0 && updatePointer != updates.length) {
      updates[updatePointer++] = tagID
    }
  }

  extract {
    topTen = []
    if (updatePointer != updates.length) {
      for each tagID in updates {
        if (counts[tagID] > min(topTen)) {
          remove min(topTen) from topTen
          insert (tagID, counts[tagID]) in topTen 
        }
      }
    } else {
      for each tagID in counts {
        if (counts[tagID] > min(topTen)) {
          remove min(topTen) from topTen
          insert (tagID, counts[tagID]) in topTen
        }
      }
    }
  }

  clear {
    if (updatePointer != updates.length) {
      for each tagID in updates {
        counts[tagID] = 0
      }
    } else {
      for each tagID in counts {
        counts[tagID] = 0
      }
    }
    updatePointer = 0
  }
}

What happens here is that a counter tracker, updates, is introduced. When the count for a previously unseen tagID is increased, the tracker stores that tagID. If too many unique tagIDs are added, the tracker stops keeping track.

Extraction of top ten tags can be done in two ways. If there were too many unique tags in the result set, they are extracted exactly like the old implementation. If the result set was small enough, only the counts for the collected tagIDs are accessed. For really small result sets, this is blazingly fast.

Clearing is similar to extraction. Either it happens the old way or only the collected tagIDs are touched.

Considerations and gotchas

The sparse tag counter adds another layer of indirection, so if the result set is the whole corpus and the updates tracker is the same size as counts, all the phases will be slower than with the old solution. We need to find out how large a part of counts we should keep track of. This is the fraction.

Another consideration is that the old iterate-all-tagIDs approach had the very nice property of accessing memory sequentially. The sparse solution does a random access for each collected tagID, which is a lot slower. This heavily influences what the fraction should be.

Measure twice

An artificial Solr index was created. It had 20M documents and a single-value field with 20M unique values. Searches were performed with result sets consisting of every 2nd document, every 5th document, every 10th and so on. For each search the faceting time spent on collecting, extracting and clearing was measured. The first results for each unique search were discarded and all searches were repeated 5 times with the best results being extracted. All times are in milliseconds.

Old implementation

Every  Collect  Extract  Clear  Total
    2      527      153     12    692
    5      215       84     12    311
   10      112       56     12    180
   20       62       40     12    114
   30       44       33     12     89
   40       36       31     11     78
   50       29       30     12     71
  100       15       27     12     54
  200        8       25     12     45
  500        4       24     12     40
 1000        2       24     12     38
 5000        1       23     12     36

Notice how the extract time converges towards 20ms and the clear time is constant.

Sparse (#updates = #counts)

Every  Collect  Extract  Clear  Total
    2      544      453    160  1157
    5      221      183     63   467
   10      114       92     32   238
   20       63       46     16   125
   30       46       31     10    87
   40       38       23      8    69
   50       30       19      6    55
  100       15       10      3    28
  200        9        5      1    15
  500        3        1      0     4
 1000        1        0      0     1
 5000        1        0      0     1

Notice that the collect time is only a little worse than the old collect time, that the extract time is markedly better when the result set is every 40th document or less and that clear is also markedly faster for every 40th document or less.

This suggests that the fraction should be 40 for this corpus, which is artificial. Your mileage will most certainly vary, but for this test we set it to 40.

Sparse (#updates = #counts/40)

Every  Collect  Extract  Clear  Total
    2      545      151     12    708
    5      227       86     13    326
   10      118       56     13    187
   20       65       40     12    117
   30       47       35     13     95
   40       40       24      8     72
   50       31       19      6     56
  100       18       10      3     31
  200        9        5      2     16
  500        2        2      0      4
 1000        1        0      0      1
 5000        0        0      0      0

Notice how the extraction and clear times are the same as for the old implementation for result sets with every 2nd, 5th, 10th, 20th and 30th document. After that, extraction and clear times match the sparse with #updates = #counts. It is the best of both worlds.

Conclusion

For this test, using a sparse tag collector, taking 1/40 more RAM than a standard collector, results in increasingly better relative performance the smaller the result set gets. The speed converges towards instant response. This comes at the price of slightly worse performance for large result sets.

Future

Right now this is just a proof of concept that resides in our local Summa-project (look for TestSingleSegmentOptimization.java). It needs to be tested on real world corpora and with logged queries before any real patches for Solr are to be considered. It will end up on SOLR-2412 though, when I find the time to upgrade it to Lucene/Solr 4.7.

SCAPE All-Staff Meeting 2014 in Póvoa de Varzim, Portugal

February 21, 2014

People in our own department at the State and University Library have complained that we don’t talk enough, tell enough or write enough about this SCAPE project we are working on, so here goes. I won’t tell you about the SCAPE project from the beginning. SCAPE stands for SCAlable Preservation Environments, and it has a project website where you can read newsletters and deliverables and about the tools used and developed in SCAPE. What I will tell you about is the All-Staff Meeting 2014 in Póvoa de Varzim, Portugal. We are four SCAPE’rs from SB at this meeting. We had a long trip, over 10 hours, and we arrived after midnight Sunday, so I guess that would be Monday morning. The meeting started at 9 with General Session I: Welcome & Reports, which was more interesting than it sounds. It included introductions of new partners and status reports from the different subprojects and the TC. After the coffee break, we had General Session II: Product Elevator Pitches. There were some really cool talks about the SCAPE CloudManager, about Sharing SCAPE software through virtualisation (using Vagrant) and a lot more. My favourite was of course Asger’s talk about SB Repository Integration:

SB-ABR-ElevatorPith

And now I’ll try to speed things up a bit. We were staying at the Axis Vermar Conference & Beach Hotel. There were good lunches and a beach with magnificent waves, and a lot of parallel meetings with a lot of product presentations, some discussions and some road maps. After lunch I went to the Testbeds meeting, which concluded that all the Testbed partners have

  • Hadoop test platforms
  • Large scale test data sets (at least 1TB)
  • Test workflows

Coming up:

  • Finish development of workflows (preferably yesterday)
  • Performing experiments and evaluations
  • Writing blog posts and deliverables
  • And then there are the on-site demos to be announced in April and held in May

On Tuesday we had General Session III: Productisation. In this session I was involved in the xcorrSound discussion. We mostly discussed “Entry Points”, such as microsites, Github source, SCAPE-Project website, blogs and demo pages. We discussed the type/level of information they contain (i.e. what information different users are interested in), and especially that some people may search via use case and want to know what tools will be useful rather than search for tools via their functionality. We decided to use microsites as the central point of contact/information about a tool, and we should ensure microsites have a consistent look and feel. And we decided contact information is important, and I am now listed as maintainer of xcorrSound (I already was in some places; I will update the others). We also talked about improvements and extensions to the xcorrSound demo page and a reference installation. After this I was at the PW (Planning and Watch) meeting. Then the PC.QA (Preservation Components / Quality Assurance) meeting. My work in this work package is also centered around xcorrSound. We mostly talked about the road plan, which looks roughly like this:

  • Checkpoint: Implementation of QA workflows (Hadoop version; done)
  • Checkpoint: Implementation and release packaging (Tool packaging; done)
  • Milestone: Release 3 benchmarking and validation tests (Todo)
  • Deliverable: Knowledge base of characterisation and migration errors (KB)
  • Deliverable: Quality Assurance Workflow, Release 3 + Release Report (Todo)

After this we had Product Forum I – Scalability with more interesting presentations. And then the SB group went to a nice fish restaurant by the beach. Wednesday morning there were more meetings, but none in which I was required, so I went for a walk on the beach and collected seashells (and got sand in my shoes). After lunch, we had the Product Forum II – Tools session, in which I had a presentation/demo on xcorrSound.

xcorrSoundDemoAllStaffFebruary2014V2

This will also be included in the upcoming in house demos at SB. The last session was “Reporting back and closure” and everybody was tired. The most interesting for me was from the TCC (Technical Coordination Committee) meeting: there will be a white paper on how we get from preservation plan to execution environment in SCAPE (some time). And there will be a SCAPE workshop and all-staff event in connection with the DL2014 conference in London in September. And now I better pack and get ready for going home. We are leaving at 10am Thursday, and we will probably be home around 1am Friday…

Watch the watchers when using AngularJS

February 12, 2014

The last couple of months we have been improving and porting our library search interface frontend to AngularJS. However great Angular’s two-way bindings are, they can greatly impact performance, and lately I have been looking into how we use them in the beta version of our search frontend, which is our first big AngularJS web app.

In this process I used the Batarang debug tool in Chrome to see what was going on in terms of registered watchers and performance and it quickly pointed out two potential problems.

1. Our translation plugin generates tons of watchers
First off, it was noticeable that a large chunk of the registered watchers were attached to our translation plugin, Angular-translate. Every time we use the translate directive a watcher is attached, and with hundreds of bits of translated text on a page the watcher count quickly climbs. This behavior is by design, as it is possible to change the language at run-time. In our case we do a page refresh when the language is toggled and very few labels are toggled at run-time, so this seemed like a good place to optimize.

As the current version of Angular-translate does not have native support for unregistering watchers, I looked at solutions like Bindonce.

Bindonce provides a set of attributes that allow an initial binding; after that your variable is “dead” and will not update on model change. Initial testing in Batarang with the Bindonce approach of course showed a significant decrease in watchers and thereby an increase in performance, and best of all the app visually behaved exactly the same as before. The only drawback with the Bindonce plugin is that the templates need to be rewritten and the new code is not as clean as the old.
An example of the current approach (translate used as directive):

<div translate='TRANSLATION.KEY'></div>

New Bindonce approach (translate used as filter):

<div bindonce bo-text="'TRANSLATION.KEY' | translate"></div>

Although this solves the performance issues we have with the translate module, I would rather see a ‘stop watching’ functionality built into the plugin. Fortunately a lot of work is currently being put into this issue and it seems that the next release of angular-translate (1.2.0) will address this.

2. Unnecessary use of Angular watchers
The next performance issue related to watchers was our use of data bindings in templates. Every time you write {{foo}} you create a two-way binding in AngularJS and ‘foo’ is therefore watched. This is of course one of the core functionalities of the framework, but you need to be aware that the watchers come with a performance penalty, especially when the number of watchers grows. Every time $apply is called, directly or implicitly, it will force a $digest cycle in which the watch-expressions are evaluated.

In our app the Batarang tool revealed that besides the translation issues a lot of watchers were registered to links in our faceting functionality. Every facet contains a link where the href is generated in the following way in the template:

<a href='{{searchURL}}?query={{query}}{{currentFilter}}&filter={{facet.facetname}}:{{tag.name}}'>{{foo}}</a>

Each link has several data-bindings through {{}} and we have a lot of these links on the page. That is a lot of watchers. As these links do not change unless the template is run again, they do not need to be watched, and there would be a performance gain from creating these links without the watcher overhead. One way to do it would be to use a directive instead to generate the href:
In template:

<a href="" facet>{{foo}}</a>

Directive:

.directive('facet', [function() {
  return {
    link: function(scope, elem, attr) {
      var href = ''; /** Do your magic here **/
      attr.$set('href', href);
    }
  };
}]);

This significantly cuts down the amount of watcher expressions.

Another way around it would be to utilise the Bindonce plugin and do something like this:

<a bindonce bo-href="searchURL + '?query=' + query + currentFilter + '&filter=' + facet.facetname + ':' + tag.name">{{foo}}</a>

This will give you zero watchers attached to the link. Not a particularly clean solution but a very nice and “free” performance enhancement as the watchers aren’t needed in this case. We could even optimize further by getting rid of the {{foo}} watcher by converting it to a Bindonce text attribute:

<a bindonce bo-href="searchURL + '?query=' + query + currentFilter + '&filter=' + facet.facetname + ':' + tag.name" bo-text="foo"></a>

As we dig deeper in the app I’m sure that even more unused watchers will turn up and be dealt with.

Closing
I know that there will be other, even smarter approaches to the above examples, other plugins you can use to deal with watchers, or you could even brew your own directives to handle it, but the main point remains intact: watch the watchers, and that also means investigating what your plugins are up to. Batarang is a good tool for this. Especially as a novice in AngularJS you have to consider how and when to use the powerful two-way binding options. Don’t be afraid of them, just use them with care. Don’t let them pile up and only use them when required. If used excessively they can ruin performance and render your Angular app very sluggish on less powerful clients. Here we are certainly learning this the hard way as we build our first large Angular app.

Using lineman and maven together

February 9, 2014

tl;dr: Want to use lineman and maven together? Get the lineman-maven-plugin.

At the State and University Library we have traditionally been using Java, JSP and related technologies for our web frontend development. Of course with a healthy dose of javascript in there as well. Our build tool has moved from ant to maven, but as our use of javascript became more advanced and we started developing more single page apps, it became clear that the advanced tools for javascript weren’t readily available in a Java world. The web development community now has a huge selection of tools, all written in javascript and running on node.

We looked at some of the build tools available – from writing our own setup using grunt to the more complete frameworks like yeoman and lineman. In the end lineman was the one that suited us best with its relatively simple approach and focus on sensible defaults.

Integrating lineman with our existing maven setup proved frustrating. We tried using maven-exec-plugin and maven-antrun-plugin but neither of those could really give us a nice way of running the correct lineman tasks alongside our jetty server for local development as well as using lineman to build the javascript parts of our projects and integrating it into the final war file.

So in the end we developed a small maven plugin ourselves to make this integration easier. The result is the lineman-maven-plugin available under the Apache License 2.0 at github.

 

Large-scale digitization projects – quality and contractors

January 31, 2014

Our large-scale digitization project of newspapers from microfilm is just on the verge of going into production. Being technical lead on the process of ingesting and controlling quality of the digitized data has been a roller coaster of excitement, disillusionment, humility, admiration, despair, confidence, sleeplessness, and triumph.

I would like to record some of the more important lessons learned as seen from my role in the project. I hope this will help other organizations, when doing similar projects. I wish we had had these insights at the beginning of our project, but also recognize that some of these lessons can only be learned by experience.

Lesson 1: For a good end result, you need people in your organization that understand all parts of the process of digitization

At the beginning of our project, we had people in our organization who knew a lot about our source material (newspapers and microfilm) and people who knew a lot about digital preservation.

We did not have people who knew much about microfilm scanning.

We assumed this would be no problem, because we would hire a contractor, and they would know about the issues concerning microfilm scanning.

We were not wrong as such. The contractor we chose had great expertise in microfilm scanning. And yet it still turned out we ended up with a gap in required knowledge.

The reason is that most scanning companies do not scan for preservation. They scan for presentation. The two scenarios entail two different sets of requirements. Our major requirement was to have a digital copy that resembled our physical copy as closely as possible. The usual set of requirements a scanning company gets from its customers is to get the most legible files for the lowest cost possible. These two sets of requirements are not always compatible.

One example was image compression. We had decided on losslessly compressed images (in JPEG2000), which is more expensive than a lossy compression but avoids the graphic artifacts that lossy compression always leaves and that can be a hassle in any post-processing or migration of the images. Using lossless image formats is an expensive choice when it comes to storage, but since we were scanning to replace originals we opted for the costly but unedited files.

Once we got our first images, though, inspection of the images showed definite signs of lossy compression artifacts. The files themselves were in a lossless format as expected, but the compression artifacts were there all the same. Somewhere along the path to our lossless JPEG2000 images, a lossy compression had taken place. The contractor assured us that they used no lossy compressions. Not until we visited the contractor and saw the scanning stations did we find the culprit. It was the scanners themselves! It turned out that the scanner, when transferring the images from the scanner to the scanner processing software, used JPEG as an intermediary format. So in the end we got the costly lossless image format, but the artifacts from lossy compression as well. It was a pure lose/lose situation. And even worse, there was no obvious way to turn it off! We finally managed to resolve it, though, with three-way communication between us, the scanner manufacturer and the contractor. Luckily, there was a non-obvious way to avoid the JPEG transfer format. The way to turn it off was to change the color profile from “gray-scale” to “gray-scale (lossless)”.

As another example, we had in our tender the requirement that the images should not be post-processed in any way. No sharpening, no brightening, no enhancement. We wanted the original scans from the microfilm scanner. The reason for this was that we can always do post-processing on the images for presentation purposes, but once you post-process an image, you lose information that cannot be regained – you can’t “unsharpen” an image and get the original back. We had assumed this would be one of our more easily met requirements. After all, we were asking the contractors not to do a task, rather than to perform one.

However, ensuring that images are not post-processed was a difficult task on its own. First there is the problem of communicating it at all. Scanner companies have great expertise in adjusting images for the best possible experience, and now we asked them not to do that. It was at first completely disruptive to communication, because our starting points were so completely different. Then there was the problem that some of the post-processing was done by the scanner software, and the contractor had no idea how to turn it off. Once again, it took three-way communication between the scanner manufacturer, our organization, and the contractor before we found a way to get the scanner to deliver the original images without post-processing.

The crucial point in both these examples is that we would not even have noticed all of this, if we hadn’t had a competent, dedicated expert in our organization, analyzing the images and finding the artifacts of lossy compression and post processing. And in our case we only had that by sheer luck. We had not scheduled any time for this analysis or dedicated an expert to the job. We had drawn on expertise like this when writing the tender, so the requirements were clear and documented, and we had expected the contractor to honor these requirements as written. It was no one’s job to make sure they did.

However, one employee stepped up on his own initiative. He is an autodidact image expert, who originally was not assigned to the project at all. He took a look at the images and started pointing out the various signs of post processing. He wrote analysis tools and went out of his way to communicate to the rest of the digitization project how artifacts could be seen and histograms could expose the signs of post processing. It is uncertain whether we would ever have had the quality of images we are getting from this project, if it had not been for his initiative.

Lesson 2: Your requirements are never as clear as you think they are

This one is really a no-brainer and did not come as a surprise for us, but it bears repeating.

Assuming you can write something in a tender and then have it delivered as described is an illusion. You really need to discuss and explain each requirement to your contractor, if you want a common understanding. And even then you should expect to have to clarify at any point during the process.

Also, in a large-scale digitization project, your source material is not as well-known as you think it is. You will find exceptions, faults and human errors that cause the source material to vary from the specifications.

Make sure you keep communication open with the contractor to clarify such issues. And make sure you have resources available to handle that communication.

Examples can be trivial – we had cases where metadata documents were delivered with the example text from our templates in our specifications, instead of with the actual value it should contain. But they can also be much more complex – for instance we asked our contractors to record the section title of our newspapers in metadata. But how do you tell an Indian operator where to find a section title in a Danish newspaper?

Examples can also be the other way round. Sometimes your requirements propose a poorer solution than what the contractor can provide. We had our contractors suggest a better solution for recording metadata for target test scans. Be open to suggestions from your contractor; in some cases they know the game better than you do.

Lesson 3: Do not underestimate the resources required for QA

Doing a large-scale digitization project probably means you don’t have time to look at all the output you get from your contractor. The solution is fairly obvious when you work in an IT department: Let the computer do it for you. We planned a pretty elaborate automatic QA system, which would check data and metadata for conformity to specifications and internal consistency. We also put into our contract that this automatic QA system should be run by the contractor as well to check their data before delivery.

This turned out to be a much larger task than we had anticipated. While the requirements are simple enough, there is simply so much grunt work to do that it takes a lot of resources to make a complete check of the entire specification. Communicating with the contractor about getting the tool to run and interpreting the results is an important part of getting value from the automatic QA tool. We have found that assumptions about technical platforms, input and output, and even communicating the output of failed automatic QA are things that should not be underestimated.

However the value of this has been very high. It has served to clarify requirements in both our own organization and with our contractor, and it has given us a sound basis for accepting the data from our contractor.

In other, smaller digitization projects, we have sometimes tried to avoid doing a thorough automatic QA check. Our experience in these cases is that this has simply postponed the discovery of mistakes that could have been detected automatically until our manual QA spot checks. The net effect of this is that the time spent on manual QA and on requesting redeliveries has been greatly increased. So our recommendation is to do thorough automatic QA, but also to expect this to be a substantial task.

Even when you have done thorough automatic QA, it does not replace the need for a manual QA process, but since you don’t have time to check every file manually, you will need to do a spot check. Our strategy in this case has been twofold: First we take a random sample of images to check, giving us a statistical model allowing us to make statements about the probability of undiscovered mistakes. Second we amend this list of images to check with images that an automatic analysis tool marks as suspicious – for instance very dark images, unexpected (that is: possibly wrong) metadata, very low OCR success rates etc.

We have had our contractor build a workflow system for doing the actual manual QA process for us. So given the input of random and suspect pages, they are presented in an interface where a QA tester can approve images or reject them with a reason. A supervisor will then use the output from the testers to confirm the reception of the data, or request a redelivery with some data corrected.

Even though the contractor builds our manual QA interface, we still need to integrate with this system, and the resources required for this should not be underestimated. We opted to have the tool installed in our location, to ensure the data checked in the manual QA spot check was in fact the data that was delivered to us. If the manual QA spot check had been done at the contractor, in theory the data could have been altered after the manual QA spot check and before delivery. Communication concerning installation of the system and providing access to images for manual QA spot check also turned out to be time consuming.

 

In conclusion, in a large-scale digitization project, QA is a substantial part of the project, and must be expected to require considerable resources.

Lesson 4: Expect a lot of time to elapse before first approved batch

This lesson may be a corollary of the previous three, but it seems to be one that needs to be learned time and time again.

When doing time-lines for a digitization project, you always have a tendency to expect everything to go smoothly. We had made that assumption once again in this project, and as we should have expected, it didn’t happen.

Nothing went wrong as such, but we simply didn’t take into account the time it takes to communicate about requirements when we did the planning. So when we received the first pilot batch, our time-line said we would go into production soon after. This, of course, did not happen. What happened was that the communication process about what needed to be changed (in the specification or in the data) started. And then, after this communication process had been completed, it took a while before new data could be delivered. And then the cycle started again.

Our newly revised plan has no final deadline. Instead it has deadlines on each cycle, until we approve the first batch. We expect this to take some time. The plan says we allow three weeks for the first cycle, then when problems seem less substantial, we reduce the cycle to two weeks. Finally we go to one-week cycles for more trivial issues. And once we have finally approved the batch, we can go into production. Obviously, this pushes our original deadline back months, but this is really how our plan should have been designed from the very beginning. So make sure your plans allow time to work out the kinks and approve the workflow, before you plan on going into production.

Lesson 5: Everything and nothing you learned from small-scale digitization projects still applies

Running small-scale digitization projects is a good way to prepare you for handling a large-scale digitization project. You learn a lot about writing tenders, communicating with contractors, what doing QA and ingest entails, how you evaluate scan results etc. It is definitely recommended to do several small-scale digitization projects before you go large-scale.

But a lot of the things you learned in small-scale digitization projects turn out not to apply when you go large-scale.

We are digitizing 32 million newspaper pages in three years. That means that every single day, we need to be able to receive 30,000 pages. With each page being roughly 15 MB, that’s close to half a terabyte a day. Suddenly a lot of resources usually taken for granted need to be re-evaluated. Do we have enough storage space just to receive the data? Do we have enough bandwidth? Can our in-house routers keep up with managing the data? How long will it actually take to run characterization of all these files? Can we keep up? What happens if we are delayed a day or even a week?

Also, in small digitization projects, manually handling minor issues is feasible. Even doing a manual check of the delivered data is feasible. In this case, if you want to do a check of everything, a full-time employee would only be able to spend about two thirds of a second per page if he or she wanted to keep up. So you really need to accept that you cannot manually do anything on the whole project. Spot checks and automation are crucial.

This also means that the only one who will ever see every page of your digitization project will probably be your contractor. Plan carefully what you want from them, because you probably only have this one chance to get it. If you want anyone to read anything printed on the newspaper page, now is the time to specify it. If you want anything recorded about the visual quality of the page, now is the time.

Another point is that you need to be very careful about what you accept as okay. Accepting something sub-optimal because it can always be rectified later will probably mean “never” rather than “later”. This needs to be taken into account every time a decision is made that affects the entire digitization output.

Conclusion

Every kind of project has its own gotchas and kinks. Large-scale digitization projects are no exception.

Listed above are some of our most important lessons learned so far and seen from the perspective of a technical lead primarily working with receiving the files. This is only one small part of the mass digitization project, and other lessons are learned in different parts of the organization.

I hope these lessons will be of use to other people – and even to ourselves next time we embark on a mass digitization adventure. It has been an exhilarating ride so far, and it has taken a lot of effort to get a good result. Next step is the process of giving our users access to all this newspaper gold. Let the new challenges begin!

Webscale in Danish

December 6, 2013

It does count as web scale if you actually index the web, right? The Danish part of the web at least, which is not at all easy to define. Nevertheless Netarkivet currently has 372 TB of harvested data, which translates to roughly 10 billion (10,000,000,000) indexable entities. We have indexed about 1/100th of this in Solr as a test and that gave us an index of about 200 GB; the full corpus should thus take about 20 TB. The question is: what kind of hardware does it take to search it all?

Soft requirements

Due to Danish copyright and privacy laws, we cannot provide general access to the archive. This means that usage will be limited, which again means that the number of possible concurrent searches can be set to 1. Of course 2 people can search at the same time, but it is okay if that doubles the response times. We thus only need to look at latency – i.e. the time it takes from issuing a search until a result is received.

Faceting and grouping by URL would be very useful when searching the archive, but as the only current access possibility is “Give us the resource with this specific URL”, any search at all would be a great step forward.

Hardware possibilities

Our knee-jerk reaction to hardware for search is to order Solid State Drives, but with a limited number of users, 20 TB of SSDs might be overkill. Luckily we do have an Isilon system with some terabytes to spare for testing as well as production. The Isilon is networked storage and it is very easy to expand, so it would be great if that could deliver the needed storage performance. Blade servers with virtual machines deliver the CPU horsepower, but as we shall see it really is all about IOPS (In/Out Operations Per Second).

SolrCloud

Building one single 20 TB index would be an intriguing endeavour, but alas, Lucene/Solr only handles 2 billion documents per segment. Besides, this would be quite devastating in case of index errors. Furthermore there is a high chance that it would provide lower performance compared to multiple indexes, as a lot of the internal seeking operations for a single index are sequential.

Some sort of distribution seems to be the way forward and SolrCloud is the first candidate (sorry, ElasticSearch, maybe next time?). We need to estimate what kind of iron it takes to handle the 20TB and we need to estimate a proper shard-count vs. shard-size ratio. Is 100 shards of 200GB better than 10 shards of 2TB? Since our content is ever-growing and never-updating, we have the luxury of being able to optimize and freeze our shards as we build them one at a time. No frustrating “index time vs. search time”-prioritization here.

Numbers, please

Real Soon Now, promise. First we need to determine the bottleneck for our box. CPU or IO? To do so we measure cloud sizes vs. number of concurrent searches. By taking the response time and dividing by (#threads * #shards), we get the average response time for the individual shards. If that number plateaus independently of #threads as well as #shards, we have the maximum performance for our box.
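Or expressed as code (just the normalization used for the tables below, not Solr internals):

/** Average time per shard per thread, as reported in the tables below (sketch). */
public class ShardTime {
  public static double perShardMS(double fullResponseTimeMS, int threads, int shards) {
    return fullResponseTimeMS / (threads * shards);
  }
}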

For query testing we do an OR-query with 2-4 random Danish words and light faceting. This should be representative of the real-world queries, at least for IO. As we cannot flush the cache on the Isilon box (Operations would kill us if we managed to do so), each test is preceded by a short (~100 queries) warmup run and a hope that the numbers will be consistent. Collected statistics are based on Solr-reported Query-Times and collected for all threads. We checked, and that information is very close to full response times.

Confused? No worries, it gets a lot clearer when we have numbers.

Spinning numbers!

The Isilon system used spinning drives with a fair amount of combined SSD and RAM caching in front. Remember that reported times are response_time/(#threads * #shards) ms.

Testing with 200GB (75M documents) indexes, Median (50% percentile), remote Isilon

        Threads
Shards     1     2     4     8    16    32   avg
     1    28    10     4     2     1     1     7
     2    99    47    35    34    34    34    47
     3    77    42    36    38    41    45    47
     4    67    47    47    54    52    56    54
     5    67    57    54    54    53    56    57
     6    64    56    53    58    59    58    58
     7    63    56    53    50    55    55    55
     8    72    60    58    60    62    61    62
     9    71    60    55    57    58    56    60
    10    71    58    57    57    56    54    59
    11    74    65    61    60    58    62    63
    12    70    61    58    61    60    58    61
    12    73    63    65    67    70    69    68
    13    83    66    67    74    70    65    71
    14    78    71    68    71    72    80    73
    15    85    75    70    70    70    78    75
    16    74    68    68    67    66    75    70
    17    78    68    70    70    70    79    72
    16    77    71    69    70    70    80    73

Testing with 200GB (75M documents) single segment indexes, 99% percentile, remote Isilon

        Threads
Shards     1     2     4     8    16    32   avg
     1   157    55    33    25    14    39    54
     2   230   104    87    85    69    68   107
     3   208    98    93    82    77    84   107
     4   166   111   109   104    88    95   112
     5   152   129   110    93    90    94   112
     6   128   130   110   100    96    98   110
     7   125   124    98    85    91    92   103
     8   128   113    99   104   106    95   107
     9   151   107    97    94    91    99   106
    10   121   106    97   100    88    90   100
    11   136   109   107   102    94   109   109
    12   117   109   100   100    94    95   102
    12   139   108   139   107   143   174   135
    13   140   111   158   121   116   112   126
    14   134   119   108   122   131   119   122
    15   154   152   123   120   128   116   132
    16   121   109   107   100   112   104   109
    17   124   115   111   112   122   117   117
    16   126   119   112   104   122   118   117

Testing with 420GB (174M documents) single segment indexes, Median (50% percentile), remote Isilon

        Threads
Shards     1     2     4     8    16    32   avg
     1    17     9     6     3     2     1     6
     2    81    50    42    42    41    44    50
     3    91    52    44    45    44    51    54
     4    83    57    53    54    52    56    59
     5    77    59    54    54    55    56    59
     6    79    59    55    53    54    53    59
     7    76    61    57    57    55    55    60
     8    77    61    56    58    59    60    62
     9    75    64    64    64    63    63    66

Testing with 420GB (174M documents) single segment indexes, 99% percentile, remote Isilon

        Threads
Shards     1     2     4     8    16    32   avg
     1   154    85    70    38    18    15    63
     2   255   126   105    97    78    72   122
     3   207   165   103    86    79    79   120
     4   181   118   102    98    92    82   112
     5   164   114   100    91    88    83   107
     6   168   123    95    90    80    78   106
     7   146   117    93    80    81    85   100
     8   150   108    98    91    83    92   104
     9   134   114   102   103    97   107   109

Isilon observations

The average response time/shard/thread gets markedly better when going from 1 to 2 threads. This makes sense, as a single thread alternates between CPU-wait and IO-wait, thus not utilizing both resources fully. More threads do not improve performance by much and in a lot of cases, 32 threads are slower than 16 threads.

Our goal was to estimate the hardware requirements for the full 20TB of index files. We see from the tables above that the average response times for the individual shards grow with the number of shards and the number of threads, but not critically. The average of the medians for 8 shards of 200GB is 62 ms, while it is 73 ms for 16 shards; this is a 17% increase for double the amount of data. With a 20TB estimate, we would need 100 shards. Expecting about 20% increase for each doubling of the shard count, the average shard response time would be approximately 115 ms for a total of 11,500 ms.
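That projection is easy to reproduce; a hedged back-of-the-envelope sketch (the 73 ms base and the ~20% growth factor are taken from the discussion above, the rest is arithmetic):

/** Reproduces the shard-count projection above (sketch; constants taken from the tables). */
public class ShardProjection {
  public static void main(String[] args) {
    double baseMS = 73;               // average median for 16 shards of 200GB
    int baseShards = 16;
    int targetShards = 100;           // ~20TB at 200GB per shard
    double growthPerDoubling = 1.20;  // ~20% slower per doubling of the shard count

    double doublings = Math.log((double) targetShards / baseShards) / Math.log(2);
    double perShardMS = baseMS * Math.pow(growthPerDoubling, doublings);
    System.out.printf("Per shard: ~%.0f ms, total: ~%.0f ms%n",
                      perShardMS, perShardMS * targetShards);
    // Prints roughly 118 ms per shard and ~11,800 ms total,
    // in line with the ~11,500 ms estimate above.
  }
}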

This might be somewhat optimistic as the cache works significantly better on smaller datasets and because the average median is lower than the single thread median, which is more telling for the expected usage pattern.

Looking at the average median for 420GB shards, we have 59 ms for 4-5 shards and 66 ms for 9 shards, or a response time increase of only 11% for a doubling of the index size. Extrapolating from this, the full corpus of about 50 shards would have an average shard response time of approximately 85 ms for a total of 4,250 ms. As this is less than half of the estimated time for the 200GB shards, this seems suspiciously low (or the estimation for the 200GB shards is too high).

Those were medians. Repeating for the 99th percentile we get 14,500 ms for 200GB shards and 5,500 ms for 420GB. Again, the 420GB measurements seem to be “too good”, compared to the 200GB ones.

Solid numbers

We do not have terabytes of unused space on Solid State Drives lying around, but a temporary reorganization on our test machine freed just above 400GB. Enough for testing 2 * 200GB. It is important to keep in mind that this storage is both local & SSD, as opposed to the Isilon box which is remote & spinning drives.

Note that the two rows in each test are for the same number of shards (2). This was done to see whether warming/caching improved response times.

Testing with 200GB (75M documents) indexes, Median (50% percentile), local SSD

        Threads
Shards     1     2     4     8    16    32   avg  total GB  avg full
     2    20     8     4     3     2     3     7       400        13
     2    16     8     4     3     2     2     6       400        12

Testing with 200GB (75M documents) single segment indexes, 99% percentile, local SSD

        Threads
Shards     1     2     4     8    16    32   avg  total GB  avg full
     2    64    35    31    14    10     8    27       400        54
     2    36    30    26    10     8    13    21       400        41

Solid observations

As opposed to spinning drives, the performance continues to improve markedly with threading up to around 16 threads. Keeping in mind that this is on a 16 core machine, this tells us that we do not have enough data to be sure whether the SSDs or the CPUs are the main bottleneck.

Note to self: It would be interesting to try and limit the number of active CPUs by using taskset.

One thing we can see is that the SSD-solution seems to be much better performing than the spinning drives one. But how much is much?

Putting it all together

With only enough free SSD for 2*200GB shards, we are forced to project the final load by threading. Our maintenance guys very temporarily freed 400GB of SSD on the Isilon box so we were able to see whether the local vs. remote parameter had a lot of influence on performance.

2*200GB shards

Threading with 2*200GB shards

Keep in mind that this is a very loose projection and especially keep in mind that bigger shards are relatively faster while bigger test data are relatively slower. Still, the conclusion seems very clear: Using SSDs as storage is a whole other level of performance, when only a small part of the index can be cached in RAM.

Sweet sweet hardware

We all know that SSDs are expensive. Except that they are not. Yes they cost 10 times as much as spinning drives when looking at capacity only, but when the whole point about SSDs is speed, it is plain stupid just to look at capacity.

As our analysis above shows, scaling to satisfactory response times (median < 1 second in this case) with spinning drives would require 5-10 times the speed we currently get from Isilon. As the Isilon box is already using RAID for speed improvement (and crash recovery), this seems like quite a task. Compensating for spinning drives by increasing the amount of RAM for caching is unrealistic with 20TB of index.

It is simply cheaper and faster to base our search machine on SSDs. We can handle crashes by mounting each drive separately and using shards the same size as each drive. If a drive fails, the corresponding shard backup (which will probably be located on Isilon) takes over until the shard has been copied to a new SSD. Our hardware guys suggest

  • 1x 24×2.5″ Intel JBOD2000 2U RACK, Dual PSU, dual backplane
  • 24x 1TB Samsung 840 EVO
  • 1x Intel RS25GB008 JBOD controller with 8 external ports
  • A server to hold the controller

Estimated price 100-200,000 Danish kroner or 20-40,000 USD, depending on RAM (suggested amount: 256 or 512GB as we expect to do heavy faceting and grouping). On top of that we need a dedicated indexing machine, which can also be used for other tasks.

Quack – an ALTO viewer

November 28, 2013

Never heard of ALTO (Analyzed Layout and Text Object)? Fear not, it is a very simple representation of structured text from physical pages, and I mean that in a positive way. Its primary use is to hold OCRed text. A piece of ALTO XML is a list of boxes, representing pages. Each page is a list of boxes, representing blocks. Blocks can, among other things, be a list of boxes, representing lines. Each line is a list of boxes, representing words. Each word is just a word. No more Chinese boxes. For all its simplicity, there is a curious lack of simple viewers for the format. That’s where Quack comes in.
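To give an idea of just how simple the format is, here is a hedged sketch of pulling all the words out of an ALTO file with plain Java XML parsing (it assumes the usual ALTO element names: Page, TextBlock, TextLine and String with a CONTENT attribute, and it ignores namespaces and positions):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Prints every OCRed word in an ALTO file (sketch only). */
public class AltoWords {
  public static void main(String[] args) throws Exception {
    Document alto = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File(args[0]));
    NodeList words = alto.getElementsByTagName("String"); // the word-level boxes
    for (int i = 0; i < words.getLength(); i++) {
      Element word = (Element) words.item(i);
      // Each word box also carries its position (HPOS, VPOS, WIDTH, HEIGHT)
      System.out.println(word.getAttribute("CONTENT"));
    }
  }
}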

Quack displaying one of the sample images

Quack is a bash hack from yours truly that grew out of control. The focus is small-scale quality assurance: A few hundred images is ok, a few thousand is manageable. It traverses a folder structure of images with corresponding ALTO-files and creates HTML-pages containing zoomable images with ALTO structure overlays. The pages can be viewed directly from the file system or (of course) put on a web server. Features are

  • OpenSeadragon used for smooth zoom & pan of images of arbitrary size (arbitrary meaning at least up to 50 gigapixel in this context)
  • Overlays
    • Connected blocks for inspecting segmentation quality
    • Grid for inspecting skewing and rotation
    • Burned out high-lights and low-lights
  • OCR-text on mouse over
  • Histogram for each image
  • Darkest and brightest greyscale value with percentage (e.g. 12% of the image is absolute black)
  • Thumbs view for folders
    • Burned out high-lights and low-lights overlays in thumbs view
  • Iterative updates of the display files (additions do not require a full rebuild)
  • Works with a fairly standard Linux setup, except for deepzoom which is optional but recommended for large images. Should work under Cygwin but this is not tested
  • Dog slow for big ALTO files as bash is heavily misused for text processing and HTML templating

Quack is Apache 2.0 Open Source. Clone it from https://github.com/tokee/quack and give it a spin. It even has sample images to get you started.

