Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for Digital Humanities in the Nordic Countries 2nd Conference, http://dhn2017.eu/

In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] works on OCR (optical character recognition) text from scanned pages, but the software converting the scanned images to text often makes reading errors. As a result the search engine will miss matching words due to OCR errors. Since many of our newspapers are old and the scans/microfilms are also of low quality, the resulting OCR constitutes a substantial problem. In addition, the OCR converter performs poorly with old font types such as fraktur.

One way to find OCR errors is to use the unsupervised Word2Vec[3] learning algorithm. This algorithm identifies words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words: synonyms, conjugations, declensions etc. In the case of a corpus with OCR errors, the Word2Vec algorithm will also find the misspellings of a given word, whether they stem from bad OCR or, in some cases, from the journalists themselves. A given word appears in similar contexts despite its misspellings and is identified by its context. For this to work the Word2Vec algorithm requires a huge corpus, and for the newspapers we had 140GB of raw text.
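
As a rough illustration (this is not the pipeline actually used for the archive), a Word2Vec model can be trained and queried with deeplearning4j; the corpus path and all parameters below are assumptions:

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class NewspaperW2V {
    public static void main(String[] args) throws Exception {
        SentenceIterator sentences = new BasicLineIterator("corpus.txt"); // one sentence per line
        TokenizerFactory tokens = new DefaultTokenizerFactory();

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare tokens
                .layerSize(200)        // dimensionality of the word vectors
                .windowSize(5)         // context window
                .iterate(sentences)
                .tokenizerFactory(tokens)
                .build();
        vec.fit();

        // Words appearing in contexts similar to "banana" - for an OCR'ed corpus
        // the list will contain misspellings such as "hanana"
        System.out.println(vec.wordsNearest("banana", 10));
    }
}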

Given the words returned by Word2Vec, we use a Danish dictionary to remove the same word in different grammatical forms. The remaining words are filtered by a similarity measure: an extended version of the Levenshtein distance that takes the length of the word into account, combined with an idempotent normalization of frequent one- and two-character OCR errors.

Example: Let’s say you use Word2Vec to find words similar to banana and it returns: hanana, bananas, apple, orange. Bananas is removed using the (English) dictionary, since it is not an OCR error. Of the three remaining words only hanana is close to banana, so it is the only misspelling of banana found in this example. The Word2Vec algorithm does not know how a word is spelled or misspelled; it only uses the semantic and syntactic context.
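
A minimal sketch of the filtering step; the dictionary, the plain Levenshtein measure, the length normalization and the 0.25 threshold are stand-ins for the extended measure described above:

import java.util.*;

public class MisspellingFilter {
    /** Plain Levenshtein distance between two words. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    /** Keep only likely misspellings: unknown to the dictionary and close to the target. */
    static List<String> misspellings(String target, List<String> similar, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (String word : similar) {
            if (dictionary.contains(word)) continue;      // real word, not an OCR error
            double normalized = levenshtein(target, word) // length-normalized distance
                    / (double) Math.max(target.length(), word.length());
            if (normalized <= 0.25) result.add(word);     // threshold is an assumption
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("banana", "bananas", "apple", "orange"));
        System.out.println(misspellings("banana",
                Arrays.asList("hanana", "bananas", "apple", "orange"), dict));
        // -> [hanana]
    }
}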

This method is not an automatic OCR error corrector and cannot output corrected OCR. But when searching, it will appear as if you are searching in an OCR corrected text corpus. Single word searches on the full corpus return between 3% and 20% more results. Preliminary tests on the full corpus show only relatively few false positives among the additional results returned, thus increasing recall substantially without a decline in precision.

The advantage of this approach is a quick win with minimal impact on a search engine[2] based on low quality OCR. The algorithm generates a text file with synonyms that can be used by the search engine. Not only single word search but also phrase search with highlighting works out of the box. An OCR correction demo[4] using Word2Vec on the Danish newspaper corpus is available on the Labs[5] pages of The State And University Library, Denmark.
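
In Solr, for instance, such a synonym file plugs directly into the standard synonym filter; the file content, field type and analyzer chain below are a made-up example:

# synonyms.txt - generated by the Word2Vec pipeline
banana,hanana

<fieldType name="text_ocr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>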

[1] Mediestream, The Danish digitized newspaper archive.
http://www2.statsbiblioteket.dk/mediestream/avis

[2] SOLR or Elasticsearch etc.

[3] Mikolov et al., Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/abs/1301.3781

[4] OCR error detection demo (change word parameter in URL)
http://labs.statsbiblioteket.dk/dsc/ocr_fixer.jsp?word=statsminister

[5] Labs for State And University Library, Denmark
http://www.statsbiblioteket.dk/sblabs/

 


70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, not net archive specific features.

Hardware & Solr setup

Search hardware is 4 machines, each with the same specifications & setup:

  • 16 true CPU-cores (32 if we count Hyper Threading)
  • 384GB RAM
  • 25 SSDs @ 1TB (930GB really)

Each machine runs 25 Solr 4.10 instances @ 8GB heap, each instance handling a single shard on a dedicated SSD. The exception is machine 4, which only has 5 shards, because it is still being filled. Everything is coordinated by SolrCloud as a single collection.

(Figure: Netarchive SolrCloud search setup)

As the Solrs are the only things running on the machines, there are at least 384GB - 25*8GB = 184GB of free RAM for disk cache on each machine. As we do not specify Xms, this varies somewhat, with 200GB free RAM at last inspection. As each machine handles 25*900GB = 22TB of index data, the amount of disk cache is 200GB/22TB ≈ 1% of the index size.

Besides the total size of 70TB / 16 billion documents, the collection has some notable high-cardinality fields, used for grouping and faceting:

  • domain & host: 16 billion values / 2-3 million unique
  • url_norm: 16 billion values / ~16 billion unique
  • links: ~50 billion values / 20 billion+ unique

Workload and performance

The archive is not open to the public, so the number of concurrent users is low, normally just 1 or 2 at a time. There are three dominant access patterns:

  1. Interactive use: The typical request is a keyword query with faceting on 4-6 fields (domain, year, mime-type…), sometimes grouped on url_norm and often filtered on one or more of the facets.
  2. Corpus probing: Batch extraction jobs, such as using the stats component for calculating the size of all harvested material, for a given year, for all harvested domains separately.
  3. Lookup mechanism for content retrieval: Very experimental and used similarly to CDX lookups + Wayback display. Such lookups are searches for 1-100 url_field:url pairs, OR’ed together, grouped on the url_field and sorted on temporal proximity to a given timestamp; a hypothetical example is sketched below.
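
A hypothetical rendition of pattern 3 as Solr request parameters; the field name crawl_date and the reference timestamp are assumptions:

q=url_norm:"http://example.dk/" OR url_norm:"http://example.dk/contact"
&group=true
&group.field=url_norm
&group.sort=abs(sub(ms(2016-11-01T00:00:00Z),crawl_date)) asc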

For various reasons, we do not have separate logs for the different scenarios. To give an approximation of interactive performance, a simple test was performed: Extract all terms matching more than 0.01% of the documents, use those terms to create fake multi-term queries (1-4 terms) and perform searches for the queries in a single thread.
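
A sketch of such a test harness using SolrJ (HttpSolrServer is the Solr 4.x client class; the term file and searcher URL are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FakeQueryTest {
    public static void main(String[] args) throws Exception {
        // Terms previously extracted as matching more than 0.01% of the documents
        List<String> terms = Files.readAllLines(Paths.get("frequent_terms.txt"));
        HttpSolrServer solr = new HttpSolrServer("http://searcher:8983/solr/netarchive");
        Random random = new Random(87);

        for (int i = 0; i < 1000; i++) {
            Collections.shuffle(terms, random);
            int termCount = 1 + random.nextInt(4);                 // 1-4 terms per query
            String query = String.join(" ", terms.subList(0, termCount));
            long start = System.nanoTime();
            QueryResponse response = solr.query(new SolrQuery(query));
            System.out.println(query + ": " + response.getResults().getNumFound()
                    + " hits in " + (System.nanoTime() - start) / 1000000 + "ms");
        }
        solr.shutdown();
    }
}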

(Figure: Non-faceted (blue) and faceted (pink) search in the full net archive, bucketed by hit count)

The response times for interactive use lie within our stated goal of keeping median response times below 2 seconds. It is not considered a problem that queries with 100M+ hits take a few extra seconds.

The strangely low median for non-faceted search in the 10⁸ bucket is due to query size (the number of terms in the query) impacting search speed. The very fast single-term queries dominate this bucket, as very few multi-term queries gave enough hits to land in it. The curious can take a look at the measurements, where the raw test result data are also present.

Moral: Those are artificial queries and should only be seen as a crude approximation of interactive use. More complex queries, such as grouping on billions of hits or faceting on the links field, can take minutes. The worst case discovered so far is 30 minutes.

Secret sauce

  1. Each shard is fully optimized and the corpus is extended by adding new shards, which are built separately; a sketch of the optimize call follows below. The memory savings from this are large: No need for the extra memory needed for updating indexes (which requires 2 searchers to be temporarily open at the same time), and no need for large segment→ordinal maps for high-cardinality faceting.
  2. Sparse faceting means lower latency, a lower memory footprint and less GC. To verify this, the performance test above was re-run with vanilla Solr faceting.
(Figure: Vanilla Solr faceting (blue) and sparse faceting (pink) search in the full net archive, bucketed by hit count)
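
The optimize step from point 1 could be triggered with SolrJ roughly like this (HttpSolrServer is the Solr 4.x client class; host and core name are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class OptimizeShard {
    public static void main(String[] args) throws Exception {
        // Fully optimize a freshly built shard down to a single segment before
        // it is copied to the search cloud (waitFlush, waitSearcher, maxSegments)
        HttpSolrServer indexer = new HttpSolrServer("http://indexer:8983/solr/shard25");
        indexer.optimize(true, true, 1);
        indexer.shutdown();
    }
}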

Lessons learned so far

  1. Memory. We started out with 256GB of RAM per machine. This worked fine until all the 25 Solr JVMs on a machine had expanded up to their 8GB Xmx, leaving ~50GB, or 0.25% of the index size, free for disk cache. At that point performance tanked, which should not have come as a surprise, as we tested this scenario nearly 2 years ago. Alas, quite foolishly we had relied on the Solr JVMs not expanding all the way up to 8GB. Upping the memory per machine to 384GB, leaving 200GB or 1% of index size free for disk cache, ensured that interactive performance was satisfactory.
    An alternative would have been to lower the heap for each Solr. The absolute minimum heap for our setup is around 5GB per Solr, but that setup is extremely vulnerable to concurrent requests or memory-heavy queries. To free enough RAM for satisfactory disk caching, we would have needed to lower the heaps to 6GB, ruling out faceting on some of the heavier fields and in general having to be careful about the queries issued. Everything works well with 8GB, the only Out Of Memory incidents having been due to experiment-addicted developers (aka me) issuing stupid requests.
  2. Indexing power: Practically all the indexing work is done in the Tika analysis phase. It took about 40 CPU-core years to build the current version 2 of the index; in real time it took about 1 year. Fortunately the setup scales practically linearly, so next time we’ll try to allocate 12 power houses for 1 month instead of 1 machine for 12 months.
  3. Automation: The current setup is somewhat hand-held. Each shard is constructed by running a command, waiting a bit less than a week, manually copying the constructed shard to the search cloud and restarting the cloud (yes, restart). In reality it is not that cumbersome, but a lot of time was wasted when the indexer had finished with no one around to start the next batch. Besides, the excessively separated index/search setup means that the content currently being indexed into an upcoming shard cannot be searched.

Looking forward

  1. Keep the good stuff: We are really happy about the searchers being immutable and on top of fully optimized shards.
  2. Increase indexing power: This is a no-brainer and “only” a question of temporary access to more hardware.
  3. Don’t diss the cloud: Copying raw shards around and restarting the cloud is messy. Making each shard a separate collection would allow us to use the collections API for moving them around, and an alias to access them all as a single collection.
  4. Connect the indexers to the searchers: As the indexers only handle a single shard at a time, they are fairly easy to scale so that they can also function as searchers for the shards being built. The connection is trivial to create if #3 is implemented.
  5. Upgrade the components: Solr 6 will give us JSON faceting, streaming, SQL-ish support, graph traversal and more. These aggregations would benefit both interactive use and batch jobs. We could get there by upgrading the shards with Lucene’s upgrade tool, but we would rather perform a whole re-index, as we also need better data in the index. A story for another time.

 


Prototype demo for OCR postfix in Danish Newspapers

In The Danish Newspaper Archive you can search in 25 million newspaper pages and view the pages. The search engine uses OCR (optical character recognition) text from scanned pages, but the software reading the text from the scanned images often makes reading errors. As a result the search engine will miss matching words due to OCR errors. Since many of our newspapers are old, the quality of the scans/microfilms is not very good and the OCR software has problems with old font types, the bad OCR constitutes a substantial problem.

One way to find these OCR errors is to use the Word2Vec algorithm that I have written about before. The algorithm detects words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words: synonyms, conjugations, declensions etc. But in the case of a corpus with OCR errors the Word2Vec algorithm will also find the misspellings of a given word, whether they come from bad OCR or, in some cases, from journalists. A given word appears in essentially the same contexts for all its misspellings. For this to work the Word2Vec algorithm requires a huge corpus, and for the newspapers we had 140GB of raw text. This is probably also the largest Word2Vec index ever built on a Danish corpus.

Given the list of words returned by Word2Vec, we use a Danish dictionary to remove the same word in different grammatical forms, since these are not OCR errors. For the remaining words you check whether they are close enough to the original word, character-wise, to be identified as misspellings.
Example: Let’s say you use Word2Vec to find words similar to ‘banana’ and it returns: hanana, bananas, apple, orange. You remove bananas using the (English) dictionary, since it is not an OCR error. Of the three remaining words only ‘hanana’ is close to ‘banana’, and it is thus the only misspelling of banana found in this example. Remember that the Word2Vec algorithm does not care how the words are spelled or misspelled; it only uses the semantic context of the words.

You can play with the Word2Vec index on the Danish Newspapers here:
http://labs.statsbiblioteket.dk/dsc/ (remember to select the newspaper corpus)

And this page shows how the dictionary is used to find misspellings:
http://labs.statsbiblioteket.dk/dsc/ocr_fixer.jsp?word=statsminister
(change the last word in the URL – Danish only, sorry…)

Running this algorithm on 1000 random words takes 8 hours (using 20 CPUs) and fixes 84 million OCR errors. A very few of the detections are false positives rather than OCR errors, but this is rare compared to the true OCR errors. The last step, maximizing detected OCR errors while minimizing false positives, is still in progress… 🙂

The newspaper Archive: http://www2.statsbiblioteket.dk/mediestream/avis

 


2D visualization of high dimensional word embeddings

In this blog post I try out a method for having a computer read a text, analyse the characters and then make a 2D visualization of the similarity of the characters. To achieve this I use the Word2Vec algorithm, build a distance matrix of all mutual distances and fit the distances into a 2D plot. The three texts I used were:

  • All 3 Lord of the Rings books
  • Pride and Prejudice + Emma by Jane Austen
  • A combined text of 35,000 free English Gutenberg e-books

Word2Vec is an algorithm invented by Google researchers in 2013. Its input is a text which has been preprocessed, as I will explain later. The algorithm then extracts all words and maps each word to a multidimensional vector, typically of 200 dimensions. Think of the quills of a hedgehog where each quill is a word, except in more than 3 dimensions. What is remarkable about the algorithm is that it captures some of the contexts of the words, and this is reflected in the multidimensional vectors. Words that are somewhat similar are very close in this vector space, where ‘close’ is measured by the angle between two vectors. Furthermore, the relative positions of two words also capture a relation between the words. A well known example is that the distance vector from ‘man’ to ‘king’ is almost identical to the distance vector from ‘woman’ to ‘queen’. Using this information you are able to predict the word ‘queen’ given the three words <man,king> <woman,?>. It is far from obvious why the algorithm reflects this behaviour in the vector space, and I have not fully understood the algorithm yet. Before you can use the Word2Vec algorithm you have to remove all punctuation, split the text into one sentence per line and lowercase everything. Splitting into sentences is not just splitting whenever you meet a ‘.’ character: for instance, ‘Mr. Anderson’ should not trigger a split.
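
A rough sketch of this preprocessing, with a tiny hand-made abbreviation list standing in for proper sentence detection:

import java.util.*;

public class Preprocess {
    private static final Set<String> NO_SPLIT = new HashSet<>(
            Arrays.asList("mr.", "mrs.", "dr.", "st."));  // tiny stand-in list

    /** Lowercases, splits into sentences and strips punctuation: one sentence per line. */
    public static String clean(String text) {
        StringBuilder out = new StringBuilder();
        StringBuilder sentence = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ENGLISH).split("\\s+")) {
            if (token.endsWith(".") && !NO_SPLIT.contains(token)) {
                sentence.append(' ').append(token.replaceAll("[^a-z]", ""));
                out.append(sentence.toString().trim()).append('\n'); // sentence ends here
                sentence.setLength(0);
            } else {
                sentence.append(' ').append(token.replaceAll("[^a-z]", ""));
            }
        }
        if (sentence.length() > 0) out.append(sentence.toString().trim()).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("Mr. Anderson met Gandalf. They talked."));
        // -> "mr anderson met gandalf" and "they talked" on separate lines
    }
}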

First I create the multidimensional representation of the words using Word2Vec, which is just all the words (like a dictionary) and the vector for each word. In the next step I manually input the characters (or words, in fact) that I want to create the visualization for, and calculate the matrix of all mutual distances by taking the cosine of the angle between the vectors. This gives a value between -1 and +1, which I then shift to the range 0 to 2 so I have a positive distance between the words. Finally I take this distance matrix and turn it into a 2D visualization, trying to keep the distances in the plot as close as possible to those in the vector space. In general this is not possible; even for 3 vectors it can be impossible (if the triangle inequality is violated). I create the plot by dividing the plane into a grid and placing the first character in the middle. The second character is also easy to place: on a circle around the first with a radius equal to their distance. The following characters are placed one at a time in the grid cell that minimizes the sum of the distance errors to the already placed characters. This is a greedy algorithm that prioritizes the first characters added to the plot, which is why I plotted the main characters first and let the other characters be placed relative to these.
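
A condensed sketch of the placement algorithm; the grid dimensions and the scaling between vector space and grid distances are crude assumptions:

import java.util.*;

public class GreedyPlot {
    /** Cosine-based distance shifted to [0, 2]. */
    static double distance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; normA += a[i] * a[i]; normB += b[i] * b[i];
        }
        return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB)); // 1 - cosine in [0, 2]
    }

    /** Greedily place words on a grid, minimizing distance errors to already placed words. */
    static int[][] place(List<double[]> vectors, int width, int height) {
        int[][] positions = new int[vectors.size()][];
        positions[0] = new int[]{width / 2, height / 2};   // first word in the middle
        for (int w = 1; w < vectors.size(); w++) {
            double bestError = Double.MAX_VALUE;
            for (int x = 0; x < width; x++) {
                for (int y = 0; y < height; y++) {
                    double error = 0;
                    for (int p = 0; p < w; p++) {          // compare with placed words only
                        double target = distance(vectors.get(w), vectors.get(p));
                        double actual = Math.hypot(x - positions[p][0], y - positions[p][1]);
                        error += Math.abs(actual / width - target); // crude scaling to [0, ~1]
                    }
                    if (error < bestError) {
                        bestError = error;
                        positions[w] = new int[]{x, y};
                    }
                }
            }
        }
        return positions;
    }
}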

I tried to use the Stanford entity extraction tool to extract both locations and persons from a given text, but there were way too many false positives, so I had to feed the algorithm the characters manually. To do it properly I should have replaced a character mentioned by multiple names with a single name: Gandalf, Gandalf the Grey and Mithrandir are the same character, etc. But I did not perform this substitution, so when I select the character Gandalf I only get the contexts where he is mentioned as Gandalf and not as Mithrandir.

And now, let’s see some of the 2D visualizations!

Lord of the Rings

0) Frodo
1) Sam
2) Gandalf
3) Gollum
4) Elrond
5) Saruman
6) Legolas
7) Gimli
8) Bilbo
9) Galadriel

______________________________________________________________________
___________________________________________3__________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________1______________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
5_____________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________0__________________8___
_______________________2______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________6______________________________
______________________________________________________________________
______________________________________________________________________
__________________________________7___________________________________
______________________________________________________________________
______________________________________________________________________
_______________________4______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_____________________________9________________________________________
______________________________________________________________________

 

Jane Austen: Pride and Prejudice + Emma

0) Elizabeth (Elizabeth Bennet)
1) Wickham (George Wickham)
2) Darcy (Mr. Darcy)
3) Bourgh (Lady Catherine de Bourgh)
4) Lydia (Lydia Bennet)
5) William (Mr. William Collins)
6) Kitty (Catherine “Kitty” Bennet)
7) Emma (Emma Woodhouse)
8) Knightley (George Knightley)

__________________________________________________________________________
_________________________________________8________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________________1__________2___________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________7_____
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_____________________________________________________0____________________
________________3__________4______________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_______________________________________________________5__________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________6______________________________________________________
__________________________________________________________________________

 

35,000 English Gutenberg books

In this plot I selected different animals instead of characters:

0) Fish
1) Cat
2) Dog
3) Cow
4) Bird
5) Crocodile
6) Donkey
7) Mule
8) Horse
9) Snake
__________________________________________________________
_____________________________________________7____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
____________________________________________6_____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
________________________________________________________8_
__________________________________________________________
__________________________________________________________
________________________________2_________________________
__________________________________________________________
_______________1__________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
___________________________________________________3______
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__4_______________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________9_______________________________
__________________________________________________________
______________5___________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________0_______________________________
__________________________________________________________

Conclusion

Does the 2D plotting catch some of the essence of the words/characters from the books? Or does it look like they are just thrown at random onto the plane?

I look forward to your conclusions! For the Gutenberg animals plot I believe the visualization really does match how I see the animals: fish and reptiles are grouped together, and in the upper left corner we have the horse family of animals. For the Jane Austen plots it is also interesting that the character Emma matches Elizabeth most closely; they are from two different books, but they are somewhat similar main characters.


CDX musings

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry.

There is currently some debate about CDX, which is used by the Wayback Engine, Open Wayback and other web archive-oriented tools, such as Warcbase. A CDX Server API is being formulated, as is the CDX format itself. Inspired by a post by Kristinn Sigurðsson, here comes another input.

(Diagram: CDX components, by Ulrich Karstoft Have)

CDX Server API

There is an ongoing use case-centric discussion of needed features for a CDX API. Compared to that, the CDX Server API – BETA seems a bit random. For example: a feature such as regexp matching on URLs can be very heavy on the backend and opens up for easy denial of service (intentional as well as unintentional). It should only be in the API if it is of high priority. One way of handling all wishes is to define a core API with the bare necessities and add extra functionality as API extensions. What is needed and what is merely wanted?

The same weighing problem applies to the fields required from the server. Kristinn discusses CDX as a format and boils the core fields down to canonicalized URL, timestamp, original URL, digest, record type, WARC filename and WARC offset. Everything else is optional. This approach matches the core-versus-extensions division of functionality.

With optional features and optional fields, a CDX server should have some mechanism for stating what is possible.

URL canonicalization and digest

The essential lookup field is the canonicalised URL. Unfortunately it is also the least formalised, which is really bad from an interoperability point of view. When the CDX Server API is formalised, a strict schema for URL canonicalisation is needed.
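
For illustration, one common canonicalisation scheme (SURT-style keys, as used by several Wayback tools) collapses variants of the same URL into one key; the exact rules shown here are an example, not a formalised standard:

http://www.Example.dk/page?b=2&a=1   ->  dk,example)/page?a=1&b=2
https://example.dk:443/page?a=1&b=2  ->  dk,example)/page?a=1&b=2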

Likewise, the digest format needs to be fixed, to allow for cross-institution lookups. As the digests do not need to be cryptographically secure, the chosen algorithm will (hopefully) not become obsolete with age.

It would be possible to allow for variations of both canonicalisation and digest, but in that case it would be as extensions rather than core.

CDX (external) format

CDX can be seen as a way of representing a corpus, as discussed at the RESAW workshop in December 2015.

  • From a shared external format perspective, tool-specific requirements such as sorting of entries or order of fields are secondary or even irrelevant.
  • Web archives tend to contain non-trivial amounts of entries, so the size of a CDX matters.

With this in mind, the bare minimum of fields would be something like original URL, timestamp and digest. The rest is a matter of expanding the data from the WARC files. On a more practical level, including the WARC filename and WARC offset is probably a good idea.

The one field not to require is the canonicalized URL: It is redundant, as it can be generated directly from the original URL, and it unnecessarily freezes the canonicalization method into the CDX format.

Allowing for optional extra fields after the core ones is again pragmatic. JSON is not the most compact format for tables of data, but it has the flexibility of offering entry-specific fields. CDXJ deals with this, although it does not specify the possible keys inside the JSON blocks, meaning that the full CDX file has to be iterated to discover those keys. There is also the distinction between simple key:value JSON entries, which can be processed generically, and complex nested JSON structures, which require implementation-specific code. A hypothetical entry is shown below.
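
A hypothetical CDXJ-style entry, with all field names and values invented for illustration:

dk,example)/page 20161129120000 {"url": "http://example.dk/page", "mime": "text/html", "status": "200", "digest": "sha1:Z7BSICSCURXDMNMQ9NNSVQIZEE0=", "filename": "EXAMPLE-20161129.warc.gz", "offset": "11583"}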

CDX (internal) format

Having a canonicalized and SURTed URL as the primary field and having the full CDX file sorted is an optimization towards specific tools. Kris touches lightly on the problem with this by suggesting that the record type might be better positioned as the secondary field (as opposed to timestamp) in the CDX format.

It follows easily that the optimal order, or even representation, of the fields depends on the tools as well as the use case. But how the tools handle CDX data internally really does not matter, as long as they expose the CDX Server API correctly and allow for export in an external CDX format. The external format should not be dictated by internal use!

CDX (internal vs. external) format

CDX is not a real format yet, but tools do exist and they expect some loosely defined common representation in order to work together. As such, it is worth considering whether some of the traits of current CDXes should spill over to an external format. Primarily that would be the loosely defined canonicalized URL as the first field and the sorted nature of the files. In practice that would mean a near doubling of file size, due to the redundant URL representation.

CDX implementation

The sorted nature of current CDX files has two important implications: Calculating the intersection of two files is easy, and lookup on the primary key (canonicalised URL) can algorithmically be done in O(log n) time using binary search.

In reality, simple binary search works poorly on large datasets, due to the lack of locality. This is worsened by slow storage types such as spinning drives and/or networked storage. There are a lot of tricks to remedy this, from building indexes to re-ordering the entries. The shared theme is that the current non-formalised CDX files are not directly usable: They require post-processing and extensions by the implementation.

The takeaway is that the value of a binary search-optimized external CDX format diminishes as the scale goes up and specialized implementations are created. Wholly different lookup technologies, such as general purpose databases or search engines, have zero need for the format-oriented optimizations.

 

 


Faster grouping, take 1

A failed attempt at speeding up grouping in Solr, with an idea for the next attempt.

Grouping at a Statsbiblioteket project

We have 100M+ articles from 10M+ pages, belonging to 700K editions of 170 newspapers, in a single Solr shard. It can be accessed at Mediestream. If you speak Danish, try searching for “strudsehest”. Searches are at the article level, with the results sorted by score and grouped by edition, with a maximum of 10 articles per edition. Something like this:

q=strudsehest&group=true&group.field=edition&group.limit=10

This works well for most searches. But for the heavier ones, response times creep into seconds, sometimes exceeding the 10 second timeout we use. Not good. So what happens in a grouped search that is sorted by document score?

  1. The hits are calculated
  2. A priority queue is used to find the top-X groups with the highest scores
    1. For each hit, calculate its score
    2. If the score is > the lowest score in the queue, resolve the group value and update the priority queue
  3. For each of the top-X groups, a priority queue is created and filled with document IDs
    1. For each hit, calculate its score and resolve its group value (a BytesRef)
    2. If the group value matches one of the top-X groups, update that group’s queue
      1. Updating the queue might involve resolving multiple field values for the document, depending on in-group sorting
  4. Iterate the top-X groups and resolve the full documents

Observation 1: Hits are iterated twice. This is hard to avoid if we need more than 1 entry in each group. An alternative would be to keep track of all groups until all the hits have been iterated, but this would be extremely memory costly for high-cardinality fields.

Observation 2: In step 3.1, score and group resolving is performed for all hits. It is possible to use the same logic as in step 2, where the group is only resolved if the score is competitive.

Attempt 1: Delayed group resolving

The idea from observation 2 has been implemented as a kludge-hacky proof of concept. Code is available at the group_4_10 branch at GitHub for those who like hurt.

When the hits are iterated the second time, all scores are resolved but only the group values for the documents with competitive scores are resolved. So how well does it work?

(Figure: Lazy group value resolving for Solr)

Observation: Optimized (aka lazy group value resolving) grouping is a bit slower than vanilla Solr grouping for some result sets, probably the ones where most of the group values have to be resolved. For other result sets there is a clear win.

It should be possible to optimize a bit more and bring the overhead of the worst-case optimized groupings down to near-zero. However, since there are so few best-case result sets and since the win is just about a third of the response time, I do not see this optimization attempt as being worth the effort.

Idea: A new level of lazy

IMPORTANT: I am a klutz and did not read the Solr code properly. Please skip ahead to the next section.

Going back to the algorithm for grouping we can see that “resolving the value” occurs multiple times. But what does it mean?

With DocValued terms, this is really a two-step process: The DocValue ordinal is requested for a given docID (blazingly fast) and the ordinal is used to retrieve the term (fast) in the form of a BytesRef. You already know where this is going, don’t you?

Millions of “fast” lookups accumulate to slow, and we don’t really need the terms as such. At least not before we have to deliver the final result to the user. What we need is a unique identifier for each group value, and the ordinal is exactly that.

But wait. Ordinals are not comparable across segments! We need to map the segment ordinals to a global structure. Luckily this is exactly what happens when doing faceting with facet.method=fc, so we can just scrounge the code from there.
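
A sketch of the ordinal-first idea against the Lucene API; getSortedValues and lookupOrd are from the 4.x/5.x era and the exact signatures vary between Lucene versions:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.BytesRef;

public class OrdinalGrouping {
    /** Compare hits on global ordinals; resolve terms only for the final top-X groups. */
    static void sketch(IndexReader reader, int[] topDocIDs) throws Exception {
        // A composite view with a segment-ord to global-ord map built behind the scenes
        SortedDocValues groupValues = MultiDocValues.getSortedValues(reader, "edition");
        for (int docID : topDocIDs) {
            int globalOrd = groupValues.getOrd(docID); // cheap: no BytesRef materialized
            // ... use globalOrd as the group key in the priority queues ...
        }
        // Only when delivering the result: resolve the winning ordinals to actual terms
        BytesRef term = groupValues.lookupOrd(0 /* a winning ordinal */);
        System.out.println("Group: " + term.utf8ToString());
    }
}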

With this in mind, the algorithm becomes

  1. The hits are calculated
  2. A priority queue is used to find the top-X groups with the highest scores
    1. For each hit, calculate its score
    2. If the score is > the lowest score in the queue, resolve the group value ordinal and update the priority queue
  3. For each of the top-X groups, a priority queue is created and filled with document IDs
    1. For each hit, resolve its group value segment-ordinal and convert that to global ordinal
    2. If the group value ordinal matches one of the top-X groups, update that group’s queue
      1. Updating the queue might involve resolving the document score or resolving multiple field value ordinals for the document, depending on in-group sorting
  4. Iterate the top-X groups and resolve the Terms from the group value ordinals as well as the full documents

Note how the logic is reversed for step 3.1, prioritizing value ordinal resolving over score calculation. Experience from the facet code suggests that ordinal lookup is faster than score calculation.

This idea has not been implemented yet. Hopefully it will be done Real Soon Now, but no promises.

Update 2016-01-28: Read the code properly

The Solr code does not resolve BytesRefs before it is necessary in step 3. On the contrary, it resolves the segment-specific group ordinals from the BytesRefs delivered by step 2 and uses fast lookups of those ordinals before calculating the score.

So while the first pass (determining the groups) seems optimizable by postponing BytesRef resolving, the second pass does not have any apparent performance failings.


The ones that got away

Two and a half ideas for improving Lucene/Solr performance that did not work out.

Track the result set bits

At the heart of Lucene (and consequently also Solr and ElasticSearch), there is a great amount of doc ID set handling. Abstractly speaking, a doc ID set is a collection of internal document IDs. If there are 10 documents in the index and a query matches the first and the fourth, the set would be [0, 3] as we count from 0. Every time there is filtering or faceting, bit sets are used.

SortedIntDocSet in Solr is one example of a document ID set. The IDs are stored after each other in an int[]. This takes up 32 bits per document in the set and is used for small result sets only.

FixedBitSet in Lucene is another example. It reserves 1 bit for each document in the full corpus. If the bit for a document is set, it means that the document is part of the set. All the bits are stored in a long[], which makes most operations quite fast: To calculate the union, the relevant code is

// Union (this |= other): OR the backing longs word for word
int pos = Math.min(numWords, otherNumWords);
while (--pos >= 0) {
  thisArr[pos] |= otherArr[pos]; // 64 documents handled per operation
}

FixedBitSet is very widely used. It is referenced in the Lucene/Solr code 451 times! As opposed to SortedIntDocSet et al, it can contain result sets of any size and reading or changing a specific bit is O(1). It does have a weakness though: Sparse sets.

Nearly all operations on FixedBitSet take time linear in the total number of documents in the index, not in the number of documents in the set. Calculating the union of [5, 87] and [1, 5, 1056978], when there are 20 million documents in the index, requires stepping through all 312,500 longs (20M / 64) in each array to check for set bits.

Idea: Tracking changes to the structure works really well for sparse faceting, so why not for FixedBitSets? Add a tracking bitset where each bit corresponds to a single long in the underlying FixedBitSet. If the tracking bit is set, the corresponding FixedBitSet long contains at least one set bit. To locate set bits, iterate the tracking bitset and only inspect the longs marked as non-zero. This reduces the number of unnecessary lookups by a factor of 64. Of course a third bitset can be added on top of the two, reducing unnecessary lookups by a total factor of 4096.
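
A minimal sketch of the idea; the actual implementation linked below is more elaborate:

public class TrackedBits {
    final long[] bits;    // 1 bit per document, as in FixedBitSet
    final long[] tracker; // 1 bit per word in bits: set means the word is non-zero

    public TrackedBits(int numBits) {
        bits = new long[(numBits + 63) >>> 6];
        tracker = new long[(bits.length + 63) >>> 6];
    }

    public void set(int index) {
        int word = index >>> 6;
        bits[word] |= 1L << index;         // Java shifts are mod 64, as in FixedBitSet
        tracker[word >>> 6] |= 1L << word; // mark the word as non-empty
    }

    /** Union: only visits the words marked as non-empty in the other set. */
    public void or(TrackedBits other) {
        for (int t = 0; t < other.tracker.length; t++) {
            long trackerWord = other.tracker[t];
            while (trackerWord != 0) {
                int word = (t << 6) + Long.numberOfTrailingZeros(trackerWord);
                bits[word] |= other.bits[word];
                tracker[word >>> 6] |= 1L << word;
                trackerWord &= trackerWord - 1; // clear the lowest set bit
            }
        }
    }
}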

An implementation is available at GitHub together with a performance test. Instructions for running it are in the TRACKED_README.txt file. Some reduced sample runs, where open is the vanilla FixedBitSet and tracked is the tracked version:

2K / 20M bits set (0.01% filled)
test         impl        ms
union        open      0.42
union        tracked   0.16
cardinality  open      0.34
cardinality  tracked   0.09
icount       open      0.59
icount       tracked   0.57
nextSetBit   open      0.31
nextSetBit   tracked   0.12
iterator     open      0.32
iterator     tracked   0.14

20K / 20M bits set (0.1% filled)
test         impl        ms
union        open      0.81
union        tracked   1.08
cardinality  open      0.34
cardinality  tracked   0.49
icount       open      0.58
icount       tracked   2.18
nextSetBit   open      0.58
nextSetBit   tracked   0.79
iterator     open      0.68
iterator     tracked   0.80

200K / 20M bits set (1.0% filled)
test         impl        ms
union        open      3.51
union        tracked   4.93
cardinality  open      0.43
cardinality  tracked   1.48
icount       open      0.59
icount       tracked  10.26
nextSetBit   open      3.21
nextSetBit   tracked   4.24
iterator     open      3.98
iterator     tracked   2.54

Technically the implementation delivers: Sparse result sets are processed faster. Unfortunately most operations require the sets to be less than 0.1% filled and worse, the overhead for non-sparse bitsets is prohibitive. There might be some special places where the tracked bitset can be used, but it is not viable as a general replacement for FixedBitSet.

Better scaling of top-X searches

This idea got pretty far, with the LUCENE-6828 JIRA issue and an analysis of the speed-up.

Requesting the top-X relevance ranked results from Lucene/Solr is pretty much the core use case. The underlying implementation uses a binary heap based priority queue with Java Objects. Due to the cache-adverse nature of binary heaps and the large overhead of tiny Objects, this does not scale well: Do not ask for top-1M hits and really do not ask for top-100M.

Using an array based bit packed heap implementation yields superior space- and speed-performance for top-100K+ requests (3x+ for both memory and speed, plus a lot less GC activity; see the previously linked analysis for details), with the advantages growing with the number of concurrent requests. No win for small top-X requests, but no big loss either.

So success? Sorta, kinda. Turns out that there are generally two scenarios:

  1. Request a relatively modest top-X, such as top-10 or top-1000.
  2. Export the full result set, which can be millions or billions.

As the array-based heap has no advantage over the vanilla Solr heap for small result sets, and as full exports are much better handled by Solr streaming export, there is near-zero need for the middle ground that the packed heap provides. It is a solution looking for a problem.

Cache-friendly priority queue

Since the vanilla Solr binary heap is cache-adverse, why not try to fix that problem? Yes, totally ignoring the “there is no problem as nobody requests large top-X results” part here.

The idea is fairly simple: Binary heaps work well when the heap can be fully contained in the CPU cache, so making a heap of heaps means that most operations will be done on data that is already cached. There is a very nice write-up on this at Playful Programming: Cache optimizing a priority queue.

This was implemented in Java in the hitdocs-trunk branch. Performance tests are in org.apache.lucene.search.TestHitQueue. The measurements below are from testPQPerformanceAtomicHeapsLimited, where sentinel is vanilla Solr, packed is the array-based heap described above and bheap4 is the best performing heaps-within-heaps implementation. Lower percentages are better and the winning implementation for each test is marked with ■.

Threads       Top-X      Hits      Sentinel         Packed         BHeap4 
      1     1000000    100000   118.49 100%     11.06   9%■    20.47  17% 
      1     1000000   1000000  1356.56 100%    133.51  10%■   270.82  20% 
      1     1000000   2000000  2216.50 100%    566.67  26%    472.43  21%■
      1     1000000   5000000  3406.35 100%    827.23  24%    771.33  23%■
      1     1000000  10000000  4183.84 100%   1054.94  25%   1012.78  24%■
      4     1000000    100000   246.00 100%     22.38   9%■    41.73  17% 
      4     1000000   1000000  2154.64 100%    232.00  11%■   504.80  23% 
      4     1000000   2000000  3356.96 100%   1057.01  31%    772.90  23%■
      4     1000000   5000000  5022.50 100%   1484.44  30%   1227.93  24%■
      4     1000000  10000000  6503.33 100%   1910.10  29%   1891.28  29%■
     16     1000000    100000  1310.95 100%     68.29   5%■   119.87   9% 
     16     1000000   1000000  6731.49 100%    605.90   9%■  1405.25  21% 
     16     1000000   2000000 11882.59 100%   3713.82  31%   2744.36  23%■
     16     1000000   5000000 19589.75 100%   5389.25  28%   4447.21  23%■
     16     1000000  10000000 25522.07 100%   6921.96  27%   4843.61  19%■

Note: The cases where hits <= top-X (100K and 1M) are special as packed sorts result sets with hits <= top-X using the more cache-friendly merge sort. This explains why packed is twice as fast as bheap4 for these cases. A similar optimization could be implemented for bheap4.

Looking through rose-tinted glasses, bheap4 is a bit better than packed (and they are both vastly superior to sentinel). But the implementation of bheap4 is bordering on voodoo, with the multi-level offset calculations and all the bit-tricks needed to avoid doing expensive multiplications and modulo calculations.

Hopefully the speed of bheap can be improved further – the Playful Programming author Björn Fahller certainly got fine results with his C++ implementation. But until that happens, its minuscule gains are not worth the added complexity.
