Faster DocValues in Lucene/Solr 7+

This is a fairly technical post explaining LUCENE-8374 and its implications for Lucene, Solr and (qualified guess) Elasticsearch search and retrieval speed. It is primarily relevant for people with indexes of 100M+ documents.


We have a Solr setup for Netarchive Search at the Royal Danish Library. Below are response times grouped by the magnitude of the hitCount with and without the Lucene patch.


Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

As can be seen, the median response time with the patch is about half that of vanilla Solr. The 95th percentile shows that the outliers have also been markedly reduced.

A long explanation follows of what the patch does and why indexes with fewer than 100M documents are not likely to see the same performance boost.

Lucene/Solr (bird's eye)

Lucene is a framework for building search engines. Solr is a search engine built using Lucene. Lucene, and thereby Solr, is known as an inverted index, referring to the terms→documents structure that ensures fast searches in large amounts of material.

As with most things, the truth is a little more complicated. Fast searches are not enough: Quite obviously it also helps to deliver a rich document representation as part of the search. More advanced features are grouping, faceting, statistics, mass exports etc. All of these have in common that they at some point need to map documents→terms.

Lucene indexes are logically made up of segments containing documents made up of fields containing terms (or numbers/booleans/raws…). Fields can be

  • indexed for searching, which means terms→documents lookup
  • stored for document retrieval
  • docValues for documents→terms lookup

stored and docValues representations can both be used for building a document representation as part of common search. stored cannot be used for grouping, faceting and similar purposes. The two strengths of stored are

  • Compression, which is most effective for “large” content.
  • Locality, meaning that all the terms for stored fields for a given document are stored together, making it low-cost to retrieve the content for multiple fields.

Whenever grouping, faceting etc. need the documents→terms mapping, it can either be resolved from docValues, which are built for this exact purpose, or by un-inverting the indexed terms. Un-inversion costs time & memory, so the strong recommendation is to enable docValues for grouping, faceting etc.

DocValues in Lucene/Solr 7+ (technical)

So the mission is to provide a documents→terms (and numbers/booleans/etc) lookup mechanism. In Lucene/Solr 4, 5 & 6 this mechanism had a random access API, meaning that terms could be requested for documents in no particular order. The implementation presented some challenges and from Lucene/Solr 7 this was changed to an iterator API (see LUCENE-7407), meaning that terms must be resolved in increasing document ID order. If the terms are needed for a document with a lower ID than previously requested, a new iterator must be created and the iteration starts from the beginning.
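The cost of going backwards can be illustrated with a toy forward-only iterator. This is a sketch in Python and not Lucene's API: the class ForwardOnlyValues, the function lookup_all and the step counter are hypothetical names made up for the illustration.

```python
class ForwardOnlyValues:
    """Toy forward-only lookup: an entry is only reachable by advancing to it."""

    def __init__(self, values):
        self.values = values   # values[docID], one value per document
        self.position = 0      # docID the iterator currently points at
        self.steps = 0         # number of entries stepped over so far

    def advance(self, doc_id):
        if doc_id < self.position:
            raise ValueError("cannot go backwards; create a new iterator")
        while self.position < doc_id:
            self.position += 1
            self.steps += 1
        return self.values[self.position]


def lookup_all(values, doc_ids):
    """Resolve values in the requested order, restarting the iterator when needed."""
    iterator = ForwardOnlyValues(values)
    finished_steps = 0
    result = []
    for doc_id in doc_ids:
        if doc_id < iterator.position:            # lower ID than previously requested
            finished_steps += iterator.steps
            iterator = ForwardOnlyValues(values)  # iteration starts from the beginning
        result.append(iterator.advance(doc_id))
    return result, finished_steps + iterator.steps
```

Requesting the same document IDs in increasing order visits each entry at most once, while an out-of-order request pays for a restart.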

Most of the code for this is available in Lucene70DocValuesProducer and IndexedDISI. Digging into it, the gains from the iterative approach become apparent: Besides a very clean implementation with lower risk of errors, the representation is very compact and requires very little heap to access. Indeed, the heap requirement for the search nodes in Netarchive Search at the Royal Danish Library was nearly halved when upgrading from Solr 4 to Solr 7. The compact representation is primarily the work of Adrien Grand in LUCENE-7489 and LUCENE-7589.

When reading the wall of text below, it helps to mentally view the structures as linked lists: To get to a certain point in the list, all the entries between the current entry and the destination entry need to be visited.

DocValues sparseness and packing

It is often the case that not all documents contain terms for a given field. When this is the case, the field is called sparse.

A trivial representation for mapping documents→terms for a field with 0 or 1 long values per document would be an array of long[#documents_in_segment], but this takes up 8 bytes/document, whether the document has a value defined or not.

LUCENE-7489 optimizes sparse values by using indirection: First step is to determine whether a document has a value or not. If it has a value, an index into a value-structure is derived. The second step is to retrieve the value from the value-structure. IndexedDISI takes care of the first step:

For each DocValues field, documents are grouped in blocks of 65536 documents. Each block starts with meta-data stating the block-ID and the number of documents in the block that have a value for the field. There are 4 types of blocks:

  • EMPTY: 0 documents in the block have a term.
  • SPARSE: 1-4095 documents in the block have a term.
  • DENSE: 4096-65535 documents in the block have a term.
  • ALL: 65536 documents in the block have a term.
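The four block types follow directly from the number of documents with a value. A minimal Python sketch of that classification, where the function name block_type is made up for the illustration and is not taken from the Lucene code:

```python
BLOCK_SIZE = 65536  # documents per block


def block_type(docs_with_value):
    """Classify a block from the number of its documents that have a value."""
    if not 0 <= docs_with_value <= BLOCK_SIZE:
        raise ValueError("a block holds between 0 and 65536 documents")
    if docs_with_value == 0:
        return "EMPTY"
    if docs_with_value < 4096:
        return "SPARSE"
    if docs_with_value < BLOCK_SIZE:
        return "DENSE"
    return "ALL"
```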

Step 1.1: Block skipping

To determine if a document has a value and what the index of the value is, the following pseudo-code is used:

while (blockIndex < docID/65536) {
  valueIndex += block.documents_with_values_count
  block = seekToBlock(block.nextBlockOffset)
  blockIndex++
}
if (!block.hasValue(docID%65536)) {
  return NO_VALUE  // No value for docID
}
valueIndex += block.valueIndex(docID%65536)

Unfortunately it does not scale with index size: At the Netarchive at the Royal Danish Library, we use segments with 300M values (not a common use case), which means that 4,500 blocks must be iterated in the worst case.

Introducing an indexValue cache solves this and the code becomes

valueIndex = valueCache[docID/65536]
block = seekToBlock(offsetCache[docID/65536])
if (!block.hasValue(docID%65536)) {
  return NO_VALUE  // No value for docID
}
valueIndex += block.valueIndex(docID%65536)

The while-loop has been removed and getting to the needed block is constant-time.
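To make the difference concrete, here is a small Python sketch that models a field as a list with the number of documents with values for each block. It only mirrors the valueCache part of the pseudo-code; the function names are hypothetical, and the real structures live on storage rather than in a Python list.

```python
from itertools import accumulate

BLOCK_SIZE = 65536


def base_index_iterative(block_counts, doc_id):
    """Walk the blocks one by one, as in the loop-based pseudo-code."""
    value_index = 0
    visited = 0
    for block_index in range(doc_id // BLOCK_SIZE):
        value_index += block_counts[block_index]
        visited += 1
    return value_index, visited


def build_value_cache(block_counts):
    """value_cache[i] = number of values in all blocks before block i."""
    return [0] + list(accumulate(block_counts))[:-1]


def base_index_cached(value_cache, doc_id):
    """Constant-time replacement for the loop: a single array lookup."""
    return value_cache[doc_id // BLOCK_SIZE]
```

The cache costs one entry per block, so a segment with 300M documents needs a few thousand entries per field.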

Step 1.2: Block internals

Determining the value index inside of the block is trivial for EMPTY and ALL blocks. SPARSE is a list of the documentIDs with values that is simply iterated (this could be a binary search). This leaves DENSE, which is the interesting one.

DENSE blocks contain a bit for each of their 65536 documents, represented as a bitmap = long[1024]. Getting the value index is a matter of counting the set bits up to the wanted document ID:

inBlockID = docID%65536
inBlockIndex = 0
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
}
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

This is not as bad as it seems as counting bits in a long is a single processor instruction on modern CPUs. Still, doing 1024 of anything to get a value is a bit much and this worst-case is valid for even small indexes.

This is solved by introducing another cache: rank = char[256] (a char is 16 bits):

inBlockID = docID%65536
longIndex = inBlockID/64
valueIndex += rank[longIndex/8]
inBlockIndex = longIndex/8*8
while (inBlockIndex < longIndex) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
}
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

The worst case is reduced to a rank-cache lookup and summing of the bits from 8 longs.
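A runnable Python sketch of the rank idea follows. It rests on two assumptions that are mine and not confirmed by the text: one rank entry per 8 longs (512 documents), and document n of the block being stored as bit n%64 (lowest bit first) of long n/64. The function names are hypothetical.

```python
LONGS_PER_BLOCK = 1024   # 65536 documents / 64 bits per long
LONGS_PER_RANK = 8       # assumption: one rank entry per 8 longs (512 documents)


def popcount(word):
    """Number of set bits in a 64-bit word."""
    return bin(word).count("1")


def build_rank(bitmap):
    """rank[i] = number of set bits in all longs before long i * LONGS_PER_RANK."""
    rank, total = [], 0
    for index, word in enumerate(bitmap):
        if index % LONGS_PER_RANK == 0:
            rank.append(total)
        total += popcount(word)
    return rank


def value_index_in_block(bitmap, rank, in_block_id):
    """Count the documents with a value that precede in_block_id in the block."""
    long_index = in_block_id // 64
    first_long = long_index // LONGS_PER_RANK * LONGS_PER_RANK
    value_index = rank[long_index // LONGS_PER_RANK]
    for index in range(first_long, long_index):   # at most 7 full longs
        value_index += popcount(bitmap[index])
    below = (1 << (in_block_id % 64)) - 1         # bits below the wanted document
    return value_index + popcount(bitmap[long_index] & below)
```

Every rank entry is smaller than 65536, so it fits in 16 bits.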

Now that step 1 (value existence and value index) has been taken care of, the value itself needs to be resolved.

Step 2: Whole numbers representation

There are different types of values in Lucene/Solr: Strings, whole numbers, floating point numbers, booleans and binaries. On top of that a field can be single- or multi-valued. Most of these values are represented in a way that provides direct lookup in Lucene/Solr 7, but whole numbers are special.

In Java whole numbers are represented in a fixed number of bytes, depending on type: 1 byte for byte, 2 bytes for short or char, 4 bytes for int and 8 bytes for long. This is often wasteful: The sequence [0, 3, 2, 1] could be represented using only 2 bits/value. The sequence [0, 30000000, 20000000, 10000000] could also be represented using only 2 bits/value if it is known that the greatest common divisor is 10⁷. The list of tricks goes on.

For whole numbers, Lucene/Solr uses both the smallest number of bits required by PackedInts for a given sequence as well as greatest common divisor and constant offset. These compression techniques work poorly both for very short sequences and for very long ones; LUCENE-7589 splits whole numbers into sequences of 16384 numbers.
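The arithmetic behind these tricks can be sketched in a few lines of Python. This is not the PackedInts implementation, only the calculation of how many bits per value remain after subtracting a constant offset and dividing by the greatest common divisor; the function names are made up for the illustration.

```python
from functools import reduce
from math import gcd


def packing_parameters(values):
    """Return (bits_per_value, offset, divisor) for a sequence of whole numbers."""
    offset = min(values)
    deltas = [value - offset for value in values]
    divisor = reduce(gcd, deltas) or 1      # all-equal values give a gcd of 0
    largest = max(deltas) // divisor
    return largest.bit_length(), offset, divisor   # 0 bits: all values are equal


def unpack(packed, offset, divisor):
    """Restore the original value from its packed representation."""
    return packed * divisor + offset
```

Both example sequences from above end up at 2 bits/value.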

Getting the value for a given index is a matter of locating the right block and extracting the value from that block:

while (longBlockIndex < valueIndex/16384) {
  longBlock = seekToLongBlock(longBlock.nextBlockOffset)
  longBlockIndex++
}
value = longBlock.getValue(valueIndex%16384)

This uses the same principle as for value existence and the penalty for iteration is also there: In our 300M documents/segment index, we have 2 numeric fields where most values are present. They have 28,000 blocks each, which must all be visited in the worst case.

The optimization is the same as for value existence: Introduce a jump table.

longBlock = seekToLongBlock(longJumps[valueIndex/16384])
value = longBlock.getValue(valueIndex%16384)

Value retrieval becomes constant time.
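A jump table is needed, rather than a simple multiplication, when the blocks do not all take up the same number of bytes, which is what the nextBlockOffset links suggest. A minimal Python sketch under that assumption, with the block byte lengths given as a plain list and hypothetical function names:

```python
from itertools import accumulate

VALUES_PER_BLOCK = 16384


def build_long_jumps(block_byte_lengths):
    """long_jumps[i] = byte offset of block i, relative to the first block."""
    return [0] + list(accumulate(block_byte_lengths))[:-1]


def locate(long_jumps, value_index):
    """Return (byte offset of the block, index of the value inside the block)."""
    return long_jumps[value_index // VALUES_PER_BLOCK], value_index % VALUES_PER_BLOCK
```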

Theoretical observations

  • With a pure iterative approach, performance goes down when segment size goes up and the amount of data to retrieve grows more slowly than index size. The performance slowdown only happens after a certain point! As long as the gap between the docIDs is small enough to be within the current or the subsequent data chunk, pure iteration is fine.
    Consequently, the requests that involve lots of monotonically increasing docID lookups (faceting, sorting & grouping for large result sets) fit the iterative API well as they need data from most data blocks.
    Requests that involve fewer monotonically increasing docID lookups (export & document retrieval for all requests, faceting, sorting & grouping for small result sets) fit poorly as they result in iteration over data blocks that do not provide any other information than a link to the next data block.
  • As all the structures are storage-backed, iterating all data blocks – even when it is just to get a pointer to the next block – means a read request. This is problematic, unless there is plenty of RAM for caching: Besides the direct read-time impact, the docValues structures will hog the disk cache.

With this in mind, it makes sense to check the patch itself for performance regressions with requests for a lot of values as well as test with the disk cache fully warmed and containing the structures that are used. Alas, this has to go on the to-do for now.


Hardware & index

Testing was done against our production Netarchive Search. It consists of 84 collections, accessed as a single collection using Solr’s alias mechanism. Each collection is roughly 300M documents / 900GB of index data optimized to 1 segment, each segment on a separate SSD. Each machine has 384GB of RAM with about 220GB free for disk cache. There are 4 machines, each serving 25 collections (except the last one that only serves 9 at the moment). This means that ~1% of total index size is disk cached.


  • Queries were constructed by extracting terms of varying use from the index and permuting them for simple 1-4 term queries
  • All tests were against the full production index, issued at times when it was not heavily used
  • Queries were issued single-threaded, with no repeat queries
  • All test setups were executed 3 times, with a new set of queries each time
  • The order of patch vs. sans-patch tests was always patch first, to ensure that any difference in patch favour was not due to better disk caching

How to read the charts

All charts are plotted with number of hits on the x-axis and time on the y-axis. The x-axis is logarithmic with the number of hits bucketed by magnitude: First bucket holds all measurements with 1-9 hits, second bucket holds those with 10-99 hits, the third holds those with 100-999 hits and so forth.
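The bucketing by magnitude can be written down in a couple of lines. This Python sketch shows the principle and is not the script used to produce the charts; the function names are made up.

```python
def hit_bucket(hits):
    """Bucket number for a hit count: 0 for 1-9 hits, 1 for 10-99, 2 for 100-999..."""
    if hits < 1:
        raise ValueError("only requests with at least 1 hit are bucketed")
    return len(str(hits)) - 1


def bucket_range(bucket):
    """Smallest and largest hit count that falls in the given bucket."""
    return 10 ** bucket, 10 ** (bucket + 1) - 1
```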

The response times are displayed as box plots where

  • Upper whisker is the 95th percentile
  • Top of the box is the 75th percentile
  • Black bar is the 50th percentile (the median)
  • Bottom of the box is the 25th percentile
  • Lower whisker is minimum measured time

Each bucket holds 4 boxes

  • Test run 2, patch enabled
  • Test run 2, vanilla Solr
  • Test run 3, patch enabled
  • Test run 3, vanilla Solr

Test run 1 is discarded to avoid jitter from cache warming. Ideally the boxes from run 3 should be the same as for run 2. However, as the queries are always new and unique, an amount of variation is to be expected.

Important note 1: The Y-axis max-value changes between some of the charts.

Document retrieval

There seems to be some disagreement as to whether the docValues mechanism should ever be used to populate documents, as opposed to using stored. This blog post will only note that docValues are indeed used for this purpose at the Royal Danish Library and let it be up to the reader to seek more information on the matter.

There are about 70 fields in total in Netarchive Search, with the vast majority being docValued String fields. There are 6 numeric docValued fields.


Retrieval of top-20 documents with all field values

Observation: Response times for patched (blue & green) are markedly lower than vanilla (pink & orange). The difference is fairly independent of hit count, which matches well with the premise that the result set size is constant at 20 documents.


Grouping on the String field url_norm is used in Netarchive Search to avoid seeing too many duplicates. To remove the pronounced difference caused by document retrieval, only the single field url_norm is requested for only 1 group with 1 document.


Grouping on url_norm

Observation: The medians for patched vs. vanilla are about the same, with a slight edge to patched. The outliers (the top T of the boxes) are higher for vanilla.


Faceting is done for 6 fields of varying cardinality. As with grouping, the effect of document retrieval is kept to a minimum.


Faceting on fields domain, crawl_year, public_suffix, content_type_norm, status_code, host

Observation: Patched is an improvement over vanilla up to 10M+ hits.


In this test, sorting is done descending on content_length, to locate the largest documents in the index. As with grouping, the effect of document retrieval is kept to a minimum.


Sorting on content_length

Observation: Patched is a slight improvement over vanilla.


In order to provide an approximate hitCount with grouping, the cardinality of the url_norm field is requested. As with grouping, the effect of document retrieval is kept to a minimum.


HyperLogLog cardinality on url_norm

Observation: Too much jitter to say if patch helps here.

Numeric statistics

Statistics (min, max, average…) on content_length are a common use case in Netarchive Search. As with grouping, the effect of document retrieval is kept to a minimum.


Numeric statistics on content_length

Observation: Patched is a slight improvement over vanilla.

Cocktail effect, sans document

Combining faceting, grouping, stats and sorting while still minimizing the effect of document retrieval.


Faceting on 6 fields, grouping on url_norm, stats on content_length and sorting on content_length

Observation: Patched is a clear improvement over vanilla.

Production request combination

The SolrWayback front end for Netarchive Search commonly uses document retrieval for top-10 results, grouping, cardinality and faceting. This is the same chart as the teaser at the top, with the addition of test run 2.


Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

Observation: Patched is a pronounced improvement over vanilla.

Combining multiple request parameters that use docValues is interesting, as the effect of the patch on the whole seems greater than the sum of the individual parts. This could be explained by cache/IO saturation when using vanilla Solr. Whatever the cause, this shows that it is important to try to simulate real-world workflows as closely as possible.

Overall observations

  • For most of the performance tests, the effect of the LUCENE-8374 patch vs. vanilla is pronounced, but limited in magnitude
  • Besides lowering the median, there seems to be a tendency for the patch to reduce outliers, notably for grouping
  • For document retrieval, the patch improved performance significantly. Separate experiments show that export gets a similar speed boost
  • For all the single-feature tests, the active parts of the index data are so small that they are probably cached. Coupled with the limited improvement that the patch gives for these tests, it indicates that the patch will in general have little effect on systems where the full index is in disk cache
  • The performance gains with the “Production request combination” aka the standard requests from our researchers, are very high

Future testing

  • Potential regression for large hit counts
  • Max response times (not just percentile 95)
  • Concurrent requests
  • IO-load during tests
  • Smaller corpus
  • Export/streaming
  • Disk cache influence

Want to try?

There is a patch for Solr trunk at LUCENE-8374 and it needs third party validation from people with large indexes. I’ll port it to any Solr 7.x-version requested and if building Solr is a problem, I can also compile it and put it somewhere.

Hopefully it will be part of Solr at some point.

Update 20181003: Patch overhead and status

Currently the patch is search-time only. Technically it could also be index-time by modifying the codec.

For a single index in the Netarchive Search setup, the patch adds 13 seconds to first search-time and 31MB of heap out of 8GB allocated for the whole Solr. The 13 seconds is in the same ballpark (this is hard to measure) as a single unwarmed search with top-1000 document retrieval.

The patch is ported to Solr 7.3.0 and used in production at the Royal Danish Library. It is a debug-patch, meaning that the individual optimizations can be enabled selectively for easy performance comparison.

See the LUCENE-8374 JIRA-issue for details.

Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr | Leave a comment

Prebuilt Big Data Word2Vec dictionaries

Prebuilt and trained Word2Vec dictionaries ready for use

Two different prebuilt big data Word2Vec dictionaries have been added to LOAR (Library Open Access Repository) for download. These dictionaries are built from the text of 55,000 e-books from Project Gutenberg and 32,000,000 Danish newspaper pages.

35,000 of the Gutenberg e-books are English, but over 50 different languages are present in the dictionaries. Even though the languages are mixed, the Word2Vec algorithm did a good job of separating them, so it is almost like having 50 different Word2Vec dictionaries.

The text from the Danish newspapers is not publicly available, so you would not be able to build this dictionary yourself. A total of 300GB of raw text went into building the dictionary, so it is probably the largest Word2Vec dictionary built on a Danish corpus. Since the Danish newspapers suffer from low quality OCR, many of the words in the dictionary are misspellings. Using this dictionary it was possible to fix many of the OCR errors due to the nature of the Word2Vec algorithm, since a given word appears in similar contexts despite its misspellings and is identified by its context. (see

Download and more information about the Word2Vec dictionaries:



Online demo of the two corpora: Word2Vec demo








SolrWayback software bundle has been released

The SolrWayback software bundle can be used to search and play back archived webpages in Warc format. It is an out of the box solution with index workflow, Solr and Tomcat webserver and a free text search interface with playback functionality. Just add your Warc files to a folder and start the index job.

The search interface has additional features besides freetext search. This includes:

  • Image search similar to Google Images
  • Search by uploading a file (image/pdf etc.). See if the resource has been harvested and from where.
  • Link graph showing links (ingoing/outgoing) for domains using the D3 JavaScript framework.
  • Raw download of any harvested resource from the binary Arc/Warc file.
  • Export a search result set to a Warc file. Streaming download, no limit on the size of the result set.
  • An optional built-in SOCKS proxy can be used to view historical webpages without the browser leaking resources from the live web.

See the GitHub page for screenshots of SolrWayback and scroll down to the install guide to try it out.

Link: SolrWayback






Visualising Netarchive Harvests


An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a recent frontend visualisation widget we have made.

From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days to enable us to visualise the data. Furthermore the data is given an activity level from 0-4.

The high-level overview seen below is the year-month graph. Each cell is coloured based on the activity level relative to the number of harvests in the most active month. For now we use a linear calculation so gray means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100% of the most active month. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.
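The linear calculation can be sketched as below. The widget itself runs in the browser; this Python version only illustrates the arithmetic. The function name is hypothetical, and the handling of values that fall exactly on a boundary (25% counts as level 1) is an assumption.

```python
def activity_level(harvests, most_active):
    """Map a harvest count to level 0-4, relative to the most active month."""
    if harvests < 0 or harvests > most_active:
        raise ValueError("harvests must be between 0 and most_active")
    if harvests == 0:
        return 0                                       # gray: no activity
    # Ceiling of 4 * harvests / most_active, using integer arithmetic only
    return (4 * harvests + most_active - 1) // most_active
```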



We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.



In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011, for this blog post).



There are both advantages and disadvantages to using the browser-based data calculation. One of the main advantages is a very portable frontend application. It can be used with any backend application that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. The drawback to this approach is, of course, the scalability. Currently the application processes 25,000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and it is included in a single script tag, including styles.

Ideas for further development.

We would like to expand this to also include both:

  1. multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time, e.g. the URL, and could be used for the Danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.

SolrWayback Machine

Another ‘Google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive's Wayback Machine and can be used to show harvested web content (Warc files). The Danish Internet Archive has over 20 billion harvested web objects and takes up 1 petabyte of storage.

The SolrWayback engine requires that you have indexed the Warc files using the Warc-indexer tool from the British Library.

It is quite fast and comes with some additional features as well:

  • Image search similar to Google Images
  • Link graphs showing links (ingoing/outgoing) for domains using the D3 JavaScript framework.
  •  Raw download of any harvested resource from the binary Arc/Warc file.

Unfortunately the collection is not available to the public, so I cannot show you the demo. But here are a few pictures from the SolrWayback Machine.



SolrWayback at GitHub:


juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives:

Combining those two ideas seemed like a logical next step and juxta was born: A fairly small bash-script for creating million-scale collages of images, with no special server side.  There’s a small (just 1000 images) demo at SBLabs.

Presentation principle

The goal is to provide a seamless transition from the full collection to individual items, making it possible to compare nearby items with each other and locate interesting ones. Contextual metadata should be provided for general information and provenance.

Concretely, the user is presented with all images at once and can zoom in to individual images in full size. Beyond a given threshold, metadata are shown for the image currently under the cursor, or finger if a mobile device is used. An image description is displayed just below the focused image, to avoid disturbing the view. A link to the source of the image is provided on top.


Overview of historical maps


Meta-data for a specific map

Technical notes, mostly on scaling

On the display side, OpenSeadragon takes care of the nice zooming. When the user moves the focus, a tiny bit of JavaScript spatial math resolves image identity and visual boundaries.

OpenSeadragon uses pyramid tiles for display and supports the Deep Zoom protocol, which can be implemented using only static files. The image to display is made up of tiles of (typically) 256×256 pixels. When the view is fully zoomed, only the tiles within the viewport are requested. When the user zooms out, the tiles from the level above are used. The level above is half the width and half the height and is thus represented by ¼ the number of tiles. And so forth.
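The size of such a pyramid is easy to calculate. The Python sketch below lists the dimensions and tile count for each level, from full zoom down to 1×1 pixel. It assumes that odd dimensions are rounded up when halved; the function name is made up and the code is not part of juxta or OpenSeadragon.

```python
TILE = 256  # tile edge in pixels


def pyramid_levels(width, height, tile=TILE):
    """List (width, height, tile_count) per level, from full zoom down to 1x1 pixel."""
    levels = []
    while True:
        columns = -(-width // tile)    # ceiling division
        rows = -(-height // tile)
        levels.append((width, height, columns * rows))
        if width == 1 and height == 1:
            return levels
        width = max(1, -(-width // 2))     # the level above is half the width...
        height = max(1, -(-height // 2))   # ...and half the height
```

As the tile count is quartered at each step, the levels above full zoom together add only about one third on top of the tiles at the deepest level.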

Generating tiles is heavy

A direct way of creating the tiles is

  1. Create one large image of the full collage (ImageMagick’s montage is good for this)
  2. Generate tiles for the image
  3. Scale the image down to 50%×50%
  4. If the image is larger than 1×1 pixel then goto 2

Unfortunately this does not scale particularly well. Depending on size and tools, it can take up terabytes of temporary disk space to create the full collage image.

By introducing a size constraint, juxta removes this step: All individual source images are scaled & padded to have the exact same size. The width and height of the images are exact multiples of 256. Then the tiles can be created by

  1. For each individual source image, scale, pad and split the image directly into tiles
  2. Create the tiles at the level above individually by joining the corresponding 4 tiles below and scale to 50%×50% size
  3. If there is more than 1 tile or that tile is larger than 1×1 pixel then goto 2

As the tiles are generated directly from either source images or other tiles, there is no temporary storage overhead. As each source image and each tile are processed individually, it is simple to do parallel processing.

Metadata takes up space too

Displaying image-specific metadata is simple when there are just a few thousand images: Use an in-memory array of Strings to hold the metadata and fetch it directly from there. But when the number of images goes into the millions, this quickly becomes unwieldy.

juxta groups the images spatially in buckets of 50×50 images. The metadata for all the images in a bucket are stored in the same file. When the user moves the focus to a new image, the relevant bucket is fetched from the server and the metadata are extracted. A bucket cache is used to minimize repeat calls.
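The bucket principle can be sketched as follows. juxta does this in JavaScript in the browser; the Python below only illustrates the idea. The row-by-row layout inside a bucket, the fetch callback and all names are assumptions and not taken from the juxta source.

```python
BUCKET_SIDE = 50  # a bucket covers 50x50 images


def bucket_for(column, row):
    """Bucket coordinates for the image at the given collage position."""
    return column // BUCKET_SIDE, row // BUCKET_SIDE


def index_in_bucket(column, row):
    """Position of the image inside its bucket, assuming row-by-row order."""
    return (row % BUCKET_SIDE) * BUCKET_SIDE + column % BUCKET_SIDE


class BucketCache:
    """Fetches each bucket at most once and answers from memory afterwards."""

    def __init__(self, fetch):
        self.fetch = fetch    # callable: bucket coordinates -> list of metadata
        self.buckets = {}

    def metadata(self, column, row):
        key = bucket_for(column, row)
        if key not in self.buckets:
            self.buckets[key] = self.fetch(key)
        return self.buckets[key][index_in_bucket(column, row)]
```

With this layout, moving the focus between neighbouring images only triggers a request when a bucket border is crossed.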

Most file systems don’t like to hold a lot of files in the same folder

While the limits differ, common file systems such as ext, hfs & ntfs all experience performance degradation with high numbers of files in the same folder.

The Deep Zoom protocol in conjunction with file-based tiles means that the number of files at the deepest zoom level is linear in the number of source images. If there are 1 million source images, with full-zoom size 512×512 pixels (2×2 tiles), the number of files in a single folder will be 2*2*1M = 4 million. That is far beyond the comfort zone of the mentioned file systems (see the juxta readme for tests of performance degradation).

juxta mitigates this by bucketing tiles in sub-folders. This ensures linear scaling of build time at least up to 5-10 million images. 100 million+ images would likely deteriorate build performance markedly, but at that point we are also entering “are there enough free inodes on the file system?” territory.

Unfortunately the bucketing of the tile files is not in the Deep Zoom standard. With OpenSeadragon, it is very easy to change the mapping, but it might be more difficult for other Deep Zoom-expecting tools.

Some numbers

Using a fairly modern i5 desktop and 3 threads, generating a collage of 280 5MPixel images, scaled down to 1024×768 pixels (4×3 tiles) took 88 seconds or about 3 images/second. Repeating the experiment with a down-scale to 256×256 pixels (smallest possible size) raised the speed to about 7½ images/second.

juxta comes with a scale-testing script that generates sample images that are close (but not equal) to the wanted size and repeats them for the collage. With this near-ideal match, processing speed was 5½ images/second for 4×3 tiles and 33 images/second for 1×1 tiles.

The scale-test script has been used up to 5 million images, with processing time practically linear to the number of images. At 33 images/second that is 42 hours.


Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for Digital Humanities in the Nordic Countries 2nd Conference,

In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] uses OCR (optical character recognition) from scanned pages but often the software converting the scanned images to text makes reading errors. As a result the search engine will miss matching words due to OCR errors. Since many of our newspapers are old and the scans/microfilms are also of low quality, the resulting OCR constitutes a substantial problem. In addition, the OCR converter performs poorly with old font types such as fraktur.

One way to find OCR errors is by using the unsupervised Word2Vec[3] learning algorithm. This algorithm identifies words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words: synonyms, conjugations, declensions etc. In the case of a corpus with OCR errors the Word2Vec algorithm will find the misspellings of a given word, either from bad OCR or in some cases from journalists. A given word appears in similar contexts despite its misspellings and is identified by its context. For this to work the Word2Vec algorithm requires a huge corpus and for the newspapers we had 140GB of raw text.

Given the words returned by Word2Vec we use a Danish dictionary to remove the same word in different grammatical forms. The remaining words are filtered by a similarity measure using an extended version of Levenshtein distance taking the length of the word and an idempotent normalization taking frequent one and two character OCR errors into account.

Example: Let’s say you use Word2Vec to find words for banana and it returns: hanana, bananas, apple, orange. Remove bananas using the (English) dictionary since this is not an OCR error. For the three remaining words only hanana is close to banana and it is thus the only misspelling of banana found in this example. The Word2Vec algorithm does not know how a word is spelled/misspelled, it only uses the semantic and syntactic context.
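The filtering in the banana example can be sketched in Python. Note that this uses plain Levenshtein distance with a fixed threshold as a stand-in for the extended similarity measure described above, and that the set of grammatical forms is passed in directly instead of being looked up in a dictionary. The function names and the threshold are assumptions made for the illustration.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def ocr_variants(word, candidates, grammatical_forms, max_distance=1):
    """Keep the candidates that look like misspellings of the given word."""
    return [candidate for candidate in candidates
            if candidate not in grammatical_forms
            and levenshtein(word, candidate) <= max_distance]
```

For the example, `ocr_variants("banana", ["hanana", "bananas", "apple", "orange"], {"bananas"})` leaves only hanana.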

This method is not an automatic OCR error corrector and cannot output the corrected OCR. But when searching it will appear as if you are searching in an OCR corrected text corpus. Single word searches on the full corpus give an increase from 3% to 20% in the number of results returned. Preliminary tests on the full corpus show only relatively few false positives among the additional results returned, thus increasing recall substantially without a decline in precision.

The advantage of this approach is a quick win with minimum impact on a search engine [2] based on low quality OCR. The algorithm generates a text file with synonyms that can be used by the search engine. Not only single words but also phrase search with highlighting works out of the box. An OCR correction demo[4] using Word2Vec on the Danish newspaper corpus is available on the Labs[5] pages of The State And University Library, Denmark.

[1] Mediestream, The Danish digitized newspaper archive.

[2] SOLR or Elasticsearch etc.

[3] Mikolov et al., Efficient Estimation of Word Representations in Vector Space

[4] OCR error detection demo (change word parameter in URL)

[5] Labs for State And University Library, Denmark

