In The Danish Newspaper Archive you can search 25 million newspaper pages and view them. The search engine indexes OCR (optical character recognition) text from the scanned pages, but the software reading the text from the scanned images often makes reading errors. As a result, the search engine will miss matching words due to OCR errors. Since many of our newspapers are old, the quality of the scans/microfilms is often poor, and the OCR software has problems with old font types, bad OCR constitutes a substantial problem.
One way to find these OCR errors is the Word2Vec algorithm, which I have written about before. The algorithm detects words that appear in similar contexts. So for a corpus with perfect spelling the algorithm will detect similar words, synonyms, conjugations, declensions etc. But in the case of a corpus with OCR errors, the Word2Vec algorithm will also find the misspellings of a given word, whether from bad OCR or, in some cases, from the journalists themselves. A misspelled word appears in exactly the same contexts as all its correct spellings. For this to work the Word2Vec algorithm requires a huge corpus, and for the newspapers we had 140GB of raw text. So this is probably also the largest Word2Vec index ever built on a Danish corpus.
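The contextual idea can be illustrated with a tiny co-occurrence model in plain Python. This is not Word2Vec itself (which learns dense vectors with a shallow neural network), only a sketch of the distributional signal it exploits; the toy corpus, the window size, and the function names are illustrative assumptions:

```python
from collections import Counter
import math

def context_vectors(sentences, window=2):
    """Count co-occurring words within a window - the distributional
    signal that Word2Vec compresses into dense vectors."""
    vecs = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 'hanana' (a fake OCR error) occurs in the same contexts as 'banana',
# so their context vectors end up nearly identical.
corpus = [
    ["i", "ate", "a", "banana", "today"],
    ["i", "ate", "a", "hanana", "today"],
    ["the", "stock", "market", "fell", "today"],
]
vecs = context_vectors(corpus)
print(cosine(vecs["banana"], vecs["hanana"]))   # high (close to 1.0)
print(cosine(vecs["banana"], vecs["market"]))   # low
```

On 140GB of text the real algorithm of course sees each word in millions of contexts, which is why it needs such a huge corpus to be reliable.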
Given the list of words returned by Word2Vec, we then use a Danish dictionary to remove the same word in different forms, which is not an OCR error. For the remaining words you just check whether they are close enough, character by character, to be identified as misspellings.
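The character-closeness check could, for example, be a small edit-distance test. This is a minimal sketch; the function names and the threshold of 2 edits are my assumptions for illustration, not the archive's actual settings:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_probable_misspelling(word: str, candidate: str, max_dist: int = 2) -> bool:
    """Flag candidates within a small edit distance of the target word."""
    return 0 < levenshtein(word, candidate) <= max_dist

print(levenshtein("banana", "hanana"))   # → 1
```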
Example: Let's say you use Word2Vec to find words similar to 'banana' and it returns: hanana, bananas, apple, orange.
You remove 'bananas' using the (English) dictionary since this is not an OCR error. Of the three remaining words only 'hanana' is close to 'banana', and it is thus the only misspelling of 'banana' found in this example. Remember that the Word2Vec algorithm does not care how the words are spelled or misspelled; it only uses the semantic context of the words.
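The two filtering steps above (dictionary lookup, then character closeness) can be sketched with Python's difflib. The function name, the toy word lists, and the 0.8 similarity cutoff are illustrative assumptions, not the values used in the archive:

```python
import difflib

def find_misspellings(word, word2vec_neighbours, dictionary, cutoff=0.8):
    """Drop dictionary words (valid spellings), then keep only neighbours
    whose characters are close enough to the target to be misspellings."""
    candidates = [w for w in word2vec_neighbours if w not in dictionary]
    return difflib.get_close_matches(word, candidates, n=10, cutoff=cutoff)

# 'oranqe' is added as a fake OCR error of a *different* word: it survives
# the dictionary filter but is rejected by the character-closeness check.
neighbours = ["hanana", "bananas", "apple", "orange", "oranqe"]
dictionary = {"banana", "bananas", "apple", "orange"}
print(find_misspellings("banana", neighbours, dictionary))  # → ['hanana']
```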
You can play with the Word2Vec index on the Danish Newspapers here:
http://labs.statsbiblioteket.dk/dsc/ (remember to select the newspaper corpus)
And this page shows how the dictionary is used to find misspellings:
(change the last words in the url – Danish only sorry…)
Running this algorithm on 1,000 random words takes 8 hours (using 20 CPUs) and fixes 84 million OCR errors. A very few of them are false positives rather than OCR errors, but this is rare compared to true OCR errors. This last step, maximizing the number of OCR errors caught while minimizing false positives, is still in progress… 🙂
The newspaper Archive: http://www2.statsbiblioteket.dk/mediestream/avis