We are currently at the starting phase of two new search-oriented projects at Statsbiblioteket. The frontenders are playing with wireframes and the backenders are being asked for hardware requirements. Estimating hardware requirements is tricky, though. This post is about our newspaper project; the next will be about indexing our net archive.
32 million newspaper pages
The digitization of Danish newspapers will run for 3 years, during which about 32 million scanned and OCR'ed pages will be delivered. The segmentation (division of the OCR'ed text into sections and articles) appears to be of high quality, so we plan on indexing the newspapers at article level. The article count per page varies, but 15 articles/page seems like a fair guess. This gives us an estimated index of 15 * 32M = 480 million articles. Loose testing, using sample pages as templates and randomly exchanging and generating new terms, gives us a loosely estimated index size of 1.5TB. Unfortunately it is not at all simple to provide a high-quality guess for what the real index size will be.
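The back-of-the-envelope sizing above can be written out explicitly. The constants below are the rough guesses from the text, not measured values:

```python
# Rough sizing estimate for the newspaper index.
ARTICLES_PER_PAGE = 15           # fair guess from sample pages; varies a lot
TOTAL_PAGES = 32_000_000         # expected deliveries over the 3-year project

total_articles = ARTICLES_PER_PAGE * TOTAL_PAGES
print(total_articles)  # 480000000, i.e. 480 million index documents
```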
The problem with scanned text
We know the enemy and it is dirty OCR. The old newspapers used Fraktur, which is recognized quite poorly by Optical Character Recognition. To make matters worse, the ink from old newspapers bleeds into the paper, and 100+ years of storage does leave its mark. The result is very messy text where a lot of new words (which are really the old words, just spelled in new and interesting ways due to faulty character guessing) appear on each page. The guys at Hathi Trust know this all too well. So how many unique words can we expect?
Unique words estimation
We have scanning and OCR samples of pages from 1795, 1902 and 2012. Counting the number of unique terms and plotting graphs for the total number of unique terms as the document count rises gives us the following:
So how many unique terms can we expect if we extrapolate the trends from these graphs to 32 million pages? We tried with linear and logarithmic regression and got:
| Time | Sample pages | Sample terms | Linear 32M | Logarithmic 32M |
|------|--------------|--------------|------------|-----------------|
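The two extrapolations can be sketched with plain least-squares fits: a linear model terms = a * pages + b, and a logarithmic model terms = a * ln(pages) + b. The function below is a minimal sketch of the approach, not our actual analysis code, and any sample points fed to it are illustrative:

```python
import math


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def extrapolate_unique_terms(pages, terms, target_pages):
    """Extrapolate unique-term counts from samples to target_pages,
    using both a linear and a logarithmic (y = a*ln(x) + b) fit."""
    a_lin, b_lin = fit_linear(pages, terms)
    # The logarithmic fit is just a linear fit in ln(pages).
    a_log, b_log = fit_linear([math.log(x) for x in pages], terms)
    return (a_lin * target_pages + b_lin,
            a_log * math.log(target_pages) + b_log)
```

Because the linear model keeps growing at the sample rate while the logarithmic model flattens out, the two predictions diverge wildly when pushed from a few thousand sample pages to 32 million.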
We would like to use Solr fuzzy search in order to match OCR'ed terms with recognition errors. Preliminary testing of fuzzy search indicates that a single Solr box can deliver adequate performance for 1-200 million unique terms. Adequate means that most searches finish below 1 second, with harder searches taking a few seconds. If the number of unique terms ends up in the multiple billions, we will probably have to skip or severely restrict fuzzy search.
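Fuzzy matching in Solr uses the standard Lucene syntax term~N, where N is the maximum edit distance (at most 2). A minimal sketch of building such a request, with a hypothetical core name and field:

```python
from urllib.parse import urlencode


def fuzzy_query_url(base_url, field, term, max_edits=2, rows=10):
    """Build a Solr select URL with a Lucene fuzzy clause (term~N).
    N is the maximum edit distance; Lucene caps it at 2."""
    q = f"{field}:{term}~{max_edits}"
    return f"{base_url}/select?{urlencode({'q': q, 'rows': rows})}"


# 'newspapers' core and 'text' field are illustrative names.
url = fuzzy_query_url("http://localhost:8983/solr/newspapers",
                      "text", "avisen")
```

A query like text:avisen~2 matches every indexed term within two edits of "avisen", which is exactly why performance degrades as the number of unique terms grows: the candidate set expands with the term dictionary.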
As can be seen in the table, depending on the chosen extrapolation algorithm, we will either have several billion or less than one million unique terms in the full corpus. The loosely estimated index size of 1.5TB is for ~4 billion unique terms. Neither linear nor logarithmic extrapolation seems to yield very realistic numbers with the relatively few samples that we have. We need more data!