Hardware guesstimation

We are currently at the starting phase of two new search-oriented projects at Statsbiblioteket. The frontenders are playing with wireframes and the backenders are being asked for hardware requirements. Estimating hardware requirements is tricky, though. This post is about our newspaper project; the next will be about indexing our net archive.

32 million newspaper pages

The digitization of Danish newspapers will run for 3 years, during which about 32 million scanned and OCR'ed pages will be delivered. The segmentation (division of the OCR'ed text into sections and articles) appears to be good, so we plan on indexing the newspapers at article level. The article count per page varies, but 15 articles/page seems like a fair guess. This gives us an estimated index of 15 * 32M = 480 million articles. Loose testing, using sample pages as templates and randomly exchanging and generating new terms, gives us a loosely estimated index size of 1.5TB. Unfortunately it is not at all simple to provide a high-quality guess for what the real index size will be.
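A minimal sketch of the back-of-envelope arithmetic (the bytes-per-article figure is just the rough 1.5TB estimate divided evenly over the articles, an assumption rather than a measurement):

    # Back-of-envelope estimate of article count and average index size per article.
    PAGES = 32_000_000          # expected scanned and OCR'ed pages over 3 years
    ARTICLES_PER_PAGE = 15      # rough guess from sample pages
    INDEX_SIZE_TB = 1.5         # loosely estimated total index size

    articles = PAGES * ARTICLES_PER_PAGE
    bytes_per_article = INDEX_SIZE_TB * 10**12 / articles

    print(f"Estimated articles: {articles:,}")                    # 480,000,000
    print(f"Index bytes per article: ~{bytes_per_article:,.0f}")  # ~3,125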

The problem with scanned text

We know the enemy and it is dirty OCR. The old newspapers used Fraktur, which is quite poorly recognized by Optical Character Recognition. To make matters worse, the ink from old newspapers bleeds into the paper, and 100+ years of storage leave their mark. The result is very messy text, where a lot of new words (which are really the old words, just spelled in new and interesting ways due to faulty character guessing) appear on each page. The guys at Hathi Trust know this all too well. So how many unique words can we expect?

Unique words estimation

We have scanning and OCR samples of pages from 1795, 1902 and 2012. Counting the unique terms and plotting their cumulative total as the document count rises gives us the following:

[Graphs: cumulative unique terms as a function of document count for the 1795, 1902 and 2012 samples]
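A minimal sketch of how such a curve can be produced from a folder of OCR'ed text files (the folder layout, UTF-8 encoding and whitespace-ish tokenization are assumptions; the real samples need proper tokenization):

    import re
    from pathlib import Path

    def cumulative_unique_terms(folder):
        """Yield (documents_processed, unique_terms_so_far) for each text file."""
        seen = set()
        for count, path in enumerate(sorted(Path(folder).glob("*.txt")), start=1):
            # Very naive tokenization: lowercase runs of word characters.
            terms = re.findall(r"\w+", path.read_text(encoding="utf-8").lower())
            seen.update(terms)
            yield count, len(seen)

    # Example: print the curve for the 1795 sample (folder name is hypothetical).
    for docs, uniques in cumulative_unique_terms("samples/1795"):
        print(docs, uniques)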

So how many unique terms can we expect if we extrapolate the trends from these graphs to 32 million pages? We tried linear and logarithmic regression and got the following:

Year | Sample pages | Sample terms | Linear, 32M pages | Logarithmic, 32M pages
1795 |           32 |       14,290 |    13,789,584,399 |                 73,730
1902 |           36 |       47,449 |    39,738,053,360 |                237,191
2012 |          316 |       85,038 |     7,594,297,625 |                310,971
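For reference, a minimal sketch of how such an extrapolation can be done with NumPy (the per-page counts below are invented for illustration; only the final point matches the 1795 sample in the table):

    import numpy as np

    def extrapolate(doc_counts, unique_counts, target_docs=32_000_000):
        """Fit linear and logarithmic models to (documents, unique terms)
        and extrapolate both to the target document count."""
        x = np.asarray(doc_counts, dtype=float)
        y = np.asarray(unique_counts, dtype=float)

        # Linear model: unique = a * docs + b
        a_lin, b_lin = np.polyfit(x, y, 1)
        linear = a_lin * target_docs + b_lin

        # Logarithmic model: unique = a * ln(docs) + b
        a_log, b_log = np.polyfit(np.log(x), y, 1)
        logarithmic = a_log * np.log(target_docs) + b_log

        return linear, logarithmic

    # Hypothetical cumulative curve for the 1795 sample (32 pages, 14,290 terms).
    pages = [1, 2, 4, 8, 16, 32]
    uniques = [700, 1300, 2500, 4600, 8300, 14290]
    print(extrapolate(pages, uniques))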

We would like to use Solr fuzzy search in order to match OCR’ed terms with recognition errors. Preliminary testing of fuzzy search indicates that a single Solr box can deliver adequate performance for 1-200 million unique terms. Adequate means that most searches are below 1 second with harder searches taking a few seconds. If the number of unique terms ends up in the multiple billions, we probably have to skip or severely restrict fuzzy search.
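As a rough illustration, the kind of query we have in mind uses the Lucene fuzzy operator (edit distance up to 2); the host, collection and field names below are assumptions, not our actual setup:

    import requests

    SOLR = "http://localhost:8983/solr/newspapers/select"  # hypothetical collection

    def fuzzy_search(term, max_edits=2, rows=10):
        """Search the (assumed) 'text' field with Lucene fuzzy matching."""
        params = {
            "q": f"text:{term}~{max_edits}",  # e.g. text:kjøbenhavn~2
            "rows": rows,
            "wt": "json",
        }
        response = requests.get(SOLR, params=params)
        response.raise_for_status()
        return response.json()["response"]

    result = fuzzy_search("kjøbenhavn")
    print(result["numFound"], "articles matched")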

As can be seen in the table, depending on the chosen extrapolation algorithm, we will have either several billion or less than one million unique terms in the full corpus. The loosely estimated index size of 1.5TB assumes ~4 billion unique terms. Neither linear nor logarithmic extrapolation seems to yield very realistic numbers with the relatively few samples that we have. We need more data!

One Response to “Hardware guesstimation”

  1. Tom Burton-West Says:

    Hi Toke,

    Once you have more data, take a look at this paper about estimating the number of unique words as a corpus grows:

    S. Evert and M. Baroni (2006a). Testing the extrapolation quality of word frequency models. In Proceedings of Corpus Linguistics 2005, Birmingham, UK. http://sslmit.unibo.it/~baroni/publications/cl2005/cl-052-pap-final.pdf
