As a break from the more tech-heavy postings, this will be a loosely structured post with general observations and a few ideas.
We have kilometers of paper at our library and a vision that says to digitize it all. In principle we have a copy of every Danish newspaper ever printed. In reality we are, well… actually not that far off. We are currently starting up the scanning of 32 million pages, which will be delivered at a rate of 50,000 pages/day over the next few years. The company responsible for scanning is Ninestars. Most of these pages are already on microfilm, so we cheat and have those scanned where possible. This reduces costs tremendously and the end result is, well… actually quite good.
The papers will of course be OCR’ed. Our test scans show a lot of errors in the old papers that use Fraktur, so it will be a challenge to get that to work properly. The next step up from bare-bones OCR is segmentation, where software and/or manual labour groups the text on the pages into articles.
The bits start rolling over our server’s doorstep at the end of this year, so we need to find out what to do with them. This article is a list of great things that other people do with scanned papers and some extra ideas. It completely ignores the surrounding mess of politics, economics and intellectual property rights.
Looking through the many online newspapers, there seems to be some sort of baseline functionality for paper presentation.
- Full text search
It nearly goes without saying these days, but of course we need search. Preferably fuzzy search due to dirty OCR.
Lucene/Solr-powered basic search for 30M pages or 100-200M relatively short articles is fairly simple to implement and cheap in hardware. Fuzzy search is also simple to implement, but somewhat more costly in processing power. We need to investigate how much.
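To make the fuzzy part a bit more concrete, here is a minimal sketch of how plain user input could be turned into a fuzzy query. It assumes the standard Lucene query syntax, where `word~1` matches terms within one character edit. The function name and the rule of keeping very short words exact are our own assumptions and have not been tried against real OCR.

```python
import re

# Characters with special meaning in the Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def fuzzy_query(user_input, max_edits=1):
    """Build a Lucene query where each word tolerates a few OCR errors."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy queries allow 0, 1 or 2 edits")
    terms = []
    for word in user_input.split():
        # Escape special characters so they are treated as literal text.
        escaped = _LUCENE_SPECIAL.sub(r'\\\1', word)
        # Assumption: very short words match far too much when fuzzy.
        if max_edits and len(word) > 3:
            terms.append(f"{escaped}~{max_edits}")
        else:
            terms.append(escaped)
    return " AND ".join(terms)


print(fuzzy_query("Aarhus raadhus"))
```

The cost question remains: every fuzzy term expands to many candidate terms in the index, so the measurement we need is how this behaves on the full corpus.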
- Smooth pan & zoom
Google Maps-like zooming, where one pans and zooms directly with the mouse (or uses fingers on a touch screen), goes a long way towards compensating for a small display area.
OpenSeadragon is cross-browser, open source and easy to use. We created a demo in 2 days.
- Marking of articles and search words
It is not enough to be directed to a page when searching. The found article on the page needs to be marked visually, together with the search words.
OpenSeadragon comes with an API for marking areas using simple CSS-styling.
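The marking itself needs the article's position in the coordinate system of the viewer. As far as we understand OpenSeadragon, a single image is placed so that its width is 1.0 and both axes are measured in that unit. The sketch below converts an OCR bounding box in pixels to such coordinates. The `Rect` type and the assumption that the OCR delivers pixel boxes are ours.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def pixel_to_viewport(box, image_width):
    """Scale a pixel box so that the full image width equals 1.0.

    The y axis is divided by the image *width* as well, so the bottom of
    the page ends up at height/width rather than at 1.0.
    """
    return Rect(
        box.x / image_width,
        box.y / image_width,
        box.width / image_width,
        box.height / image_width,
    )


# Hypothetical article box on a scan that is 4000 pixels wide.
print(pixel_to_viewport(Rect(1000, 500, 2000, 1000), 4000))
```

The same conversion works for the individual search words, as long as the OCR provides a box per word.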
- Known resource lookup by drilling down in base attributes; publisher, paper name, publishing date, distribution cities.
This can be achieved by displaying the result of a facet search for all material.
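A facet search for all material could look roughly like the sketch below, which builds the request parameters for Solr. The parameters `facet`, `facet.field`, `facet.mincount` and `rows` are standard Solr; the field names `publisher`, `paper_name` and `city` are made up for the example.

```python
from urllib.parse import urlencode


def facet_params(fields, query="*:*"):
    """Parameters asking Solr for facet counts only, with no documents."""
    params = [
        ("q", query),              # *:* matches all material
        ("rows", "0"),             # we only want the counts
        ("facet", "true"),
        ("facet.mincount", "1"),   # hide values with no hits
    ]
    # One facet.field entry per attribute the user can drill down in.
    params += [("facet.field", field) for field in fields]
    return params


# Hypothetical field names; the result is appended to the select URL.
print(urlencode(facet_params(["publisher", "paper_name", "city"])))
```

Drilling down is then a matter of adding the chosen value as a filter and asking for the facets again.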
Fortunately all basic functionality can be implemented without too much backend work at Statsbiblioteket, as we are very familiar with search by Lucene/Solr and as experiments show that OpenSeadragon is easy to use. The frontend is as always a wild card, but the basic functionality seems doable.
- Display of relevant image section in search results
This can be seen at The UBC Library.
This does require non-trivial processing by the search-application, but development-wise it is essentially reduced full-page views inserted into the search result.
- Easy correction of OCR
Trove has this tightly integrated. When the user views a page, correction of the OCR is very prominently displayed. They have more than 100,000 corrections per day.
Incorporating user corrections is by no means trivial. It requires substantial development hours, constant support and that the organization accepts data from non-authoritative sources.
- Detailed statistics of corrections
Trove has a hall of fame, easy & public inspection of contributions and more.
Such support for contributions requires a substantial development investment, but it is probably also a significant part of Trove’s success with the concept.
- Meta data feedback
Again, Trove is great at this by providing live statistics on searches/day, contributions, collection size and more.
Live statistics are not very hard to extract from logs or by querying the different services. The extra load on the servers is minimal.
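As an illustration of the log approach, the sketch below counts searches per day. The log format (an ISO timestamp followed by an event name) is invented for the example; our real logs look different and would need their own parsing.

```python
from collections import Counter


def searches_per_day(log_lines):
    """Count SEARCH events per calendar day.

    Assumed line format: "<ISO timestamp> <EVENT> <details...>"
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "SEARCH":
            day = parts[0][:10]  # the YYYY-MM-DD part of the timestamp
            counts[day] += 1
    return dict(counts)


# Hypothetical log excerpt.
sample = [
    "2013-05-02T10:15:00 SEARCH q=raadhus",
    "2013-05-02T10:16:30 VIEW page=17",
    "2013-05-02T11:02:10 SEARCH q=aarhus",
    "2013-05-03T08:00:00 SEARCH q=fraktur",
]
print(searches_per_day(sample))
```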
- Display of article segments alongside image
Again, again, Trove has a highly usable article list view. Upper Hutt has a variant of this that allows for overview of the full paper, coupled with display of OCR for selected articles.
This functionality is trivial in the backend, as it is just a standard search. Likewise, it should be fairly simple to do in the frontend.
- Classification of articles
Upper Hutt has categories such as advertisement, personals and council briefs. These are directly usable for narrowing searches.
While some heuristics can be applied, automatic classification of articles seems error prone. A more realistic approach is classification by humans.
- Full papers view
Upper Hutt displays a zoomed in view centred on the search result, but zooming all the way out displays all pages from the current paper. Cantonale has a great full-screen experience using the same idea.
This is easily achieved with OpenSeadragon.
- Easy download
Nasjonalbiblioteket provides a small search field inside of their viewer. Entering search words marks them on the displayed pages and makes it possible to pan the view between them.
This functionality is an extension of displaying the relevant part of the scanned page on the first view.
- Simple image adjustment
D.A.D.D. has interactive adjustment of brightness and contrast, which helps tremendously with poor scans.
In principle, this is very easy with CSS filters, which unfortunately have very poor browser support. It seems that it can be done with OpenSeadragon (see DICOM), but this is something that must be investigated.
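Whatever ends up doing the work in the browser, the arithmetic behind brightness and contrast is simple. The sketch below shows one common formulation on 8-bit greyscale values: contrast stretches the values around the middle grey (128) and brightness shifts them. It only illustrates the calculation and says nothing about how it would be wired into OpenSeadragon.

```python
def adjust(pixels, brightness=0, contrast=1.0):
    """Apply brightness and contrast to 8-bit greyscale values (0-255)."""
    adjusted = []
    for value in pixels:
        # Stretch around middle grey, then shift by the brightness offset.
        scaled = (value - 128) * contrast + 128 + brightness
        # Clamp to the valid range so very dark or bright pixels saturate.
        adjusted.append(max(0, min(255, round(scaled))))
    return adjusted


# A washed-out scan line: doubling the contrast pulls the values apart.
print(adjust([100, 128, 150], contrast=2.0))
```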
- Visual overview of the newspapers’ lifespans
Metelvin has a very nice Gantt-like chart.
This is a semi-static resource, which could either be generated automatically or maintained by hand. No magic, just a neat overview.
- Leaf through simulation
Gallica simulates the experience of reading a physical paper. Besides the visual transition, easy page turning seems very usable.
This is doable with OpenSeadragon and is more a matter of how to avoid cluttering the interface with extra functionality.
Most of the great stuff requires some extra work, but as most of it can be developed independently of the rest, this seems quite approachable. The really big deal is community involvement, which is also highly interesting from a data viewpoint.
- Dynamic search-driven time drill down
Acervo shows a visual timeline as part of the search result and Villanova has something like it. Making it easy to zoom in on ranges, narrow to weekends or first weekday in all months would improve this.
Extracting data for visualization of time and providing drill down is driven by faceting. For the backend, this is trivial. All the work is in the frontend.
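Narrowing to weekends or to the first weekday of each month becomes plain faceting if the calendar properties are stored as fields when the material is indexed. The sketch below derives such fields from the publishing date. The field names are hypothetical, and "first weekday" is read as the first Monday to Friday date in the month, which is our interpretation.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def calendar_fields(published):
    """Derive facet-friendly calendar fields from a publishing date."""
    # Find the first Monday-Friday date in the month of publication.
    first_weekday = date(published.year, published.month, 1)
    while first_weekday.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        first_weekday += timedelta(days=1)
    return {
        "year": published.year,
        "month": published.month,
        "weekday": DAY_NAMES[published.weekday()],
        "is_weekend": published.weekday() >= 5,
        "first_weekday_of_month": published == first_weekday,
    }


print(calendar_fields(date(2023, 1, 2)))
```

With fields like these in the index, the frontend only has to offer them as filters next to the timeline.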
- Related articles
If the user finds an article describing the building of a new town hall in Aarhus, chances are that the user would also be interested in other articles about new buildings in Aarhus around the same time. Or maybe articles about Arne Jacobsen, one of the two architects. As newspapers are extremely current on their date of publishing, high quality clustering with a strong temporal bias seems probable.
We do not have much experience with soft clustering. It is very hard to say how difficult it would be for us to make.
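To give an idea of what a temporal bias could mean in practice, here is a toy scoring function. It is not clustering, only pairwise ranking: the word overlap between two articles (Jaccard similarity) is multiplied by a factor that halves for every `half_life_days` between their publishing dates. The function, the tokenisation and the default of 30 days are all assumptions picked for the example.

```python
from datetime import date


def relatedness(tokens_a, date_a, tokens_b, date_b, half_life_days=30.0):
    """Word overlap between two articles, damped by distance in time."""
    words_a, words_b = set(tokens_a), set(tokens_b)
    if not words_a or not words_b:
        return 0.0
    # Jaccard similarity: shared words divided by all distinct words.
    overlap = len(words_a & words_b) / len(words_a | words_b)
    days_apart = abs((date_a - date_b).days)
    # The score halves for every half_life_days between the articles.
    return overlap * 0.5 ** (days_apart / half_life_days)


town_hall = ["new", "town", "hall", "aarhus"]
architect = ["arne", "jacobsen", "town", "hall", "aarhus"]
print(relatedness(town_hall, date(1941, 6, 2), architect, date(1941, 6, 9)))
```

A real solution would need better text similarity than raw word overlap, especially with dirty OCR.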
- Different scaling of text and illustrations
Photos in the older newspapers are greyscale, represented as monochrome raster (black dots of different size). The quality suffers a lot from the process of going from original -> raster -> microfilm -> scanned raw image -> scaled and corrected presentation image. Especially the last step, the generation of presentation images, can lead to poor images. The high contrast and sharpness that are so beneficial for text tend to work against photos, where a softer approach is preferable. By processing photo- and text-areas differently, the optimal strategy can be chosen for both cases.
This is fairly simple to accomplish if the location of images is known. If not, a reverse strategy of “everything that is not text must be images” can be used. As the OCR is not 100% correct, that will lead to some text areas being processed as images, with lower readability as a result. We do not currently know if illustrations will be marked as such in the OCR.
- Achievements and similar for community contributors
Building further on Trove’s excellent community support, the concept of achievements might work as a motivator. Achievements are small icons representing “100 lines corrected”, “1000 lines corrected”, “First tag”, “100 lines corrected in a day”, “5 different newspapers corrected”, “One paper fully corrected” and so on.
For achievements to work, they need to be presented the instant that a contributor satisfies the requirement. This means that they must be tightly coupled to the underlying contributor structure.
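One way to get the instant feedback is to check for crossed milestones in the same operation that stores a correction. The sketch below does this for the line count achievements mentioned above. The function and the idea of passing the count before and after are our own illustration and not how Trove does it.

```python
# Milestones taken from the examples above: (threshold, achievement name).
LINE_MILESTONES = [
    (100, "100 lines corrected"),
    (1000, "1000 lines corrected"),
]


def newly_earned(lines_before, lines_after):
    """Return achievements crossed by going from lines_before to lines_after."""
    return [
        name
        for threshold, name in LINE_MILESTONES
        if lines_before < threshold <= lines_after
    ]


# A contributor with 95 corrected lines saves a page with 10 more.
print(newly_earned(95, 105))
```

Achievements such as “100 lines corrected in a day” need more state than a single counter, which is where the tight coupling to the contributor structure comes in.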