As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the Danish web (the *.dk domains), digitize public Danish television, rip all Danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (two geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.
Except that access to most of the material is rather limited and that search is … well, pretty much non-existent.
Such things tend to change over time, preceded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to provide some concrete arguments, we were asked to try to index some of the harvested web material and build a search demo where web material was presented together with our normal material (books, CDs, articles et al.).
The harvested web material is stored in ARC files, so the obvious choice for a quick test was NutchWAX. Setup was easy, some 100 million documents were indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the “is it even feasible to do this?“-question.
The “but does it make sense to do integrated search for such different data sources as web and library books?“-question could not be answered by this, so naturally we had to hack something together with Summa, our precious hammer. Due to other higher-priority assignments, we only had about a week to get it working, so corners were cut where possible. Using the ARC-reader from Heritrix and the Tika toolkit for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web material and an aggregator handled the integration with our standard library index.
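To give a feel for what the Heritrix ARC-reader actually hands to Tika, here is a minimal, self-contained sketch of the ARC v1 record layout: a space-separated header line (URL, IP address, 14-digit harvest date, content-type, payload length) followed by that many bytes of payload. This is an illustrative toy parser, not the Heritrix reader, and the sample data is fabricated.

```python
# Minimal sketch of reading ARC v1 records (the format Heritrix writes).
# A real pipeline would use Heritrix's ARC-reader and hand each record's
# payload to Tika for content detection and text extraction; this toy
# parser only illustrates the record layout. (A real ARC file also starts
# with a filedesc:// version record, skipped here for brevity.)
import io

def read_arc_records(stream):
    """Yield (header_dict, payload_bytes) from an uncompressed ARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        line = line.strip()
        if not line:
            continue  # records are separated by blank lines
        url, ip, date, mime, length = line.decode("utf-8").split(" ")
        payload = stream.read(int(length))
        yield {"url": url, "ip": ip, "date": date, "mime": mime}, payload

# Tiny fabricated example with two records:
sample = (
    b"http://example.dk/ 192.0.2.1 20090101120000 text/html 19\n"
    b"<html>hello</html>\n\n"
    b"http://example.dk/a 192.0.2.1 20090101120005 text/plain 4\n"
    b"hej\n\n"
)
for header, payload in read_arc_records(io.BytesIO(sample)):
    print(header["url"], header["mime"], len(payload))
# prints:
# http://example.dk/ text/html 19
# http://example.dk/a text/plain 4
```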
It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for the different sources would be needed for a real application) and the faceting system clearly helped give an overview of material types & sources while providing an easy means of temporal navigation in the search result: limiting searches to a specific period of time is quite useful for investigating the media handling of major events.
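Temporal navigation of this kind ultimately boils down to filtering on the harvest timestamp that every ARC record carries. A minimal sketch, with illustrative field names (not Summa's actual schema):

```python
# Temporal navigation in miniature: limit a result set to a harvest-date
# range. The field names ("url", "date") are hypothetical, not Summa's
# schema; ARC timestamps are YYYYMMDDhhmmss, so plain lexicographic
# comparison on the strings gives correct chronological ordering.
hits = [
    {"url": "http://example.dk/budget", "date": "20010312093000"},
    {"url": "http://example.dk/valg",   "date": "20051108140000"},
    {"url": "http://example.dk/nytaar", "date": "20081231235900"},
]

def limit_to_period(hits, start, end):
    """Keep hits whose 14-digit timestamp falls within [start, end]."""
    return [h for h in hits if start <= h["date"] <= end]

print(limit_to_period(hits, "20050101000000", "20061231235959"))
# → [{'url': 'http://example.dk/valg', 'date': '20051108140000'}]
```

A real faceting engine would of course precompute the date buckets rather than scan the hits, but the user-facing operation is the same range restriction.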
So what’s the dark part? Well, legislation. As always. That and money. Harvested web material is sensitive and only legally accessible to a selected few professors. On top of that, showing snippets from harvested web pages seems – at the moment – to require compensating the content owners under EU law. Opening up all the material at once will probably not happen in the foreseeable future.
Luckily, we don’t need to do everything at once. If we limit the publicly accessible index to websites from the government and companies, it should be legal to show the search results and the stored versions (hello continuity). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.
Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.
A screenshot, you say? Why, yes, of course. We present this super-cool bling-bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to censor out some sensitive information and the fact that indexing time was capped at half a day to make the deadline.