Category Archives: Uncategorized

SolrWayback Machine

Another ‘google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive: Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (WARC files). The Danish Internet Archive has over 20 billion harvested … Continue reading

Posted in Uncategorized | 1 Comment

juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives: The New York Public Library has a very visually pleasing presentation of public domain digitizations, but with a … Continue reading

Posted in Uncategorized | Leave a comment

Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for the Digital Humanities in the Nordic Countries 2nd Conference (http://dhn2017.eu/). In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] uses OCR (optical character recognition) from scanned pages … Continue reading

Posted in Uncategorized | 1 Comment

70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, … Continue reading

Posted in Hacking, Low-level, Performance, Solr, Statsbiblioteket, Uncategorized | 6 Comments

Prototype demo for OCR postfix in Danish Newspapers

In the Danish Newspaper Archive you can search and view 25 million newspaper pages. The search engine uses OCR (optical character recognition) from scanned pages, but often the software reading the text from the scanned images makes reading … Continue reading

Posted in Uncategorized | Leave a comment

2D visualization of high dimensional word embeddings

In this blog post I try to devise a method for a computer to read a text, analyse the characters, and then make a 2D visualization of the similarity of the characters. To achieve this I am using the … Continue reading

Posted in Uncategorized | 1 Comment

CDX musings

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry. There is currently some debate on CDX, used by the Wayback Engine, Open Wayback and other web-archive-oriented tools, such as … Continue reading

Posted in Uncategorized | Leave a comment