Category Archives: Uncategorized

Prebuilt Big Data Word2Vec dictionaries

Prebuilt and trained Word2Vec dictionaries ready for use. Two different prebuilt big data Word2Vec dictionaries have been added to LOAR (Library Open Access Repository) for download. These dictionaries are built from the text of 55,000 e-books from Project Gutenberg … Continue reading

Posted in Uncategorized | Leave a comment

SolrWayback software bundle has been released

The SolrWayback software bundle can be used to search and play back archived web pages in WARC format. It is an out-of-the-box solution with an indexing workflow, Solr, a Tomcat web server, and a free-text search interface with playback functionality. … Continue reading

Posted in Uncategorized | Leave a comment

SolrWayback Machine

Another ‘Google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive: Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (WARC files). The Danish Internet Archive has over 20 billion harvested … Continue reading

Posted in Uncategorized | 1 Comment

juxta – image collage with metadata

Creating large collages of images to give a bird’s-eye view of a collection seems to be gaining traction. Two recent initiatives: The New York Public Library has a very visually pleasing presentation of public domain digitizations, but with a … Continue reading

Posted in Uncategorized | Leave a comment

Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for the Digital Humanities in the Nordic Countries 2nd Conference (http://dhn2017.eu/). In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] uses OCR (optical character recognition) from scanned pages … Continue reading

Posted in Uncategorized | 2 Comments

70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, … Continue reading

Posted in Hacking, Low-level, Performance, Solr, Statsbiblioteket, Uncategorized | 6 Comments

Prototype demo for OCR postfix in Danish Newspapers

In the Danish Newspaper Archive you can search and view 25 million newspaper pages. The search engine uses OCR (optical character recognition) from scanned pages, but often the software reading the text from the scanned images makes reading … Continue reading

Posted in Uncategorized | Leave a comment