Another ‘Google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive's Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (WARC files). The Danish Internet Archive has over 20 billion harvested web objects and takes up 1 petabyte of storage.
The SolrWayback engine requires that you have indexed the WARC files using the warc-indexer tool from the British Library (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer).
It is quite fast and comes with some additional features as well:
- Image search similar to Google Images
- Raw download of any harvested resource from the binary ARC/WARC file.
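Since everything is backed by a Solr index, a feature like image search boils down to a Solr query. The snippet below is only a rough sketch of how such a query URL could be built; it is not taken from the SolrWayback code. The core URL is hypothetical, and the field names (`content_type_norm`, `url`, `crawl_date`) are assumptions about the schema produced by the warc-indexer, so check them against your own index.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL; adjust to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/netarchive/select"


def build_image_search_url(text, rows=20, base_url=SOLR_SELECT_URL):
    """Build a Solr select URL that looks for harvested images matching `text`."""
    params = {
        # Free-text query against the default search field.
        "q": text,
        # Assumed field name and value for restricting results to images.
        "fq": "content_type_norm:image",
        # Assumed field names for the original URL and the harvest time.
        "fl": "url,crawl_date",
        "rows": rows,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the URL; fetching it requires a running Solr instance.
    print(build_image_search_url("copenhagen harbour"))
```

The function only builds the URL, so it can be tried out without a running Solr server; sending the request and rendering thumbnails is left out.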
Unfortunately, the collection is not available to the public, so I cannot show you the demo. But here are a few pictures from the SolrWayback Machine.
SolrWayback at GitHub: https://github.com/netarchivesuite/solrwayback/