SolrWayback Machine

Another ‘Google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive's Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (Warc files). The Danish Internet Archive holds over 20 billion harvested web objects and takes up 1 petabyte of storage.

The SolrWayback engine requires that you have indexed the Warc files into Solr using the Warc-indexer tool from the British Library (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer).
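To give an idea of what the index makes possible, here is a minimal sketch of building a Solr select URL for a text search restricted to images. The base URL, the core name "netarchive" and the field name content_type_norm are assumptions for illustration; check the schema that your version of the Warc-indexer produces before relying on them. This is not taken from the SolrWayback source code.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, content_type=None, rows=10):
    """Build a Solr /select URL for a free-text search.

    base_url is the URL of the Solr core (hypothetical here).
    content_type, if given, is applied as a filter query on the
    field 'content_type_norm' (an assumed field name).
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if content_type is not None:
        params.append(("fq", f"content_type_norm:{content_type}"))
    # urlencode takes care of escaping special characters such as ':'
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr core; adjust to your own setup.
    print(build_solr_query("http://localhost:8983/solr/netarchive",
                           "tivoli", content_type="image"))
```

The resulting URL can be fetched with any HTTP client; the JSON response lists the matching documents.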

It is quite fast and comes with some additional features as well:

  •  Image search similar to Google Images.
  •  Link graphs showing links (ingoing/outgoing) for domains, using the D3 JavaScript framework.
  •  Raw download of any harvested resource from the binary Arc/Warc file.
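The raw download boils down to reading one record out of a Warc file at a known position. Below is a rough sketch of that idea, assuming the file is a .warc.gz in which each record is its own gzip member, and that the byte offset of the record is known (for example from the index). The function names are hypothetical and this is not the actual SolrWayback implementation.

```python
import zlib


def read_gzip_member(path, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at 'offset'."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # Stop as soon as the end of this member is reached; bytes
        # belonging to the next record are left in decomp.unused_data.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)


def split_warc_record(record):
    """Split a raw Warc record into (headers, content block)."""
    header_block, _, body = record.partition(b"\r\n\r\n")
    headers = {}
    # The first line is the version line (e.g. WARC/1.0), so skip it.
    for line in header_block.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        key = name.strip().lower().decode("ascii")
        headers[key] = value.strip().decode("utf-8", "replace")
    # Content-Length gives the size of the block that follows the headers.
    length = int(headers.get("content-length", len(body)))
    return headers, body[:length]
```

For a record of type "response", the content block still starts with the HTTP status line and headers, so one more split on the first blank line is needed to get to the resource itself.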

Unfortunately, the collection is not available to the public, so I cannot show you the demo. But here are a few pictures from the SolrWayback Machine.

solrwayback_demo

solrwayback_linkgraph.png

SolrWayback at GitHub: https://github.com/netarchivesuite/solrwayback/

About thomasegense

Thomas Egense, mathematician. Works at the State and University Library, Denmark.