An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a recent frontend visualisation widget we have made.
From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days to enable us to visualise the data. Futhermore the data is given an activity level from 0-4.
The high-level overview seen below is the year-month graph. Each cell is coloured based on the activity level relative to the number of harvests in the most active month. For now we use a linear calculation so gray means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100% of the most active month. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.
We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.
In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011, for this blog post).
There are both advantages and disadvantages to using the browser-based data calculation. One of the main advantages is a very portable frontend application. It can be used with any backend application that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. Drawbacks to this approach is, of course, the scalability. Currently the application processes 25.000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad core Intel i5).
The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and it is included in a single script tag, including styles.
Ideas for further development.
We would like to expand this to also include both:
- multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time, e.g. the URL http://pol.dk, http://www.pol.dk and https://politiken.dk could be used for the danish newspaper Politiken.
- domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.