For a change of pace: A not too technical tale of my recent visit to England.
The people behind IIPC Technical Training Workshop – London 2015 had invited yours truly as a speaker and participant in the technical training. IIPC stands for International Internet Preservation Consortium and I was to talk about using Solr for indexing and searching preserved Internet resources. That sounded interesting and Statsbiblioteket encourages interinstitutional collaboration, so the invitation was gladly accepted. Some time passed and British Library asked if I might consider arriving a few days early and visiting their IT development department. Well played, BL, well played.
I kid. For those not in the know, British Library made the core software we use for our Net Archive indexing project and we are very thankful for that. Unfortunately they do have some performance problems. Spending a few days, primarily talking about how to get their setup to work better, was just reciprocal altruism working. Besides, it turned out to be a learning experience for both sides.
At British Library, Boston Spa
The current net archive oriented Solr setups at British Library use SolrCloud with live indexes on machines with spinning drives (aka harddisks) and a low amount of RAM relative to index size. At Statsbiblioteket, our experience tells us that such setups generally have very poor performance. Gil Hoggarth and I discussed Solr performance at length and he was tenacious in exploring every option available. Andy Jackson partook in most of the debates. Log file inspections and previous measurements from the Statsbiblioteket setups seemed to sway them in favour of different base hardware, or to be specific: Solid State Drives. The open question is how much such a switch would help, or whether it would be a better investment to increase the amount of free memory for caching.
- A comparative analysis of performance with spinning drives vs. SSDs for multi-TB Solr indexes on machines with low memory would help other institutions tremendously, when planning and designing indexing solutions for net archives.
- A comparative analysis of performance with different amounts of free memory for caching, as a fraction of index size, for both spinning drives and SSDs, would be beneficial on a broader level; this would give an idea of how to optimize bang-for-the-buck.
Logistically the indexes at British Library are quite different from the index at Statsbiblioteket: They follow the standard Solr recommendation and treat all shards as a single index, both for indexing and search. At Statsbiblioteket, shards are built separately and only treated as a whole index at search time. The live indexes at British Library have some downsides, namely re-indexing challenges, distributed indexing logistics overhead and higher hardware requirements. They also have positive features, primarily homogeneous shards and the ability to update individual documents. The updating of individual documents is very useful for tracking meta-data for resources that are harvested at different times, but have unchanged content. Tracking of such content, also called duplicate handling, is a problem we have not yet considered in depth at Statsbiblioteket. One of the challenges of switching to static indexes is thus:
- When a resource is harvested multiple times without the content changing, it should be indexed in such a way that all retrieval dates can be extracted and such that the latest (and/or the earliest?) harvest date can be used for sorting, grouping and/or faceting.
One discussed solution is to add a document for each harvest date and use Solr’s grouping and faceting features to deliver the required results. The details are a bit fluffy as the requirements are not strictly defined.
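As a rough illustration of the one-document-per-harvest idea, here is a minimal Python sketch. It is not what either library runs: the field names (`url`, `content_hash`, `harvest_date`) and the id scheme are assumptions made up for the example. The request parameters are Solr's standard result grouping options, used here to collapse all harvests of an unchanged resource into one group, ordered by harvest date.

```python
from urllib.parse import urlencode


def harvest_documents(url, content_hash, harvest_dates):
    """Create one index document per harvest of an unchanged resource."""
    return [
        {
            # Hypothetical id scheme: URL plus harvest date keeps ids unique.
            "id": f"{url}@{date}",
            "url": url,
            "content_hash": content_hash,  # hypothetical field name
            "harvest_date": date,          # hypothetical field name
        }
        for date in harvest_dates
    ]


def grouped_query_params(query, newest_first=True, per_group=1):
    """Build Solr request parameters that group harvests by content."""
    order = "desc" if newest_first else "asc"
    return {
        "q": query,
        "group": "true",
        "group.field": "content_hash",
        # Order the documents inside each group by harvest date.
        "group.sort": f"harvest_date {order}",
        # How many harvests to return per group; raise it to list all dates.
        "group.limit": per_group,
        # Ask Solr to report the number of distinct groups as well.
        "group.ngroups": "true",
    }


if __name__ == "__main__":
    docs = harvest_documents(
        "http://example.com/",
        "sha1:abc",
        ["2014-01-01T00:00:00Z", "2015-01-01T00:00:00Z"],
    )
    print(len(docs), "documents to index")
    print(urlencode(grouped_query_params("solr")))
```

With `newest_first=False` the earliest harvest represents the group instead, which covers the "latest and/or earliest" question without changing the index.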
At the IIPC Technical Training Workshop, London 2015
The three pillars of the workshop were harvesting, presentation and discovery, with the prevalent tools being Heritrix, Wayback and Solr. I am a newbie in two thirds of this world, so my outsider thoughts will focus on discovery. Day one was filled with presentations, with my Scaling Net Archive Indexing and Search as the last one. Days two and three were hands-on with a lot of discussions.
As opposed to the web archive specific tools Heritrix and Wayback, Solr is a general purpose search engine: There is not yet a firmly established way of using Solr to index and search net archive material, although the work from UKWA is a very promising candidate. Judging by the questions asked at the workshop, large scale full-text search is relatively new in the net archive world and as such the community lacks collective experience.
Two large problems of indexing net archive material are analysis and scaling. As stated, UKWA has the analysis part well in hand. Scaling is another matter: Net archives typically contain billions of documents, many of them with a non-trivial amount of indexable data (webpages, PDFs, DOCs etc). Search responses ideally involve grouping or faceting, which requires markedly more resources than simple search. Fortunately, at least from a resource viewpoint, most countries do not allow harvested material to be made available to the general public: The number of users and thus concurrent requests tends to be very low.
General recommendations for performant Solr systems tend to be geared towards small indexes or high throughput, minimizing the latency and maximizing the number of requests that can be processed by each instance. Down to Earth, the bottleneck tends to be random reads from the underlying storage, easily remedied by adding copious amounts of RAM for caching. While the advice arguably scales to net archive indexes in the multiple TB-range, the cost of terabytes of RAM, as well as the number of machines needed to hold them, is often prohibitive. Bearing in mind that the typical user groups on net archives consist of very few people, the part about maximizing the number of supported requests is overkill. With net archives as outliers in the Solr world, there is very little existing shared experience to provide general recommendations.
- As hardware cost is a large fraction of the overall cost of doing net archive search, in-depth descriptions of setups are very valuable to the community.
Measurements from British Library as well as Statsbiblioteket show that faceting on high cardinality fields is a resource hog when using SolrCloud. This is problematic for exploratory use of the index. While it can be mitigated with more hardware or software optimization, switching to heuristic counting holds the promise of very large speed-ups.
- The performance benefits and the cost in precision of approximate search results should be investigated further. This area is not well-explored in Solr and mostly relies on custom implementations.
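To make the trade-off concrete, one simple form of heuristic counting is sampling: count facet values in every n-th matching document and scale the result up. The sketch below is plain Python over an in-memory list, not Solr code and not the implementation used at either library; `sampled_facet_counts` and the `domain` field are hypothetical names chosen for the example.

```python
from collections import Counter


def exact_facet_counts(docs, field):
    """Count every value of `field` across all documents."""
    return Counter(doc[field] for doc in docs)


def sampled_facet_counts(docs, field, step):
    """Estimate facet counts from every step-th document, scaled by step."""
    sample = Counter(doc[field] for doc in docs[::step])
    return {value: count * step for value, count in sample.items()}


if __name__ == "__main__":
    # Toy corpus: every fourth document comes from b.dk, the rest from a.dk.
    docs = [{"domain": "b.dk" if i % 4 == 0 else "a.dk"} for i in range(1000)]
    print("exact: ", dict(exact_facet_counts(docs, "domain")))
    print("approx:", sampled_facet_counts(docs, "domain", step=7))
```

The sampled version touches roughly 1/step of the documents, which is where the speed-up would come from; the price is an error that grows for rare values. Systematic sampling like this can also be badly biased if the document order correlates with the field, so a real investigation would have to look at random sampling as well.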
On the flipside of fast exploratory access is the extraction of large result sets for further analysis. SolrCloud does not scale for certain operations, such as deep paging within facets and counting of unique groups. Other operations, such as percentiles in the AnalyticsComponent, are not currently possible. As the alternative to using the index tends to be very heavy Hadoop processing of the raw corpus, this is an area worth investing in.
- The limits of result set extractions should be expanded and alternative strategies, such as heuristic approximation and per-shard processing with external aggregation, should be attempted.
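Per-shard processing with external aggregation could look roughly like the following. This is a sketch under stated assumptions: each shard is queried on its own and hands back either its local facet counts or its local set of group keys, and a small client-side program merges them. The function names and the `domain` field are hypothetical, and the shards are simulated with in-memory lists.

```python
from collections import Counter


def shard_facet_counts(shard_docs, field):
    """Stand-in for a facet request answered by a single shard."""
    return Counter(doc[field] for doc in shard_docs)


def merge_facet_counts(per_shard_counts, top_n=None):
    """Sum the per-shard counts and return the top_n (value, count) pairs."""
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)
    return total.most_common(top_n)


def count_unique_groups(per_shard_keys):
    """Count distinct group keys across shards by taking the set union."""
    unique = set()
    for keys in per_shard_keys:
        unique.update(keys)
    return len(unique)


if __name__ == "__main__":
    shards = [
        [{"domain": "a.dk"}, {"domain": "b.dk"}],
        [{"domain": "a.dk"}, {"domain": "c.dk"}],
    ]
    counts = [shard_facet_counts(shard, "domain") for shard in shards]
    print(merge_facet_counts(counts, top_n=2))
    print(count_unique_groups({d["domain"] for d in shard} for shard in shards))
```

The merged counts are exact only if every shard returns its full count list rather than a truncated top-N, and the exact union needs memory proportional to the number of distinct keys. For very high cardinality, that is where the heuristic approximations mentioned above would come in.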
On a personal note
Visiting British Library and attending the IIPC workshop was a blast. Being embedded in tech talk with intelligent people for 5 days was exhausting and very fulfilling. Thank you all for the hospitality and for pushing back when my claims sounded outrageous.