Analysis of the material in our Net Archive index is challenging for many reasons. One of them is agreeing on what we’re looking at: It is rarely the full amount of data; our researchers wants to define subsets. Subsets can be sliced in different ways: At (W)ARC level, as manually selection of specific URLs harvested at specific times, as a graph of links etc. They can also be specified with Solr. Using Solr, there are two common and technically distinct ways of specifying subsets:
- Subset definition by simple boolean expressions: “All material from the domain example.com harvested in 2011” or “All PDF files that mentions ‘disneyland after dark’ and has a link to domain:d-a-d.dk”.
Such subsets are extremely easy to work with in Solr, as they are just filters: They are very fast and works fully with faceting, sorting and grouping.
- Subset definition by weighting: “One instance of all URLs, de-duplicated by taking the one as close to June 1., 2011 as possible”.
For some functionality, such as extracting simple top-X results, this can be handled by grouping in Solr. Unfortunately this scales poorly with faceting, groups within groups is not a Solr feature and some functionality, such as counting the number of distinct groups (the result set size) is not possible to do reliably with SolrCloud.
So what can we do about the pesky subset-by-weighting? The idea is to add support for exhaustive ID filters, where exhaustive means potentially all documents at single-document granularity level. With such a filter we could define arbitrary subsets on an existing index, not even constrained by Solr’s own powers of expression. It looks a bit daunting with billions of documents, but given some constraints it seems doable.
Creating the document list
The first problem is to determine which documents should be in the filter. If the documents are defined outside of Solr, this could just be a list of [URL, timestamp] or similar parameters uniquely identifying the documents. If the documents are to be aggregated from the SolrCloud, some manual work might be needed.
Let’s see how to do it for “One instance of all URLs, de-duplicated by taking the one as close to June 1., 2011 as possible”. In a single-shard setup it might be possible to do it with grouping on URL (1 entry/group) and sorting by distance. As stated before, this scales badly or not at all with SolrCloud and things like faceting or paging through result sets. But the principle itself is simple:
- For each shard, extract all the documents as tuples of [URL, date, ID], sorted by URL. This is very easily done with deep paging. If further restrictions, such as “only documents from 2010-2012” are needed, this is done by applying a filter before extraction.
- Process the resulting lists individually and extend the tuples with shard-ID to [URL, date, ID, shardID].
- Merge-sort the list of tuples, primarily by URL, secondarily by closeness to 2011-06-01.
- Prune the aggregated list by only keeping the first entry for each URL.
- Split the aggregated list into shard-specific lists, using the shard-ID part of the tuples, keeping only the IDs.
The astute reader will have noticed that steps 1-5 can be performed in a streaming manner: The only limit to the size of the resulting ID-lists is storage space.
At this point we have a list of entries uniquely defining documents, for each shard. We need a Solr component that can
- Take a user-assigned filter-ID and an URL to the list of document-definitions.
- Create an empty docset/bitmap, the same kind a standard filter uses.
- For each entry in the list of document-definitions, resolve the internal docID for the document and set that bit in the bitmap.
- Store the resulting bitmap for later usage, just like a standard filter, with the user-assigned filter-ID as key.
- When the user issues queries specifying filter-IDs, the docID bitmaps are fetched and applied as a standard Solr filters, with the same minimal processing overhead.
There is a lot of potential for optimization here:
- It would be possible for each line in the list of document-definitions to be viewed as a full-blown Solr query, seeing the whole list as a giant boolean OR query. Lines could be processed in batches by enclosing each line in the batch in parentheses and interleaving with OR, before processing it as a plain query.
- Treating each line as a query allows the filter to be specified any the user wants: While the granularity of the list goes down to document level, it also goes up to the full index.
- It would be possible to see each line as a term that must match a single field (normally the ID field). Again requests could be batched, this time using the very efficient TermsQuery.
- For ID-based lists, pre-sorting the list would open up for very efficient processing: The Solr query mechanism could be bypassed and the internal structure for the ID-field could be traversed sequentially.
Once again this can be streamed (except for the sorting idea), with no limit on the size of the list of document-definitions. Theoretically this could be tied to the generation of the corpus, avoiding temporary storage of the lists. But that would be a later project.
What we get
- The ability for researchers to specify shared subsets of any size at document granularity, without changing the indexes.
- Besides relevance ranking, using such filters would be equivalent to building a whole new index from the defined subset.
- High initial cost of defining and creating filters, but extremely low subsequent overhead of using them.
Creating the document lists would likely be very heavy, but as the premise is to create subsets of the corpus, this would normally be done once per subset defined by the researchers.
Creating the filter itself would likely be less heavy, but the processing of hundreds of millions of tiny queries, even with batching, is measured in minutes or hours. Due to the nature of filters, this step must be repeated if the index changes, so exhaustive filters would only be feasible with a rarely-updated index.Some of Solr’s build-in functionality, such as the distance to a given date for the example case, would have to be duplicated in the external merge-sorter.
Constraining searches by large lists of IDs is not a new idea.
- Defining the IDs at query time is not feasible at net archive scale; even with bit packing a request could potentially be measured in gigabytes.
- Treating the task as a security filter problem holds promises, at least for the filtering part.
- Adding a token to the documents in a subset would work, but it makes experimentation very heavy and (of course) requires the indexes to be updated.
- The Terms Filter Lookup in Elasticsearch seems quite like the persistent filter described above. As does SOLR-1715, which unfortunately is talk-only at this point.
- Update 20140205: The example of generating the document list seems to fit extremely well with Heliosearch’s Streaming Aggregation, which will hopefully be part of Solr with SOLR-7082.