This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry.
There is currently some debate on CDX, used by the Wayback Engine, Open Wayback and other web archive oriented tools, such as Warcbase. A CDX Server API is being formulated, as is the CDX format itself. Inspired by a post by Kristinn Sigurðsson, here comes another input.
CDX Server API
There is an ongoing use case-centric discussion of needed features for a CDX API. Compared to that, the CDX Server API – BETA seems a bit random. For example: A feature such as regexp-matching on URLs can be very heavy on the backend and open op for easy denial of service (intentional as well as unintentional). It should only be in the API if it is of high priority. One way of handling all wishes is to define a core API with the bare necessities and add extra functionality as API extensions. What is needed and what is wanted?
The same weighing problem can be seen for the fields required for the server. Kristinn discusses CDX as format and boils the core fields down to canonicalized URL, timestamp, original URL, digest, record type, WARC filename and WARC offset. Everything else is optional. This approach matches the core vs. extensions functionality division.
With optional features and optional fields, a CDX server should have some mechanism for stating what is possible.
URL canonicalization and digest
The essential lookup field is the canonicalised URL. Unfortunately that is also the least formalised, which is really bad from an interoperability point of view. When the CDX Server API is formalised, a strict schema for URL canonicalisation is needed.
Likewise, the digest format needs to be fixed, to allow for cross-institution lookups. As the digests do not need to be cryptographically secure, the algorithm chosen (hopefully) does not become obsolete with age.
It would be possible to allow for variations of both canonicalisation and digest, but in that case it would be as extensions rather than core.
CDX (external) format
CDX can be seen as a way of representing a corpus, as discussed on the RESAW workshop in december 2015.
- From a shared external format perspective, tool-specific requirements such as sorting of entries or order of fields are secondary or even irrelevant.
- Web archives tend to contain non-trivial amounts of entries, so the size of a CDX matters.
With this in mind, the pure minimal amount of fields would be something like original URL, timestamp and digest. The rest is a matter of expanding the data from the WARC files. On a more practical level, having the WARC filename and WARC offset is probably a good idea.
The thing not to require is the canonicalized URL: It is redundant, as it can be generated directly from the original URL, and it unnecessarily freezes the canonicalization method to the CDX format.
Allowing for optional extra fields after the core is again pragmatic. JSON is not the most compact format when dealing with tables of data, but it has the flexibility of offering entry-specific fields. CDXJ deals with this, although it does not specify the possible keys inside of the JSON blocks, meaning that the full CDX file has to be iterated to get those keys. There is also a problem of simple key:value JSON entries, which can be generically processed, vs. complex nested JSON structures, which requires implementation-specific code.
CDX (internal) format
Having a canonicalized and SURTed URL as the primary field and having the full CDX file sorted is an optimization towards specific tools. Kris touches lightly on the problem with this by suggesting that the record type might be better positioned as the secondary field (as opposed to timestamp) in the CDX format.
It follows easily that the optimal order or even representation of fields depends on tools as well as use case. But how the tools handle CDX data internally really does not matter as long as they expose the CDX Server API correctly and allows for export in an external CDX format. The external format should not be dictated by internal use!
CDX (internal vs. external) format
CDX is not a real format yet, but tools do exist and they expects some loosely defined common representation to work together. As such it is worth considering if some of the traits of current CDXes should spill over to an external format. Primarily that would be the loosely defined canonicalized URL as first field and the sorted nature of the files. In practice that would mean a near doubling of file size due to the redundant URL representation.
The sorted nature of current CDX files has two important implications: Calculating the intersections between two files is easy and lookup on primary key (canonicalised URL) can algorithmically be done in O(log n) time using binary search.
In reality, simple binary search works poorly on large datasets, due to the lack of locality. This is worsened by slow storage types such as spinning drives and/or networked storage. There are a lot of tricks to remedy this, from building indexes to re-ordering the entries. The shared theme is that the current non-formalised CDX files are not directly usable: They require post-processing and extensions by the implementation.
The take away is that the value of a binary search optimized external CDX format is diminishing as scale goes up and specialized implementations are created. Wholly different lookup technologies, such as general purpose databases or search engines, has zero need for the format-oriented optimizations.