Net Archive Search building blocks

An extremely webarchive-discovery- and Statsbiblioteket-centric description of some of the technical possibilities with Net Archive Search. This could be considered internal documentation, but we like to share.

There are currently two generations of indexes at Statsbiblioteket: v1 (22TB) and v2 (8TB). External access is currently to v1. As the features of v2 are a superset of v1's, v1 will be disabled as soon as v2 catches up in terms of the amount of indexed material. ETA: July or August 2015.

Aggregation fields

The following fields have the DocValues option enabled, meaning that they can be exported efficiently and that sorting, grouping and faceting on them carries a low memory cost. Availability in the v1 and v2 indexes is noted in parentheses for each field.
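
As a quick illustration, sorting on crawl_date and faceting on domain both rely on DocValues, so a request like the following runs at a low memory cost (the query itself is an arbitrary example):

q=kittens&rows=10&sort=crawl_date+asc&fl=url_norm,crawl_date&facet=true&facet.field=domain&facet.limit=20&wt=json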

Network

url_norm (v2): The resource URL, lightly normalised (lowercased, www removed, etc.) the same way as links. The two fields together can be used to generate graphs of resource interconnections; see the sketch after this table. This field is also recommended for grouping. To get unique URLs in their earliest versions, do

group=true&group.field=url_norm&group.sort=crawl_date+asc

links (v2): Outgoing links from web pages, normalised the same way as url_norm. As the cardinality is non-trivial (~600M values per TB of index), it is strongly recommended to enable low-memory mode if this field is used for faceting:

facet=true&facet.field=links&f.links.facet.sparse.counter=nplanez

host (v1, v2): The host part of the URL for the resource. Example: vejret.tv2.dk.
domain (v1, v2): The domain part of the URL for the resource. Example: tv2.dk.
links_hosts (v2): The host part of all outgoing links from web pages.
links_domain (v2): The domain part of all outgoing links from web pages.
links_public_suffixes (v2): The public suffix of all outgoing links from web pages. Examples: dk, org.uk, nu.
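
To sketch how url_norm and links combine into a link graph, a request along these lines returns one source node (url_norm) and its outgoing edges (links) per document. The query and row count are illustrative only; for large extractions, paging (e.g. cursorMark) would be needed:

q=domain:tv2.dk&rows=1000&fl=url_norm,links&wt=json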

Time and harvest-data

last_modified (v2): As reported by the web server. Note that this is not a reliable value, as servers are not reliable.
last_modified_year (v2): The year part of last_modified.
crawl_date (v2): The full and authoritative timestamp for when the resource was collected. For coarser-grained (and faster) statistics, consider using crawl_date_year; see the example after this table.
crawl_date_year (v2): The year part of crawl_date.
publication_date (v2): The publication date as reported by the web server. Not authoritative.
publication_date_year (v2): The year part of publication_date.
arc_orig (v2): Where the ARC file originated. Possible values are sb & kb. If used for faceting, it is recommended to use enum:

facet=true&facet.field=arc_orig&f.arc_orig.facet.method=enum

arc_job (v2): The ID of the harvest job, as used by the Danish Net Archive when performing a crawl.
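
As an example of coarse-grained statistics, a yearly activity profile for a single domain can be collected with a plain facet on crawl_date_year (the restriction to a domain is a hypothetical example; any query works):

q=domain:dr.dk&rows=0&facet=true&facet.field=crawl_date_year&facet.sort=index&wt=json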

Content

url (v1, v2): The resource URL as requested by the harvester. Consider using url_norm instead to reduce the amount of noise.
author (v2): As stated in PDFs, Word documents, presentations etc. Unfortunately the content is highly unreliable, with a high degree of garbage.
content_type (v2): The MIME type returned by the web server.
content_length (v2): The size of the raw resource, measured in bytes. Consider using this with Solr range faceting; see the example after this table.
content_encoding (v2): The character set for textual resources.
content_language (v2): Auto-detected language of the resource. Unreliable for small text samples, but quite accurate on larger ones.
content_type_norm (v1, v2): Highly normalised content type. Possible values are: html, text, pdf, other, image, audio, excel, powerpoint, video & word.
content_type_version, content_type_full, content_type_tika, content_type_droid, content_type_served, content_type_ext (v2): Variations of the content type, resolved from servers and third-party tools.
server (v2): The web server, as self-reported.
generator (v2): The web page generator.
elements_used (v2): All HTML elements used on web pages.
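
The range faceting mentioned for content_length could look like the following; the bucket boundaries are arbitrary examples:

q=content_type_norm:image&rows=0&facet=true&facet.range=content_length&facet.range.start=0&facet.range.end=1000000&facet.range.gap=100000&wt=json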

Search & stored fields

It is not recommended to sort, group or facet on the following fields. If doing so becomes relevant, DocValues can be enabled for them in v3.

id (v1, v2): The unique Solr ID of the resource. Used together with highlighting or for graph exploration.
source_file_s (v2): The name of the ARC file and the offset of the resource. Example: 43084-88-20090304190229-00002-kb-prod-wb-001.kb.dk.arc@54173743.

This can be used as a CDX-lookup replacement by limiting the fields returned:

fl=url,source_file_s&rows=500

arc_harvest (v2): The harvest ID from the crawler.
hash (v2): SHA1 hash of the content. Can be used for finding exact duplicates; see the example after this table.
ssdeep_hash_bs_3, ssdeep_hash_bs_6, ssdeep_hash_ngram_bs_3, ssdeep_hash_ngram_bs_6 (v2): Fuzzy hashes. Can be used for finding near-duplicates.
content (v1, v2): The full extracted text of the resource (available as content_text in v1). Used for text mining or highlighting:

hl=true&hl.fl=content
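
For example, all exact copies of a known resource can be located by searching on its hash. The hash value below is a made-up placeholder; real values follow the format produced by the indexer:

q=hash:"sha1:2WAXX5NUWNNCS2N6DBDXGWVB56OEZI"&fl=url,crawl_date,source_file_s&wt=json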

CDX-alternative

  1. Get core data for a single page:
    .../select?
    q=url:"http://www.dr.dk/" crawl_year:2010 content_type_norm:html&
    rows=1&fl=url,crawl_date,source_file_s&wt=json
    

    gives us

    "docs": [
          {
            "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
            "url": "http://www.dr.dk/",
            "crawl_date": "2010-06-18T14:33:29Z"
          }
        ]
    
  2. Request the resource 86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735 from storage and extract all links to images, CSS etc. The result is a list of URLs like http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg.
  3. Make a new request for the URLs from #2, grouped by unique URL, sorted by temporal distance to the originating page:
    .../select?
    q=url:("http://www.dr.dk/" OR 
           "http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg")&
    rows=5&fl=url,crawl_date,source_file_s&wt=json&
    group=true&group.field=url_norm&
    group.sort=abs(sub(ms(2010-06-18T14:33:29Z), crawl_date)) asc
    

    gives us

    "groups": [
            {
              "groupValue": "http://dr.dk/design/www/global/img/global/drlogos/dr_logo.jpg",
              "doclist": {
                "numFound": 331,
                "start": 0,
                "docs": [
                  {
                    "source_file_s": "87154-32-20100624134901-00003-sb-prod-har-005.arc@7259371",
                    "url": "http://www.dr.dk/design/www/global/img/global/DRLogos/DR_logo.jpg",
                    "crawl_date": "2010-06-24T13:51:10Z"
                  }
                ]
              }
            },
            {
              "groupValue": "http://www.dr.dk/",
              "doclist": {
                "numFound": 796,
                "start": 0,
                "docs": [
                  {
                    "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735",
                    "url": "http://www.dr.dk/",
                    "crawl_date": "2010-06-18T14:33:29Z"
                  }
                ]
              }
            }
          ]
    

Et voilà: reconstruction of a webpage from a given point in time, using only search and access to the (W)ARC files.
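
The flow above lends itself to scripting. Below is a minimal Python sketch under stated assumptions: the SOLR endpoint URL is hypothetical, fetch_resource is a placeholder for whatever resolves a file@offset pair against local (W)ARC storage, and link extraction is reduced to a crude regex for brevity:

    import json
    import re
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/netarchive/select"  # hypothetical endpoint

    def solr(params):
        """Perform a Solr request and return the decoded JSON response."""
        url = SOLR + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    def fetch_resource(file_at_offset):
        """Placeholder: resolve 'arcfile@offset' against local (W)ARC storage."""
        raise NotImplementedError("Depends on how the archive files are stored")

    # Step 1: locate the core page.
    page = solr({
        "q": 'url:"http://www.dr.dk/" crawl_year:2010 content_type_norm:html',
        "rows": 1, "fl": "url,crawl_date,source_file_s", "wt": "json",
    })["response"]["docs"][0]

    # Step 2: fetch the HTML from storage and extract embedded resource URLs.
    html = fetch_resource(page["source_file_s"])
    resource_urls = re.findall(r'(?:src|href)="(http[^"]+)"', html)

    # Step 3: for each URL, pick the version closest in time to the page.
    groups = solr({
        "q": "url:(" + " OR ".join('"%s"' % u for u in resource_urls) + ")",
        "rows": 5, "fl": "url,crawl_date,source_file_s", "wt": "json",
        "group": "true", "group.field": "url_norm",
        "group.sort": "abs(sub(ms(%s), crawl_date)) asc" % page["crawl_date"],
    })["grouped"]["url_norm"]["groups"]

    for group in groups:
        best = group["doclist"]["docs"][0]
        print(best["url"], "->", best["source_file_s"])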

2 Responses to Net Archive Search building blocks

  1. Manuel Lenormand says:

    Hi there,
    I wonder how satisfied you are with ssdeep fuzzy hashing for near-deduplication.
    You’ve implemented them as update processors? How much tuning did you have to do? How did you decide on the four different hashing values?

  2. Toke Eskildsen says:

    Sorry, Manuel, but we are not (yet) doing active de-duplication. The hashes are there because 1) UKWA found them useful and 2) we will experiment with them at some point.

    You could try asking Andy Jackson at UKWA, who knows a lot more about hashing and net archive material than I do.
