Net archive indexing, round 2

Using our experience from our initial net archive search setup, Thomas Egense and I have been tweaking options and adding patches to the fine webarchive-discovery from UKWA for some weeks. We will be re-starting indexing Real Soon Now. So what have we learned?

  • Stored text takes up a huge part of the index: Nearly half of the total index size. The biggest sinner is, not surprisingly, the content field, but we need that for highlighting and potentially text extraction from search results. As we have discovered that we can avoid storing DocValued fields, at the price of increased document retrieval time, we have turned off storing for several fields.
  • DocValue everything! Or at least a lot more than we did initially. Enabling DocValues for a field and getting low-overhead faceting turned out to be a lot disk-space-cheaper than we thought. As every other feature request from the researchers seems to be “We would also like to facet on field X”, our new strategy should make them at least half happy.
  • DocValues are required for some fields. Due to internal limits on facet.method=fc without DocValues, it is simply not possible to do faceting if the number of references gets high.
  • Faceting on outgoing links is highly valuable. Being able to facet on links makes it possible to generate real-time graphs for interconnected websites. Links with host or domain granularity are easily handled and there is no doubt that those should be enabled. Based on positive experimental results with document-granularity links faceting (see section below), we will also be enabling that.
  • The addition of performance instrumentation made it a lot easier for us to prioritize features. We simply do not have time for everything we can think of and some specific features were very heavy.
  • Face recognition (just finding the location of faces in images, not guessing the persons) was an interesting feature, but with a so-so success rate. Turning it on for all images would triple our indexing time and we have little need for sampling in this area, so we will not be doing it at all for this iteration.
  • Most prominent colour extraction was only somewhat heavy, but unfortunately the resulting colour turned out to vary a great deal depending on adjustment of extraction parameters. This might be useful if a top-X of prominent colours were extracted, but for now we have turned off this feature.
  • Language detection is valuable, but processing time is non-trivial and rises linearly with the number of languages to check. We lowered the number of detected languages from 20 to 10, pruning the more obscure (relative to Danish) languages.
  • Meta-data about harvesting turned out to be important for the researchers. We will be indexing the ID of the harvest-job used for collecting the data, the institution responsible and some specific sub-job-ID.
  • Disabling of image-analysis features and optimization of part of the code-base means faster indexing. Our previous speed was 7-8 days/shard, while the new one is 3-4 days/shard. As we have also doubled our indexing hardware capacity, we expect to do a full re-build of the existing index in 2 months and to catch up to the present within 6 months.
  • Our overall indexing workflow, with dedicated builders creating independent shards of a fixed size, worked very well for us. Besides some minor tweaks, we will not be changing this.
  • We have been happy with Solr 4.8. Solr 5 is just out, but as re-indexing is very costly for us, we do not feel comfortable with a switch at this time. We will do the conservative thing and stick to the old Solr 4-series, which currently means Solr 4.10.4.
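
To make the links faceting a bit more concrete, here is a minimal sketch of how the request parameters for a host-granularity links facet could be put together. The facet parameters (facet, facet.field, facet.method, facet.limit, facet.mincount) are standard Solr ones; the field name links_hosts, the Solr URL and the core name are assumptions made for illustration and not necessarily what our setup uses.

```python
from urllib.parse import urlencode


def build_links_facet_query(query, facet_field="links_hosts", limit=20):
    """Build the query string for a Solr facet request on a links field.

    facet_field is a hypothetical field name; use whatever your schema defines.
    """
    params = {
        "q": query,
        "rows": 0,                  # only the facet counts are wanted, not documents
        "facet": "true",
        "facet.field": facet_field,
        "facet.method": "fc",       # the method discussed above; wants DocValues at scale
        "facet.limit": limit,       # top-X most referenced hosts
        "facet.mincount": 1,
        "wt": "json",
    }
    return urlencode(params)


# Hypothetical endpoint; adjust host, port and core name to the local setup.
SOLR_SELECT = "http://localhost:8983/solr/netarchive/select"
url = SOLR_SELECT + "?" + build_links_facet_query("content:solr")
print(url)
```

Fetching the URL and turning the returned host counts into graph edges is left out, as that part depends entirely on the front end.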

Document-level links faceting

The biggest new feature will be document links. This is basically all links present on all web pages at full detail. For a single test shard with 217M documents / 906GB, there were 7 billion references to 640M unique links, the most popular link being used 2.4M times. Doing a full faceted search on *:* was understandably heavy at around 4 minutes, while ad hoc testing of “standard” searches resulted in response times varying from 50 ms to 3500 ms. Scaling up to 25 shards/machine, it will be 175 billion references to 16 billion values. It will be interesting to see the accumulated response time.
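
The per-machine numbers are a plain multiplication of the test shard figures, which can be sketched as below. It assumes that every shard looks like the test shard, and the 16 billion is the sum of per-shard values, not a count of globally unique links, as the same link can show up in several shards.

```python
# Figures from the single test shard, rounded as in the text
SHARD_REFERENCES = 7_000_000_000    # references to links
SHARD_UNIQUE_LINKS = 640_000_000    # unique links (facet values)
SHARDS_PER_MACHINE = 25

# Assumption: all shards have roughly the same link profile as the test shard
machine_references = SHARD_REFERENCES * SHARDS_PER_MACHINE
machine_values = SHARD_UNIQUE_LINKS * SHARDS_PER_MACHINE

# Average number of references per unique link within the test shard
avg_refs_per_link = SHARD_REFERENCES / SHARD_UNIQUE_LINKS

print(f"{machine_references:,} references to {machine_values:,} values per machine")
print(f"{avg_refs_per_link:.1f} references per unique link on average")
```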

We expect this feature to be used to generate visual graphs of interconnected resources, which can be navigated in real-time. Or at least you-have-to-run-to-get-coffee-time. For the curious, here is the histogram for links in the test-shard:

References       #terms
         1  425,799,733
         2   85,835,129
         4   52,695,663
         8   33,153,759
        16   18,864,935
        32   10,245,205
        64    5,691,412
       128    3,223,077
       256    1,981,279
       512    1,240,879
     1,024      714,595
     2,048      429,129
     4,096      225,416
     8,192      114,271
    16,384       45,521
    32,768       12,966
    65,536        4,005
   131,072        1,764
   262,144          805
   524,288          789
 1,048,576          123
 2,097,152           77
 4,194,304            1
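
As a small sanity check, the histogram can be summed up to see that it matches the roughly 640M unique links mentioned above, and to see how dominant the first bucket is. The sketch below only uses the numbers from the table and makes no assumption about the exact bucket boundaries.

```python
# (bucket label, number of terms) copied from the histogram above
HISTOGRAM = [
    (1, 425_799_733), (2, 85_835_129), (4, 52_695_663), (8, 33_153_759),
    (16, 18_864_935), (32, 10_245_205), (64, 5_691_412), (128, 3_223_077),
    (256, 1_981_279), (512, 1_240_879), (1_024, 714_595), (2_048, 429_129),
    (4_096, 225_416), (8_192, 114_271), (16_384, 45_521), (32_768, 12_966),
    (65_536, 4_005), (131_072, 1_764), (262_144, 805), (524_288, 789),
    (1_048_576, 123), (2_097_152, 77), (4_194_304, 1),
]

# Total number of unique links in the test shard
total_terms = sum(terms for _, terms in HISTOGRAM)

# Share of the unique links that fall in the first bucket
first_bucket_share = HISTOGRAM[0][1] / total_terms

# Number of links in the buckets labelled 1,048,576 and up
heavy_terms = sum(terms for label, terms in HISTOGRAM if label >= 1_048_576)

print(f"{total_terms:,} unique links in total")
print(f"{first_bucket_share:.1%} of them are in the first bucket")
print(f"{heavy_terms} links are in the buckets from 1,048,576 and up")
```

So about two thirds of all unique links sit in the lowest bucket, while only a couple of hundred links make it into the top three.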


About Toke Eskildsen

IT-Developer at with a penchant for hacking Lucene/Solr.