DocValues jump tables in Lucene/Solr 8

Lucene/Solr 8 is about to be released. Among a lot of other things is brings LUCENE-8585, written by your truly with a heap of help from Adrien Grand. LUCENE-8585 introduces jump-tables for DocValues, is all about performance and brings speed-ups ranging from worse than baseline to 1000x, extremely dependent on index and access pattern.

This is a follow-up post to Faster DocValues in Lucene/Solr 7+. The previous post contains an in-depth technical explanation of the DocValues mechanisms, while this post focuses on the final implementation.

DocValues?

Whenever the content of a field is to be used for grouping, faceting, sorting, stats or streaming in Solr (or Elasticsearch or Lucene, where applicable), it is advisable to store it using DocValues. It is also used for document retrieval, depending on setup.

DocValues in Lucene 7: Linked lists

Lucene 7 shifted the API for DocValues from random access to sequential. This meant smaller storage footprint and cleaner code, but also caused the worst case single value lookup to scale linear with document count: Getting the value for a DocValued field from the last document in a segment required a visit to all other value blocks.

The linear access time was not a problem for small indexes or requests for values for a lot of documents, where most blocks needs to be visited anyway. Thus the downside of the change was largely unnoticeable or at least unnoticed. For some setups with larger indexes, it was very noticeable and for some of them it was also noticed. For our netarchive search setup, where each segments has 300M documents, there was a severe performance regression: 5-10x for common interactive use.

Text book optimization: Jump-tables

The Lucene 7 DocValues structure behaves as a linked list of data-nodes, with the specializations that it is build sequentially and that it is never updated after the build has finished. This makes it possible to collect the node offsets in the underlying index data during build and to store an array of these offsets along with the index data.

With the node offsets quickly accessible, worst-case access time for a DocValue entry becomes independent of document count. Of course, there is a lot more to this: See the previously mentioned Faster DocValues in Lucene/Solr 7+ for details.

One interesting detail for jump-tables is that they can be build both as a cache on first access (see LUCENE-8374) and baked into the index-data (see LUCENE-8585). I much preferred having both options available in Lucene, to get instant speed up with existing indexes and technically superior implementation for future indexes. Alas, only LUCENE-8585 was deemed acceptable.

Best case test case

Our netarchive search contains 89 Solr collections, each holding 300M documents in 900GB of index data. Each collection is 1 shard, merged down to 1 segment and never updated. Most fields are DocValues and they are heavily used for faceting, grouping, statistics, streaming exports and document retrieval. The impact of LUCENE-8585 should be significant.

In netarchive search, all collections are searched together using an alias. For the tests below only a single collection was used for practical reasons. There are three contenders:

  1. Unmodified Solr 7 collection, using Solr 8.0.0 RC1. Codename Solr 7.
    In this setup, jump-tables are not active as Solr 8.0.0 RC1, which includes LUCENE-8585, only supports index-time jump-tables. This is the same as Solr 7 behaviour.
  2. Solr 7 collection upgraded to Solr 8, using Solr 8.0.0 RC1. Codename Solr 8r1.
    In this setup, jump-tables are active and baked into the index data. This is the expected future behaviour when Solr 8 is released.
  3. Solr 7 collection, using Lucene/Solr at git commit point 05d728f57a28b9ab83208eda9e98c3b6a51830fc. Codename Solr 7 L8374.
    During LUCENE-8374 (search time jump tables) development, the implementation was committed to master. This was later reverted, but the checkout allow us to see what the performance would have been if this path had been chosen.

Test hardware is a puny 4-core i5 desktop with 16GB of RAM, a 6TB 7200RPM drive and a 1TB SSD. About 9GB of RAM free for disk cache. Due to time constraints only the streaming export test has been done on the spinning drive, the rest is SSD only.

Streaming exports

  • Premise: Solr’s export function is used by us to extract selected fields from the collection, typically to deliver a CSV-file with URLs, MIME types, file sizes etc for a corpus defined by a given filter. It requires DocValues to work.
  • DV-Problem: The current implementation of streaming export in Solr does not retrieve the field values in document order, making the access pattern extremely random. This is absolute worst case for sequential DocValues. Note that SOLR-13013 will hopefully remedy this at some point.

The test performs a streaming export of 4 fields for 52,653 documents in the 300M index. The same export is done 4 times, to measure the impact of caching.

curl '<solr>/export?q=text:hestevogn&sort=id+desc&
fl=content_type_ext,content_type_served,crawl_date,content_length'
run 1
seconds
run 2
seconds
run 3
seconds
run4
seconds
Solr 7 spin1705129713521314
Solr 8r1 spin834321
Solr 7 L8374 spin935111
Solr 7 SSD1276125812621262
Solr 8r1 SSD16121
Solr 7 L8374 SSD151
11

Observation: Both Solr 8r1 and Solr 7 L8374 vastly outperforms Solr 7. On a spinning drive there is a multi-minute penalty for run 1 after which the cache has been warmed. This is a well known phenomenon.

Faceting

  • Premise: Faceting is used everywhere and it is a hard recommendation to use DocValues for the requested fields.
  • DV-Problem: Filling the counters used when faceting is done in document order, which works well with sequential access as long as the jumps aren’t too long: Small result sets are relatively heavier penalized than large result sets.
Simple term-based searches with top-20 faceting on 6 fields of varying type and cardinality: domain, crawl_year, public_suffix, content_type_norm, status_code and host.

Reading the graphs: All charts in this blog post follows the same recipe:

  • X-axis is hit count (aka result set size), y-axis is response time (lower is better)
  • Hit counts are bucketed by order of magnitude and for each magnitude, boxes are shown for the three contenders: Blue boxes are Solr 7, pink are Solr 8r1 and green are Solr 7 L8374
  • The bottom of a box is the 25 percentile, the top is the 75 percentile. The black line in the middle is the median. Minimum response time for the bucket is the bottom spike, while the top spike is 95 percentile
  • Maximum response times are not shown as they tend to jitter severely due to garbage collection

Observation: Modest gains from jump-tables with both Solr 8rc1 and Solr 7 L8374.
Surprisingly the gains scale with hit count, which should be investigated further.

Grouping

  • Premise: Grouping is used in netarchive search to collapse multiple harvests of the same URL. As with faceting, using DocValues for grouping fields are highly recommended.
  • DV-Problem: As with faceting, group values are retrieved in document order and follows the same performance/scale logic.
Simple term-based searches with grouping on the high-cardinality (~250M unique values) field url_norm.

Observations: Modest gains from jump-tables, similar to faceting.

Sorting

  • Premise: Sorting is a basic functionality.
  • DV-Problem: As with faceting and grouping, the values used for sorting are retrieved in document order and follows the same performance/scale logic.

This tests performs simple term-based searches with sorting on the high-cardinality field content_length.

Observations: Modest gains from jump-tables.
Contrary to faceting and grouping, performance for high hit counts are the same for all 3 setups, which fits with the theoretical model.
Positively surprising is that the theoretical overhead of the jump-tables does not show for higher hit counts.

Document retrieval

  • Premise: Content intended for later retrieval can either be stored explicitly or as docValues. Doing both means extra storage, but also means that everything is retrieved from the same stored (and compressed) blocks, minimizing random access to the data. For the netarchive search at the Royal Danish Library we don’t double-store field data and nearly all of the 70 retrievable fields are docValues.
  • DV-Problem: Getting a search result is a multi-step process. Early on, the top-X documents matching the query are calculated and their IDs are collected. After that the IDs are used for retrieving document representations. If this is done from DocValues, it means random access linear to the number of documents requested.
Simple term-based relevance ranked searches for the top-20 matching documents with 9 core fields: id, source_file_s, url_norm, host, domain, content_type_served, content_length, crawl_date and content_language.

Observations: Solid performance improvement with jump-tables.

Production request

  • Premise: The different functionalities are usually requested in combination. At netarchive search a typical request uses grouping, faceting, cardinality counting and top-20 document retrieval.
  • DV-Problem: Combining functionality often means that separate parts of the index data are accessed. This can cause cache thrashing if there is not enough free memory for disk cache. With sequential DocValues, all intermediate blocks needs to be visited, increasing the need for disk cache. Jump-tables lowers the number of storage requests and are thus less reliant on cache size.
Simple term-based relevance ranked searches for the top-20 matching documents, doing grouping, faceting, cardinality and document retrieval as described in the tests above.

Observations: Solid performance improvement with jump tables.
As with the previous analysis of search-time jump tables, utilizing multiple DocValues-using functionality has a cocktail effect where the combined impact is larger than the sum of the parts. This might be due to disk cache thrashing.

Overall observations & conclusions

  • The effect of jump tables, both with Solr 8.0.0 RC1 and LUCENE-8374, is fairly limited; except for export and document retrieval, where the gains are solid.
  • The two different implementations of jump tables performs very similar. Do remember that these tests does not involve index updates at all: As LUCENE-8374 is search-time, it does have a startup penalty when indexes are updated.

For a the large segment index tested above, the positive impact of jump tables is clear. Furthermore there is no significant slow down for higher hit counts with faceting/grouping/statistics, where the jump tables has no positive impact.

Before running these tests, it was my suspicion that the search-time jump tables in LUCENE-8374 would perform better than the baked-in version. This showed not to be the case. As such, my idea of combining the approaches by creating in-memory copies of some of the on-storage jump tables has been shelved.

Missing

Performance testing is never complete, it just stops. Some interesting thing to explore could be

  • Spinning drives
  • Concurrent requests
  • Raw search speed with rows=0
  • Smaller corpus
  • Variations of rows, facet.limit and group.limit
  • Kibana and similar data-visualization tools
Advertisements
Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr, Uncategorized | 5 Comments

Faster DocValues in Lucene/Solr 7+

This is a fairly technical post explaining LUCENE-8374 and its implications on Lucene, Solr and (qualified guess) Elasticsearch search and retrieval speed. It is primarily relevant for people with indexes of 100M+ documents.

Teaser

We have a Solr setup for Netarchive Search at the Royal Danish Library. Below are response times grouped by the magnitude of the hitCount with and without the Lucene patch.

teaser_run3_solrwayback

Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

As can be seen, the median response time with the patch is about half that of vanilla Solr. The 95% percentile shows that the outliers has also been markedly reduced.

Long explanation follows as to what the patch does and why indexes with less than 100M documents are not likely to see the same performance boost.

Lucene/Solr (birds eye)

Lucene is a framework for building search engines. Solr is a search engine build using Lucene. Lucene, and thereby Solr, is known as an inverted index, referring to the termsdocuments structure that ensures fast searches in large amounts of material.

As with most things, the truth is a little more complicated. Fast searches are not enough: Quite obviously it also helps to deliver a rich document representation as part of the search. More advanced features are grouping, faceting, statistics, mass exports etc. All of these have in common that they at some point needs to map documentsterms.

Lucene indexes are logically made up of segments containing documents made up of fields containing terms (or numbers/booleans/raws…). Fields can be

  • indexed for searching, which means termsdocuments lookup
  • stored for document retrieval
  • docValues for documentsterms lookup

stored and docValues representations can both be used for building a document representation as part of common search. stored cannot be used for grouping, faceting and similar purposes. The two strengths of stored are

  • Compression, which is most effective for “large” content.
  • Locality, meaning that all the terms for stored fields for a given document are stored together, making is low-cost to retrieve the content for multiple fields.

Whenever grouping, faceting etc. needs the documentsterms mapping, it can either be resolved from docValues, which are build for this exact purpose, or by un-inverting the indexed terms. Un-inversion costs time & memory, so the strong recommendation is to enable docValues for grouping, faceting etc.

DocValues in Lucene/Solr 7+ (technical)

So the mission is to provide a documentsterms (and numbers/booleans/etc) lookup mechanism. In Lucene/Solr 4, 5 & 6 this mechanism had a random access API, meaning that terms could be requested for documents in no particular order. The implementation presented some challenges and from Lucene/Solr 7 this was changed to an iterator API (see LUCENE-7407), meaning that terms must be resolved in increasing document ID order. If the terms are needed for a document with a lower ID that previously requested, a new iterator must be created and the iteration starts from the beginning.

Most of the code for this is available in Lucene70DocValuesProducer and IndexedDISI. Digging into it, the gains from the iterative approach becomes apparent: Besides a very clean implementation with lower risk of errors, the representation is very compact and requires very little heap to access. Indeed, the heap requirement for the search nodes in Netarchive Search at the Royal Danish Library was nearly halved when upgrading from Solr 4 to Solr 7. The compact representation is primarily the work of Adrian Grand in LUCENE-7489 and LUCENE-7589.

When reading the wall of text below, it helps to mentally view the structures as linked lists: To get to a certain point in the list, all the entries between the current entry and the destination entry needs to be visited.

DocValues sparseness and packing

It is often the case that not all documents contains terms for a given field. When this is case, the field is called sparse.

A trivial representation for mapping documentsterms for a field with 0 or 1 long values per document would be an array of long[#documents_in_segment], but this takes up 8 bytes/document, whether the document has a value defined or not.

LUCENE-7489 optimizes sparse values by using indirection: First step is to determine whether a document has a value or not. If it has a value, an index into a value-structure is derived. The second step is to retrieve the value from the value-structure. IndedDISI takes care of the first step:

For each DocValues field, documents are grouped in blocks of 65536 documents. Each block starts with meta-data stating the block-ID and the number of documents in the block that has a value for the field. There are 4 types of blocks:

  • EMPTY: 0 documents in the block has a term.
  • SPARSE: 1-4095 documents in the block has a term.
  • DENSE: 4096-65535 documents in the block has a term.
  • ALL: 65536 documents in the block has a term.

Step 1.1: Block skipping

To determine if a document has a value and what the index of the value is, the following pseudo-code is used:

while (blockIndex < docID/65536) {
  valueIndex += block.documents_with_values_count
  block = seekToBlock(block.nextBlockOffset)
  blockIndex++}
if (!block.hasValue(docID%65536)) {  // No value for docID
  return
}
valueIndex += block.valueIndex(docID%65536)

Unfortunately it does not scale with index size: At the Netarchive at the Royal Danish Library, we use segments with 300M values (not a common use case), which means that 4,500 blocks must be iterated in the worst case.

Introducing an indexValue cache solves this and the code becomes

valueIndex = valueCache[docID/65536]
block = seekToBlock(offsetCache[docID/65536])
if (!block.hasValue(docID%65536) {  // No value for docID
  return
}
valueIndex += block.valueIndex(docID%65536)

The while-loop has been removed and getting to the needed block is constant-time.

Step 1.2: Block internals

Determining the value index inside of the block is trivial for EMPTY and ALL blocks. SPARSE is a list of the documentIDs with values that is simply iterated (this could be a binary search). This leaves DENSE, which is the interesting one.

DENSE blocks contains a bit for each of its 65536 documents, represented as a bitmap = long[1024]. Getting the value index is a matter of counting the set bits up to the wanted document ID:

inBlockID = docID%65536
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
}
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

This is not as bad as it seems as counting bits in a long is a single processor instruction on modern CPUs. Still, doing 1024 of anything to get a value is a bit much and this worst-case is valid for even small indexes.

This is solved by introducing another cache: rank = char[256] (a char is 16 bytes):

inBlockID = docID%65536
valueIndex = rank[inBlockID/8]
inBlockIndex = inBlockID/8*8
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
}
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

Worst-case it reduced to a rank-cache lookup and summing of the bits from 8 longs.

Now that step 1: Value existence and value index has been taken care of, the value itself needs to be resolved.

Step 2: Whole numbers representation

There are different types of values Lucene/Solr: Strings, whole numbers, floating point numbers, booleans and binaries. On top of that a field can be single- or multi-valued. Most of these values are represented in a way that provides direct lookup in Lucene/Solr 7, but whole numbers are special.

In Java whole numbers are represented in a fixed amount of bytes, depending on type: 1 byte for byte, 2 bytes for short or char, 4 bytes for integer and 8 bytes for long. This is often wasteful: The sequence [0, 3, 2, 1] could be represented using only 2 bits/value. The sequence [0, 30000000, 20000000, 10000000] could also be represented using only 2 bits/value if it is known that the greatest common divisor is 10⁷. The list of tricks goes on.

For whole numbers, Lucene/Solr uses both the smallest amount of bits required by PackedInts for a given sequence as well as greatest common divisor and constant offset. These compression techniques works poorly both for very short sequences and for very long ones; LUCENE-7589 splits whole numbers into sequences of 16384 numbers.

Getting the value for a given index is a matter of locating the right block and extracting the value from that block:

while (longBlockIndex < valueIndex/16386) {
  longBlock = seekToLongBlock(longBlock.nextBlockOffset)
  longBlockIndex++
}
value = longBlock.getValue(valueIndex%16386)

This uses the same principle as for value existence and the penalty for iteration is also there: In our 300M documents/segment index, we have 2 numeric fields where most values are present. They have 28,000 blocks each, which must be all be visited in the worst case.

The optimization is the same as for value existence: Introduce a jump table.

longBlock = seekToLongBlock(longJumps[valueIndex/16384))
value = longBlock.getValue(valueIndex%16386)

Value retrieval becomes constant time.

Theoretical observations

  • With a pure iterative approach, performance goes down when segment size goes up and the amount of data to retrieve goes up slower than index size. The performance slowdown only happens after a certain point! As long as the gap between the docIDs is small enough to be within the current or the subsequent data chunk, pure iteration is fine.
    Consequently, the requests that involves lots of monotonically increasing docID lookups (faceting, sorting & grouping for large result sets) fits the iterative API well as they needs data from most data blocks.
    Requests that involves fewer monotonically increasing docID lookups (export & document retrieval for all requests, faceting, sorting & grouping for small result sets) fits poorly as they result in iteration over data blocks that do not provide any other information than a link to the next data block.
  • As all the structures are storage-backed, iterating all data blocks – even when it is just to get a pointer to the next block – means a read request. This is problematic, unless there is plenty of RAM for caching: Besides the direct read-time impact, the docValues structures will hog the disk cache.

With this in mind, it makes sense to check the patch itself for performance regressions with requests for a lot of values as well as test with the disk cache fully warmed and containing the structures that are used. Alas, this has to go on the to-do for now.

Tests

Hardware & index

Testing was done against our production Netarchive Search. It consists of 84 collections, accessed as a single collection using Solr’s alias mechanism. Each collection is roughly 300M documents / 900GB of index data optimized to 1 segment, each segment on a separate SSD. Each machine has 384GB of RAM with about 220GB free for disk cache. There are 4 machines, each serving 25 collections (except the last one that only serves 9 at the moment). This means that ~1% of total index size is disk cached.

Methodology

  • Queries were constructed by extracting terms of varying use from the index and permutating them for simple 1-4 term queries
  • All tests were against the full production index, issued at times when it was not heavily used
  • Queries were issued single-threaded, with no repeat queries
  • All test setups were executed 3 times, with a new set of queries each time
  • The order of patch vs. sans-patch tests was always patch first, to ensure that any difference in patch favour was not due to better disk caching

How to read the charts

All charts are plotted with number of hits on the x-axis and time on the y-axis. The x-axis is logarithmic with the number of hits bucketed by magnitude: First bucket holds all measurements with 1-9 hits, second bucket holds those with 10-99 hits, the third holds those with 100-999 hits and so forth.

The response times are displayed as box plots where

  • Upper whisker is the 95% percentile
  • Top of the box is 75% percentile
  • Black bar is 50% percentile (the median)
  • Bottom of the box is 25% percentile
  • Lower whisker is minimum measured time

Each bucket holds 4 boxes

  • Test run 2, patch enabled
  • Test run 2, vanilla Solr
  • Test run 3, patch enabled
  • Test run 3, vanilla Solr

Test run 1 is discarded to avoid jitter from cache warming. Ideally the boxes from run 3 should be the same as for run 2. However, as the queries are always new and unique, an amount of variation is to be expected.

Important note 1: The Y-axis max-value changes between some of the charts.

Document retrieval

There seems to be some disagreement as to whether the docValues mechanism should ever be used to populate documents, as opposed to using stored. This blog post will only note that docValues are indeed used for this purpose at the Royal Danish Library and let it be up to the reader to seek more information on the matter.

There are about 70 fields in total in Netarchive Search, with the vast majority being docValued String fields. There are 6 numeric DocValued fields.

docs_run23_rows20

Retrieval of top-20 documents with all field values

Observation: Response times for patched (blue & green) are markedly lower than vanilla (ping & orange). The difference is fairly independent of hit count, which matches well with the premise that the result set size is constant at 20 documents.

Grouping

Grouping on the String field url_norm field is used in Netarchive Search to avoid seeing too many duplicates. To remove the pronounced difference caused by document retrieval, only the single field url_norm is requested for only 1 group with 1 document.

grouping_run23_rows1_fl1

Grouping on url_norm

Observation: The medians for patched vs. vanilla are about the same, with a slight edge to patched. The outliers (the top T of the boxes) are higher for vanilla.

Faceting

Faceting is done for 6 fields of varying cardinality. As with grouping, the effect of document retrieval is sought minimized.

faceting_run23_rows1_fl1

Faceting on fields domain, crawl_year, public_suffix, content_type_norm, status_code, host

Observation: Patched is an improvement over vanilla up to 10M+ hits.

Sorting

In this test, sorting is done descending on content_length, to locate the largest documents in the index. As with grouping, the effect of document retrieval is sought minimized.

sorting_run23_rows1_fl1

Sorting on content_length

Observation: Patched is a slight improvement over vanilla.

Cardinality

In order to provide an approximate hitCount with grouping, the cardinality of the url_norm field is requested. As with grouping, the effect of document retrieval is sought minimized.

HyperLogLog cardinality on url_norm

HyperLogLog cardinality on url_norm

Observation: Too much jitter to say if patch helps here.

Numeric statistics

Statistics (min, max, average…) on content_length is a common use case in Netarchive Search. As with grouping, the effect of document retrieval is sought minimized.

stats__rows1_fl1

Numeric statistics on content_length

Observation: Patched is a slight improvement over vanilla.

Cocktail effect, sans document

Combining faceting, grouping, stats and sorting while still minimizing the effect of document retrieval.

multi_rows1_fl1

Faceting on 6 fields, grouping on url_norm, stats on content_length and sorting on content_length

Observation: Patched is a clear improvement over vanilla.

Production request combination

The SolrWayback front end for Netarchive Search commonly use document retrieval for top-10 results, grouping, cardinality and faceting. This is the same chart as the teaser at the top, with the addition of test run 2.

cocktail_run23_solrwayback

Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

Observation: Patched is a pronounced improvement over vanilla.

The combination of multiple docValues using request parameters is interesting as the effect of the patch on the whole seems greater than the sum of the individual parts. This could be explained by cache/IO saturation when using vanilla Solr. Whether the cause, this shows that it is important to try and simulate real-world workflows as close as possible.

Overall observations

  • For most of the performance tests, the effect of the LUCENE-8374 patch vs. vanilla is pronounced, but limited in magnitude
  • Besides lowing the median, there seems to be a tendency for the patch to reduce outliers, notably for  grouping
  • For document retrieval, the patch improved performance significantly. Separate experiments shows that export gets a similar speed boost
  • For all the single-feature tests, the active parts of the index data are so small that they are probably cached. Coupled with the limited improvement that the patch gives for these tests, it indicates that the patch will in general have little effect on systems where the full index is is disk cache
  • The performance gains with the “Production request combination” aka the standard requests from our researchers, are very high

Future testing

  • Potential regression for large hit counts
  • Max response times (not just percentile 95)
  • Concurrent requests
  • IO-load during tests
  • Smaller corpus
  • Export/streaming
  • Disk cache influence

Want to try?

There is a patch for Solr trunk at LUCENE-8374 and it needs third party validation from people with large indexes. I’ll port it to any Solr 7.x-version requested and if building Solr is a problem, I can also compile it and put it somewhere.

Hopefully it will be part of Solr at some point.

Update 20181003: Patch overhead and status

Currently the patch is search-time only. Technically is could also be index-time by modifying the codec.

For a single index in the Netarchive Search setup, the patch adds 13 seconds to first search-time and 31MB of heap out of 8GB allocated for the whole Solr. The 13 seconds is in the same ballpark (this is hard to measure) as a single unwarmed search with top-1000 document retrieval.

The patch is ported to Solr 7.3.0 and used in production at the Royal Danish Library. It is a debug-patch, meaning that the individual optimizations can be enabled selectively for easy performance comparison.

See the LUCENE-8374 JIRA-issue for details.

Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr | 1 Comment

Prebuild Big Data Word2Vec dictionaries

                   Prebuild and trained Word2Vec dictionaries ready for use

Two different prebuild big data Word2Vec dictionaries has been added to LOAR (Library Open Access Repository) for download. These dictionaries are build from the text of 55,000 e-books from Project Gutenberg and 32.000.000 Danish newspaper pages.

35.000 of the Gutenberg e-books are English, but over 50 different languages are present in the dictionaries. Even though they are different languages the Word2Vec algorithm did a good job of separating the different languages so it is almost like 50 different Word2Vec dictionaries.

The text from the danish newspapers is not public available so you would not be able to build this dictionary yourself. A total of 300Gb of raw text went into building the dictionary, so it is probably the largest Word2Vec dictionary build on a Danish corpus. Since the danish newspapers suffer from low quality OCR, many of words in the dictionary are misspellings. Using this dictionary it was possible to fix many of the OCR errors due the nature of the Word2Vec algorithm, since a given word appears in similar contexts despite its misspellings and is identified by its context. (see https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/)

Download and more information about the Word2Vec dictionaries:

Download

 

Online demo of the two corpora: Word2Vec demo

 

 

 

 

 

 

Posted in Uncategorized | Leave a comment

SolrWayback software bundle has been released

The SolrWayback software bundle can be used to search and playback archived webpages in Warc format. It is an out of the box solution with index workflow, Solr and Tomcat webserver and a free text search interface with playback functionality. Just add your Warc to a folder and start the index job.

The search interface has additional features besides freetext search. This includes:

  • Image search similar to google images
  • Search by uploading a file. (image/pdf etc.) See if the resource has been harvested and from where.
  • Link graph showing links (ingoing/outgoing) for domains using the D3 javascript framework.
  • Raw download of any harvested resource from the binary Arc/Warc file.
  • Export a search resultset to a Warc-file. Streaming download, no limit of size of resultset.
  • An optional built in SOCKS proxy can be used to view historical webpages without browser leaking resources from the live web.

See the GitHub page for screenshots of SolrWayback and scroll down to the install guide try it out.

Link: SolrWayback

 

solrwayback_search.png

solrwayback_linkgraph.pnggps_exif_search.png

 

Posted in Uncategorized | Leave a comment

Visualising Netarchive Harvests

 

An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a recent frontend visualisation widget we have made.

From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days to enable us to visualise the data. Futhermore the data is given an activity level from 0-4.

The high-level overview seen below is the year-month graph. Each cell is coloured based on the activity level relative to the number of harvests in the most active month. For now we use a linear calculation so gray means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100% of the most active month. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.

1-overview-no-url

 

We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.

2-year-week-no-url

 

In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011, for this blog post).

3-all-years-no-url

 

There are both advantages and disadvantages to using the browser-based data calculation. One of the main advantages is a very portable frontend application. It can be used with any backend application that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. Drawbacks to this approach is, of course, the scalability. Currently the application processes 25.000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and it is included in a single script tag, including styles.

Ideas for further development.

We would like to expand this to also include both:

  1. multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time, e.g. the URL http://pol.dk, http://www.pol.dk and https://politiken.dk could be used for the danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.
Posted in Blogging, Solr, Web | Tagged , | Leave a comment

SolrWayback Machine

Another ‘google innovation week’ at work has produced the SolrWayback Machine. It works similar to the Internet Archive: Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (Warc files).  The Danish Internet Archive has over 20billion harvested web objects and takes 1 Petabyte of storage.

The SolrWayback engine require you have indexed the Warc files using the Warc-indexer tool from British Library. (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer).

It is quite fast and comes with some additional features as well:

  •  Image search similar to google images
  •  Link graphs showing  links (ingoing/outgoing) for domains using the D3 javascript framework.
  •  Raw download of any harvested resource from the binary Arc/Warc file.

Unfortunately  the collection is not available for the public so I can not show you the demo. But here is a few pictures from the SolrWayback machine.

solrwayback_demo

solrwayback_linkgraph.png

SolrWayback at GitHub: https://github.com/netarchivesuite/solrwayback/

Posted in Uncategorized | 1 Comment

juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives:

Combining those two ideas seemed like a logical next step and juxta was born: A fairly small bash-script for creating million-scale collages of images, with no special server side.  There’s a small (just 1000 images) demo at SBLabs.

Presentation principle

The goal is to provide a seamless transition from the full collection to individual items, making it possible to compare nearby items with each other and locate interesting ones. Contextual metadata should be provided for general information and provenance.

Concretely, the user is presented with all images at once and can zoom in to individual images in full size. Beyond a given threshold, metadata are show for the image currently under the cursor, or finger if a mobile device is used. An image description is displayed just below the focused image, to avoid disturbing the view. A link to the source of the image is provided on top.

kob1_crop

Overview of historical maps

kob2_crop

Meta-data for a specific map

Technical notes, mostly on scaling

On the display side, OpenSeadragon takes care of the nice zooming. When the user moves the focus, a tiny bit of JavaScript spatial math resolves image identity and visual boundaries.

OpenSeadragon uses pyramid tiles for display and supports the Deep Zoom protocol can be implemented using only static files. The image to display is made up of tiles of (typically) 256×256 pixels. When the view is fully zoomed, only the tiles within the viewport are requested. When the user zooms out, the tiles from the level above are used. The level above is half the width and half the height and is thus represented by ¼ the amount of tiles. And so forth.

Generating tiles is heavy

A direct way of creating the tiles is

  1. Create one large image of the full collage (ImageMagick’s montage is good for this)
  2. Generate tiles for the image
  3. Scale the image down to 50%×50%
  4. If the image is larger than 1×1 pixel then goto 2

Unfortunately this does not scale particularly well. Depending on size and tools, it can take up terabytes of temporary disk space to create the full collage image.

By introducing a size constraint, juxta removes this step: All individual source images are scaled & padded to have the exact same size. The width and height of the images are exact multiples of 256. Then the tiles can be created by

  1. For each individual source image, scale, pad and split the image directly into tiles
  2. Create the tiles at the level above individually by joining the corresponding 4 tiles below and scale to 50%×50% size
  3. If there are more than 1 tile or that tile is larger than 1×1 pixel then goto 2

As the tiles are generated directly from either source images or other tiles, there is no temporary storage overhead. As each source image and each tile are processed individually, it is simple to do parallel processing.

Metadata takes up space too

Displaying image-specific metadata is simple when there are just a few thousand images: Use an in-memory array of Strings to hold the metadata and fetch it directly from there. But when the number of images goes into the millions, this quickly becomes unwieldy.

juxta groups the images spatially in buckets of 50×50 images. The metadata for all the images in a bucket are stored in the same file. When the user moved the focus to a new image, the relevant bucket is fetched from the server and the metadata are extracted. A bucket cache is used to minimize repeat calls.

Most file systems don’t like to hold a lot of files in the same folder

While the limits differ, common file systems such as ext, hfs & ntfs all experience performance degradation with high numbers of files in the same folder.

The Deep Zoom protocol in conjunction with file-based tiles means that the amount of files at the deepest zoom level is linear to the number of source images. If there are 1 million source images, with full-zoom size 512×512 pixels (2×2 tiles), the number of files in a single folder will be 2*2*1M = 4 million. Far beyond the comfort-zone fo the mentioned file systems (see the juxta readme for tests of performance degradation).

juxta mitigates this by bucketing tiles in sub-folders. This ensures linear scaling of build time at least up to 5-10 million images. 100 million+ images would likely deteriorate build performance markedly, but at that point we are also entering “is there enough free inodes on the file system?” territory.

Unfortunately the bucketing of the tile files is not in the Deep Zoom standard. With OpenSeadragon, it is very easy to change the mapping, but it might be more difficult for other Deep Zoom-expecting tools.

Some numbers

Using a fairly modern i5 desktop and 3 threads, generating a collage of 280 5MPixel images, scaled down to 1024×768 pixels (4×3 tiles) took 88 seconds or about 3 images/second. Repeating the experiment with a down-scale to 256×256 pixels (smallest possible size) raised the speed to about 7½ image/second.

juxta comes with a scale-testing script that generates sample images that are close (but not equal) to the wanted size and repeats them for the collage. With this near-ideal match, processing speed was 5½ images/second for 4×3 tiles and 33 images/second for 1×1 tiles.

The scale-test script has been used up to 5 million images, with processing time practically linear to the number of images. At 33 images/second that is 42 hours.

Posted in Uncategorized | Leave a comment