Dumb-down at Indexing or Nested Data in the Solr Search Engine

Sigfrid Lundberg, Ph. D., Software Developer

Royal Danish Library

Copenhagen

Denmark

Are passions, then, the Pagans of the soul?

Reason alone baptized? alone ordain’d

To touch things sacred?

(Edward Young — 1683-1765)

Introduction

The Royal Danish Library has been using the Solr
(https://solr.apache.org/) search engine for at least a decade. Almost
all projects that need some search facilities are using it. It is a Swiss
army knife for searching in small as well as big data sets, and a
trusty tool that offers many advantages over the alternative of using a
relational database management system (RDBMS) when working with
resource discovery systems.

At the first Solr workshop I participated in, the teacher reiterated over
and over again that Solr is a search engine, not a data management
tool. Although Solr can Create, Retrieve, Update and Delete (CRUD)
documents, the transactions cannot really be characterized as having
Atomicity, Consistency, Isolation and Durability (ACID).

The role of bibliographic structure

In a library we encounter data with much more structure than simple
packages of attributes and values, but also much less structure than what
you expect from data in database normal form. Bibliographical
data may for instance describe texts that

  • have one or more titles, each of a different type (main title, subtitle, translated title, transcribed title and uniform title, to name some of the frequently used ones)
  • may have one or more authors that may be persons or organizations
  • each of which may have dates of birth and death
  • and an affiliation

If we were using an RDBMS, the data on persons could be stored in one
database table, the titles in another and there could be a third table
keeping track (through joins and foreign keys) of each person’s
relations to the works to which they’ve contributed. Someone made the
illustrations, someone else wrote the texts. A third one made the
graphical design.

For a portrait photograph we have one person being the photographer,
and another being the subject. The data on the subject can be as
important as the data on the artist.

These data are important. For instance the dates are used for
distinguishing between people with the same name. (The digital phone
book krak.dk lists 43 now living persons named Søren Kierkegaard,
while the philosopher died in 1855.) An important use case is deciding
whether a given object is free from copyright or not, such as when the
originator has been dead for more than 75 years.

Add to this that there are many more complications. For instance, we may have
multiple copies of a given title, and each copy may belong to a different
collection with very different provenance. For historical objects we
may even have a manuscript in the manuscript and rare books collections
and a modern pocket book in the open stocks.

The primary purpose of these data is to enable library
patrons and service users to search and retrieve. Typically users type
a single word in the search form, while ignoring the handful of fields
that we provide through the effort of catalogers, software developers
etc.

The user gets a far too long list of results which contains author,
title and perhaps some subjects or keywords. Again typically, the user
clicks on a title and gets a landing page. This is far more detailed;
it presents many fields that will enable users to decide whether to
order or retrieve the object.

If we grossly simplify the process, a user might be up to one of two
things: (1) Finding or resolving a given reference, i.e., the right
object, perhaps the right edition. Or (2) Finding information on a
given topic. See these two as endpoints on a scale.

You may look for a particular book Enten – Eller (Either/Or) by
Kierkegaard (1843) or you might be interested in the role of this
philosopher in your study of the origin of existentialism. In the
former case you actually look for Kierkegaard in the author field, in
the latter case you look for him in the subject field.

Enten - eller cover page

Encoding, indexing and using

Assume we are about to add metadata on Enten – eller by Søren Kierkegaard into a Solr index. What we get for that book might
contain the data below. The example is encoded in a format called
Metadata Object Description Schema (MODS). Note that the namespace
prefix md stands for the URI http://www.loc.gov/mods/v3.

  <md:titleInfo>
    <md:title>Enten - eller</md:title>
    <md:subTitle>Et livsfragment</md:subTitle>
  </md:titleInfo>

  <md:name displayLabel="Author"
           type="personal"
           authorityURI="https://viaf.org/viaf/7392250/"
           xml:lang="en">
    <md:namePart type="family">Kierkegaard</md:namePart>
    <md:namePart type="given">Søren</md:namePart>
    <md:namePart type="date">1813/1855</md:namePart>
    <md:alternativeName altType="pseudonym">
      <md:namePart>Victor Eremita</md:namePart>
      <md:description>“Victorious hermit,” general editor of
      Either/Or, who also appears in the first part of its sequel,
      Stages on Life’s Way. Also the author of the satirical article
      “A Word of Thanks to Professor Heiberg.”</md:description>
    </md:alternativeName>
    <md:role>
      <md:roleTerm type="code">aut</md:roleTerm>
    </md:role>
  </md:name>

  <md:originInfo>
    <md:dateCreated>1843</md:dateCreated>
    <md:publisher>Universitetsboghandler C. A. Reitzel</md:publisher>
  </md:originInfo>

This is a fake record I created for the purpose of this paper. The
work has a title and name. The name has a role (aut as in
author) and parts like family (Kierkegaard), given (Søren) and a date
(1813/1855 which is ISO’s way to express a date range,
i.e., the years between which the philosopher was alive. His Fragment of Life.)

To get the birth and death dates you have to parse a string. As a
matter of fact, the name on the book cover wasn’t Søren Kierkegaard,
but Victor Eremita (Victorious hermit), encoded as an
alternativeName. A telling pseudonym of the author of The Seducer’s
Diary. Søren was good at pseudonyms.
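
To illustrate what "parse a string" means here, the following is a minimal Python sketch. It assumes the date namePart only ever contains plain years separated by a slash, as in my fake record; the function names are made up for this example, and the 75 years are simply the figure used earlier in this text.

```python
from datetime import date
from typing import Optional, Tuple


def parse_life_dates(name_part: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a 'birth/death' year range such as '1813/1855' into two ints.

    A missing side (for example '1950/' or just '1950') is returned as None.
    Assumption: only plain years occur, not full ISO dates.
    """
    birth, _, death = name_part.strip().partition("/")
    return (int(birth) if birth else None, int(death) if death else None)


def dead_for_more_than(name_part: str, years: int,
                       today: Optional[date] = None) -> bool:
    """True if a death year is known and lies more than `years` years back."""
    _, death = parse_life_dates(name_part)
    if death is None:
        return False
    current_year = (today or date.today()).year
    return current_year - death > years


print(parse_life_dates("1813/1855"))
print(dead_for_more_than("1813/1855", 75))
```

A real record may carry full dates or open-ended ranges, so production code would need a proper date parser rather than this split on a slash.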

The <md:role> ... </md:role> permits the cataloger to encode that a
person called Søren has the aut relation to the work. The Library of
Congress has listed hundreds of such relators. Actually,
each of those could be seen as a field. The thing is that even in a
large bibliographic database there would be very few records where
Data manager (dtm), Former owner (fmo) and
Librettist (lbt) would contain any data.

Now we’ve identified a lot of possible fields to use, for cataloging
and for information retrieval. They have perfectly reasonable use
cases, and all of them are used in everyday library practice, so how
do I get them into my Solr index?

The attempt

We have tried to put such records into Solr. The attempt was
successful. In the rest of this paper I will outline how we did that,
teach you a bit about how to use such an index and finally explain why
we decided not to implement it.

In our experiments we transformed MODS records to nested Solr records,
such as the record below, which is transformed from my fake record above.

[
  {
    "id": "https://example.org/record",
    "described": true,
    "entity_type": "the_object",
    "cataloging_language": "en",
    "record_created": "2022-08-12",
    "tit": [
      {
        "describing": "https://example.org/record",
        "described": false,
        "entity_type": "title main",
        "title": [
          "Enten - eller"
        ],
        "id": "https://example.org/record!disposable!subrecord!d1e21"
      }
    ],
    "aut": [
      {
        "id": "https://example.org/record!disposable!subrecord!d1e30",
        "authority": "https://viaf.org/viaf/7392250/",
        "described": false,
        "describing": "https://example.org/record",
        "language": "en",
        "entity_type": "aut",
        "agent_name": "Kierkegaard Søren (1813/1855)"
      }
    ],
    "visible_date": [
      "1843"
    ]
  }
]
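
The transformation itself is not shown in this text, so here is a rough Python sketch of the idea, not the code we actually ran. It assumes the MODS fragment is wrapped in an md:mods root element, and the subrecord identifiers (t0, n0) are invented for the example; the ones in the record above come from the real transformation.

```python
import xml.etree.ElementTree as ET

NS = {"md": "http://www.loc.gov/mods/v3"}

# The fake record from above, wrapped in an md:mods root (an assumption).
SAMPLE = """<md:mods xmlns:md="http://www.loc.gov/mods/v3">
  <md:titleInfo>
    <md:title>Enten - eller</md:title>
  </md:titleInfo>
  <md:name type="personal" authorityURI="https://viaf.org/viaf/7392250/">
    <md:namePart type="family">Kierkegaard</md:namePart>
    <md:namePart type="given">Søren</md:namePart>
    <md:namePart type="date">1813/1855</md:namePart>
    <md:role><md:roleTerm type="code">aut</md:roleTerm></md:role>
  </md:name>
</md:mods>"""


def mods_to_nested(xml_text, record_id):
    """Build one parent document with title and agent child documents."""
    root = ET.fromstring(xml_text)
    doc = {"id": record_id, "described": True, "entity_type": "the_object"}

    def child(suffix, entity_type, **fields):
        # Every child points back to the parent it describes.
        return {"id": f"{record_id}!disposable!subrecord!{suffix}",
                "describing": record_id, "described": False,
                "entity_type": entity_type, **fields}

    for i, title in enumerate(root.findall("md:titleInfo/md:title", NS)):
        doc.setdefault("tit", []).append(
            child(f"t{i}", "title main", title=[title.text]))

    for i, name in enumerate(root.findall("md:name", NS)):
        role = name.findtext("md:role/md:roleTerm", namespaces=NS)
        if role is None:
            continue  # names without a relator code are skipped in this sketch
        parts = {p.get("type"): p.text
                 for p in name.findall("md:namePart", NS)}
        agent = (f"{parts.get('family', '')} {parts.get('given', '')} "
                 f"({parts.get('date', '')})")
        doc.setdefault(role, []).append(
            child(f"n{i}", role, authority=name.get("authorityURI"),
                  agent_name=agent))
    return doc


record = mods_to_nested(SAMPLE, "https://example.org/record")
print(record["aut"][0]["agent_name"])
```

Note how the relator code becomes the name of the field holding the child documents, which is what makes every relator a potential field.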

If you are familiar with the workings of Solr, you know that the
data model (if I may label it as such) is configured in a file
called schema.xml. It basically contains a list of fields that can be
used in what is referred to as Solr documents. In such a schema you
may add

    <field     name="_nest_path_" type="_nest_path_" stored="true" indexed="true" />
    <field     name="_nest_parent_" type="string" indexed="true" stored="true" />

the former of which is of the following type:

    <fieldType name="_nest_path_" class="solr.NestPathField" />

See the Solr Indexing Nested Child Documents
documentation.

The nested indexing works since the indexer stores an XPath-like
path for each record, making it possible to track which Solr document
is the parent and which documents are its children. That
information is in the _nest_path_ field, and Solr fills it in
automatically whenever it starts a new document inside a parent one.

You will get that information back from the server if you add a Solr
field list argument (fl) at search time

fl=*,[child]

That is straightforward. The problem is then to make Solr search in
the child documents and return the parent or root document.

{!parent which="described:true"}{!edismax v="agent_name:(Kierkegaard Søren) AND entity_type:aut"}
AND
{!parent which="described:true"}{!edismax v="title:(Enten - eller) AND entity_type:tit"}

The constructs {!parent ... } and {!edismax ... } are so-called
local parameters in a Solr request. The former specifies that we want
Solr to return parent documents, i.e., those matching described:true; the
latter tells Solr we want the author to be Søren and the title to be Enten –
eller. Now we can reasonably easily search and retrieve information on
the Etcher (etr) and Dancer (dnc), when applicable.
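
Nobody wants to type such queries by hand. The Python sketch below only assembles the query string shown above, together with the fl argument; the helper names are mine, it assumes the search words contain no double quotes, and it has not been run against a Solr server.

```python
from urllib.parse import urlencode

# Local parameter selecting the parent documents, as in the query above.
PARENT_FILTER = '{!parent which="described:true"}'


def child_clause(field, words, entity_type):
    """One block-join clause: match child documents, return their parents."""
    inner = f"{field}:({words}) AND entity_type:{entity_type}"
    return PARENT_FILTER + '{!edismax v="' + inner + '"}'


def nested_query_params(clauses):
    """Request parameters: the combined query plus nested children in the result."""
    return {"q": " AND ".join(clauses), "fl": "*,[child]"}


params = nested_query_params([
    child_clause("agent_name", "Kierkegaard Søren", "aut"),
    child_clause("title", "Enten - eller", "tit"),
])
print(params["q"])
print(urlencode(params))  # ready to append to a select URL
```

Even wrapped in helper functions, the syntax remains something you would rather hide from both users and fellow developers.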

This is a special case of join as implemented in Solr. Recall that
joins are at the very core of SQL, and one of the features making
the RDBMS such a powerful tool.

Also recall that I mentioned that my first Solr instructor dissuaded
us from using search engines as data stores. Does that generalize to
other features coming from the database world?

The user problems

I hope I’ve been able to convince you that the fairly complicated
metadata structures used in libraries are useful for patrons and
staff. They were not invented to give software developers gray hair
and make them age prematurely. Also, it is a legitimate use case to be
able to identify the etchers and dancers.

However:

  • We know that the users of our resources are not very
    good at using fields. An interface allowing you to search
    portraiture subjects is a very specialized use case. So is the use
    case of searching for senders and recipients of letters.
  • People do search for words in a title, but they do not search for A life fragment separately from Either/or. Likewise they are not
    particularly interested in distinguishing between Enten – eller and Either/or. If they search for the latter they
    presumably want an English translation, but when studying a detailed
    presentation they are almost certainly interested to know that
    Either/or is actually a translation.

You know, each performance of Весна священная (AKA The Rite of
Spring) has a conductor, director and choreographer and a lot of
dancers, obviously in addition to Стравинский, Игорь Фёдорович (AKA
Igor Stravinsky, the composer). I could go on here. You could add from your own
experience.

To make a useful service we have to aggregate data into reasonable
headlines. The Dublin Core Metadata Initiative actually has a name for
this: the Dumb-Down Principle.

The developer problems

From the developer’s point of view, metadata dumb-down can take place
either (i) when indexing or (ii) when searching.

In either case, for a ballet performance we would dumb-down Composer (cmp), Conductor
(cnd), Director (drt) and Choreographer (chr) to one single
repeatable field, creator. It
would contain Igor Stravinsky (the transcribed form, but perhaps also his name
in Cyrillic), and obviously all other creatives. Most of the dancers would
most likely go to the contributor field.

Doing dumb-down at indexing would mean creating the fields creator and
contributor in the index. Doing it when searching would imply
using the horrendous search syntax presented above, and then you would
have to do the same for title and other relevant fields.
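
As an illustration of dumb-down at indexing, here is a small Python sketch. The set of relator codes counted as creators is my assumption for this example (the codes mentioned in this text), and the names in the usage lines are placeholders, apart from Stravinsky.

```python
# Relator codes that end up in the creator field (an assumption for this sketch);
# every other code, such as dnc (Dancer), goes to contributor.
CREATOR_ROLES = {"aut", "cmp", "cnd", "drt", "chr"}


def dumb_down(agents):
    """Collapse (relator_code, name) pairs into creator and contributor fields."""
    record = {"creator": [], "contributor": []}
    for role, name in agents:
        field = "creator" if role in CREATOR_ROLES else "contributor"
        if name not in record[field]:  # keep the repeatable field free of duplicates
            record[field].append(name)
    return record


performance = [
    ("cmp", "Igor Stravinsky"),
    ("cnd", "A Conductor"),   # placeholder name
    ("dnc", "A Dancer"),      # placeholder name
]
print(dumb_down(performance))
```

The relator code is thrown away in the process, which is exactly the information loss that dumb-down implies.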

In the case of Either/or, Enten – eller, the dumbed-down Solr record
would look something like the record below:

[
  {
    "id": "https:!!example.org!record",
    "title": [
      "Enten - eller"
    ],
    "creator": [
      "Kierkegaard, Søren 1813/1855"
    ],
    "record_created": "2022-08-12",
    "visible_date": [
      "1843"
    ],
    "original_object_identifier": [],
    "pages": []
  }
]

Hence when indexing we only create one record, and no joins are
needed. A query could be

creator:kierkegaard AND title:(enten eller)

The drawback is that in the index we cannot tell the
difference between Igor Stravinsky (cmp) and the Conductor
(cnd). Both are creators. The dumbed-down index has lost most of the
information you need to decide whether you want to listen to an album
or see a performance.

To summarize:

  1. At indexing: Your search syntax is nice and clean. You have to use
    some other method to present the data in the detailed view.
  2. At search: Your search syntax is very complicated. On the other
    hand, you have all the data needed for the detailed view.
  3. At a practical level, nested documents in Solr seem more or less
    experimental, and the documentation is less than excellent. Only
    the lucene query parser
    supports them, and when searching with (for example) the edismax query
    parser you run into the syntactic problem with local parameters
    demonstrated above.

If we are to describe the situation in Model-View-Controller (MVC)
terms, the second (i.e., the at-search implementation) looks nice: one
model, one controller but (perhaps) two views. When doing it at
indexing, we need two models, and an architecture diagram might look
much messier. The semantic exercise of designing the dumb-down scheme
might seem complicated. The code, however, is much simplified.

The fact that each substructure in the nested Solr document
must follow the same schema is an annoying feature. It isn’t
critical, but the fact that persons, subjects and whatever else all have
the same content model (in the sense of an XML DTD or Schema) makes the
setup much less attractive.

Finally, it is my experience that it is easier to accommodate multiple
metadata models and standards in the same index with dumb-down at
indexing. In our case we opted for transforming our MODS records to
schema.org for the detailed presentation. Hence,
retrieval will be from a separate datastore. The schema.org ontology
is rich enough for our landing pages and detailed result sets. It
provides an extra bonus, we hope, in that Google would actually be
able to index our collection.

The only advantage I can see with at search time dumb-down is that we
would have only a single model in our search application.

Conclusion

In the end, after some weeks of work, we threw out our nested indexing
stuff, and most likely we threw out some baby we were not aware of
with the bathwater. Be that as it may, we opted for an easy format for
search, while retaining interoperability for other uses.

Library patrons have more needs than resource discovery. Some use APIs
for study, research or for services of their own. The search index,
the schema.org data and the original MODS records will eventually be available for such
purposes. It could be that a nested index could actually be useful for
such users.

Posted in sigge, Solr, usability

SolrWayback 4.0 release! What’s it all about? Part 2


In this blog post I will go into the more technical details of SolrWayback and the new version 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first. It has beautiful animated GIFs demonstrating most of the features in SolrWayback.

Live demo of SolrWayback

You can access a live demo of SolrWayback here.
Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!

Back in 2018…

The open source SolrWayback project was created in 2018 as an alternative to the existing webarchive frontend applications at that time. At the Royal Danish Library we were already using Blacklight as search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties such as Solr server URL, fields and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, the binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr has enough information to also work as a CDX server.

Another interesting frontend was the Shine frontend. It was custom tailored for the Solr index created with WARC-indexer and had features such as trend analysis (n-gram) visualization of search results over time. The showstopper was that Shine was using an older version of the Play framework, and the latest version of the Play framework was not backwards compatible with the maintained branch of the Play framework. Upgrading was far from trivial and would require a major rewrite of the application. Adding to that, the frontend developers had years of experience with the larger, more widely used pure JavaScript frameworks. The weapon of choice by the frontenders for SolrWayback was the VUE JS framework. Both SolrWayback 3.0 and the new rewritten SolrWayback 4.0 had the frontend developed in VUE JS. If you have skills in VUE JS and an interest in SolrWayback, your collaboration will be appreciated 🙂

WARC-Indexer. Where the magic happens!

WARC-files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits this into up to 60 different fields. It uses Tika to parse all the different MIME types that can be encountered in WARC-files. Tika extracts the text from HTML, PDF, Excel, Word documents etc. It also extracts metadata from binary documents if present. The metadata can include created/modified time, title, description, author etc. For images, the metadata can also include width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC-file there is a record in Solr. This also includes empty records such as HTTP 302 (MOVED) with information about the new URL.

WARC-Indexer. Paying the price up front…

Indexing a large amount of warc-files requires massive amounts of CPU, but is easily parallelized as the warc-indexer takes a single warc-file as input. To give an idea of the requirements, indexing 700 TB (5.5M WARC files) of warc-files took 3 months using 280 CPUs.
When the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback when using SolrWayback on large collections: The Warc-files have to be indexed first.
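
To make these figures more tangible, the numbers above can be turned into per-file estimates. This is back-of-the-envelope arithmetic only, assuming decimal terabytes, that "3 months" means roughly 90 days, and that all 280 CPUs were busy the whole time.

```python
TOTAL_BYTES = 700e12   # 700 TB, decimal terabytes assumed
WARC_FILES = 5.5e6     # 5.5M WARC files
CPUS = 280
DAYS = 90              # "3 months" taken as 90 days (assumption)

avg_warc_mb = TOTAL_BYTES / WARC_FILES / 1e6        # average WARC file size
cpu_hours = CPUS * DAYS * 24                        # total CPU time spent
cpu_minutes_per_warc = cpu_hours * 60 / WARC_FILES  # CPU time per WARC file
gb_per_cpu_hour = TOTAL_BYTES / 1e9 / cpu_hours     # throughput of one CPU

print(f"average WARC size:     {avg_warc_mb:.0f} MB")
print(f"total CPU time:        {cpu_hours} CPU-hours")
print(f"CPU time per WARC:     {cpu_minutes_per_warc:.1f} minutes")
print(f"throughput per CPU:    {gb_per_cpu_hour:.2f} GB/hour")
```

Under these assumptions a single CPU handles a bit more than one gigabyte of WARC data per hour, which is why adding CPUs is the only practical way to shorten the initial indexing.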

Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.
Due to the amazing performance of Solr, the query is often performed in less than 2 seconds in a collection with 32 billion (32*10⁹) documents and this includes facets. The search results are not limited to HTML pages where the freetext is found, but every document that matches the search query. When presenting the results each document type has custom display for that mime-type.
HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player or downloaded if the browser does not support that format.

Solr. Reaping the benefits from the Warc-indexer

The SolrWayback java-backend offers a lot more than just sending queries to Solr and returning them to the frontend. Methods can aggregate data from multiple Solr queries or directly read WARC entries and return the processed data in a simple format to the frontend. Instead of re-parsing the warc files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds/minutes instead of weeks.

See the frontend blog post for more feature examples.

Wordcloud
Generating a wordcloud image is done by extracting text from 1000 random HTML pages from the domain and generating a wordcloud from the extracted text.
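
The counting step behind such a wordcloud can be sketched in a few lines of Python. This is not the SolrWayback implementation: the pages are plain strings in a list here, whereas the real text comes from Solr, and the function name and tokenization are assumptions.

```python
import random
import re
from collections import Counter


def word_frequencies(pages, sample_size=1000, seed=None):
    """Count words in a random sample of page texts."""
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    counts = Counter()
    for text in sample:
        # Lowercase and split on non-word characters (a simplistic tokenizer).
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


pages = ["Cats and dogs", "cats chase mice", "Dogs sleep"]
print(word_frequencies(pages).most_common(3))
```

The resulting frequencies would then be handed to whatever component draws the actual image.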

Interactive linkgraph
By extracting domains that link to a given domain (A) and also extracting outgoing links from that domain (A) you can build a link-graph. Repeating this for new domains found gives you a two-level local linkgraph for the domain (A). Even though this can be 100s of separate Solr queries it is still done in seconds on a large corpus. Clicking a domain will highlight neighbors in the graph.
(Try demo:interactive linkgraph)
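
The two-level expansion described above can be sketched as follows. In this Python example the links live in a small made-up dictionary, and the two lookup functions stand in for what would be Solr queries in SolrWayback; all names are hypothetical.

```python
# Hypothetical link data: domain -> set of domains it links to.
LINKS = {
    "a.dk": {"b.dk", "c.dk"},
    "b.dk": {"a.dk", "d.dk"},
    "c.dk": set(),
    "d.dk": {"f.dk"},
    "e.dk": {"a.dk"},
}


def outgoing(domain):
    """Domains that `domain` links to (one query in a real setup)."""
    return LINKS.get(domain, set())


def ingoing(domain):
    """Domains that link to `domain` (another query in a real setup)."""
    return {src for src, targets in LINKS.items() if domain in targets}


def local_linkgraph(start, levels=2):
    """Collect (source, target) edges around `start`, expanding `levels` times."""
    edges, seen, frontier = set(), {start}, {start}
    for _ in range(levels):
        found = set()
        for domain in frontier:
            for target in outgoing(domain):
                edges.add((domain, target))
                found.add(target)
            for source in ingoing(domain):
                edges.add((source, domain))
                found.add(source)
        frontier = found - seen   # only expand domains not visited before
        seen |= frontier
    return edges


print(sorted(local_linkgraph("a.dk")))
```

Each domain in the frontier costs two lookups, which is how a graph of modest size quickly adds up to hundreds of queries.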

Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours. Link graph example from the Danish NetArchive.
The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter.
The link-graphs can be exported fast as all links (a href) for each HTML-record are extracted and indexed as part of the corresponding Solr document.

Image search
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with image links on that page without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that text on the HTML page relates to the images, you can find images that match the query. If you search for “Cats” in HTML pages, the results found will most likely show pictures of cats. The pictures could not be found by just searching for the image documents if no metadata (or image name) has “Cats” as part of it.

CSV Stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all possible 60 Solr fields for each result, you can custom pick which fields to export. This CSV export has been used by several researchers at the Royal Danish Library already and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.

WARC corpus extraction
Besides CSV export, you can also export a result to a WARC-file. The export will read the WARC-entry for each document in the result set, copy the WARC-header + HTTP-header + payload, and create a new WARC-file with all results combined.
Extracting a sub-corpus this easily has already proven to be extremely useful for researchers. Examples include extracting a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-1 mapping from the result in Solr to the entries in the WARC-files.
SolrWayback can also perform an extended WARC-export which will include all resources(js/css/images) for every HTML page in the export.
The extended export ensures that playback will also work for the sub-corpus. Since the exported Warc file can become very large, you can use a WARC splitter tool or just split up the export in smaller batches by adding crawl year/month to the query etc. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it can not be tested live.
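
Splitting an export into yearly batches amounts to running the same query once per year. A minimal Python sketch is shown below; the field name crawl_year and the example query are assumptions about the index schema, and the function name is made up.

```python
def batched_queries(base_query, first_year, last_year, field="crawl_year"):
    """One query per crawl year, each restricting the base query to that year."""
    return [f"({base_query}) AND {field}:{year}"
            for year in range(first_year, last_year + 1)]


for query in batched_queries("domain:example.dk", 2019, 2021):
    print(query)
```

Each of the resulting queries can then be exported to its own, smaller WARC-file.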

SolrWayback playback engine

SolrWayback has a built-in playback engine, but using it is optional and SolrWayback can be configured to use any other playback engine that uses the same URL API for playback, “/server/<date>/<url>”, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use the SolrWayback playback engine. The demo at National Széchényi Library has configured PyWb as an alternative playback engine. Clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.

Playback quality

The playback quality of SolrWayback is an improvement over OpenWayback for the Danish Netarchive, but not as good as PyWb. The technique used is URL rewriting, just as in PyWb, and replaces URLs according to the HTML specification for HTML pages and CSS files. However, SolrWayback does not yet replace links generated from JavaScript, but this will most likely be improved in a coming major release. It has not been a priority since the content for the Danish NetArchive is harvested with Heritrix and the dynamic JavaScript resources are not harvested by Heritrix.

This is only a problem for absolute links, i.e., those starting with http://domain/…, since all relative URL paths will be resolved automatically due to the URL playback API. Relative links that refer to the root of the playback server will also be resolved by the SolrWaybackRootProxy application, which has this sole purpose. It calculates the correct URL from the HTTP referer header and redirects back into SolrWayback. Absolute URLs from JavaScript (or dynamic JavaScript) can result in live leaks. This can be avoided by an HTTP proxy or just adding a white list of URLs to the browser. In the Danish Citrix production environment, live leaks are blocked by sandboxing the environment. Improving playback is in the pipeline.

The SolrWayback playback has been designed to be as authentic as possible without showing a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included by the HTML page along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.

The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have 100s of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched as a single Solr query, which improves both performance and scalability.
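
The "nearest version" rule can be illustrated with a short Python sketch. It works on an in-memory list of captures with made-up file names and offsets, whereas SolrWayback gets the candidates from Solr; the function names are hypothetical.

```python
from datetime import datetime


def nearest_capture(captures, page_time):
    """Pick the capture whose crawl time is closest to the page's crawl time."""
    return min(captures, key=lambda c: abs(c["crawl_time"] - page_time))


def resolve_resources(captures_by_url, page_time):
    """Resolve every resource URL of a page to a (WARC file, offset) pair."""
    resolved = {}
    for url, captures in captures_by_url.items():
        best = nearest_capture(captures, page_time)
        resolved[url] = (best["warc"], best["offset"])
    return resolved


# Made-up captures of one resource, harvested three times.
captures_by_url = {
    "http://example.dk/logo.png": [
        {"crawl_time": datetime(2019, 1, 1), "warc": "a.warc.gz", "offset": 100},
        {"crawl_time": datetime(2020, 6, 1), "warc": "b.warc.gz", "offset": 200},
        {"crawl_time": datetime(2021, 3, 1), "warc": "c.warc.gz", "offset": 300},
    ],
}
print(resolve_resources(captures_by_url, datetime(2020, 5, 20)))
```

Note that the nearest capture may be later than the page itself, as in this example, where the version from June 2020 is chosen for a page harvested in May 2020.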

SolrWayback and Scalability

For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still rapidly improving with each new version. Storing the indexes on SSD gives a substantial performance boost as well, but can be costly. The Danish Netarchive has 126 Solr servers running in a SolrCloud setup.

One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services. If the master server also had an index there would be an overhead. 112 of the Solr servers have a 900 GB index with an average of ~300M documents, while the last 13 servers currently have empty indexes, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, the query response times are below 2 seconds. The result query and the facet query are separate simultaneous calls; the advantage is that the results can be rendered very fast while the facets finish loading later.

For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should be more precise in limiting the results up front.
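
The idea of issuing the result query and the facet query as separate, simultaneous calls can be sketched with two threads. In this Python example the two fetch functions are stubs returning canned data; in a real application they would each send a request to Solr, and the names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_results(query, start):
    """Stub for the (fast) result query."""
    return [f"{query}-doc-{n}" for n in range(start, start + 3)]


def fetch_facets(query):
    """Stub for the (slower) facet query."""
    return {"content_type": {"html": 2, "image": 1}}


def search(query, start=0):
    """Start both calls at once; hand over results first, facets afterwards."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.submit(fetch_results, query, start)
        facets = pool.submit(fetch_facets, query)
        yield "results", results.result()  # can be rendered right away
        yield "facets", facets.result()    # arrives when the facet call is done


for kind, payload in search("cats"):
    print(kind, payload)
```

Because the facets do not depend on the page offset, paging through results only needs the first of the two calls.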

Building new shards
Building new shards (collection pieces) is done outside the production environment, and a shard is moved onto one of the empty Solr servers when the index reaches ~900GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small performance improvement in query times. If the indexing was done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would have faced otherwise. It also makes speeding up the index building trivial by assigning more machines/CPUs to the task and creating multiple indexes at once.
You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at that time to be a ~900GB index size, which could fit on the 932GB SSDs that were available to us when the servers were built. A larger index also requires more memory for each Solr server, and we have allocated 8 GB of memory to each. For our large scale webarchive we keep track of which WARC files have been indexed using Archon and Arctika.

Archon is the central server with a database. It keeps track of all WARC files, whether they have been indexed, and into which shard number.

Arctika is a small workflow application that starts WARC-indexer jobs, queries Archon for the next WARC file to process, and reports back when it has been completed.
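
The division of labour between a central tracker and a worker can be sketched as below. This Python example is only a toy model of the bookkeeping: the real Archon uses a database and is reached over the network, and the class, method names and file names here are invented. It assumes the worker asks the tracker for the next file and reports back when done.

```python
class Tracker:
    """Toy stand-in for the central tracker of WARC files and shards."""

    def __init__(self, warc_files):
        self._shard = {name: None for name in warc_files}  # None = not indexed
        self._in_progress = set()

    def next_warc(self):
        """Hand out the next WARC file that is neither indexed nor in progress."""
        for name, shard in self._shard.items():
            if shard is None and name not in self._in_progress:
                self._in_progress.add(name)
                return name
        return None

    def mark_indexed(self, name, shard):
        """Record that `name` has been indexed into shard number `shard`."""
        self._in_progress.discard(name)
        self._shard[name] = shard

    def shard_of(self, name):
        return self._shard[name]


def run_worker(tracker, shard, index_warc):
    """Toy worker: fetch, index and report until no files are left."""
    done = 0
    while (name := tracker.next_warc()) is not None:
        index_warc(name)               # would start a WARC-indexer job
        tracker.mark_indexed(name, shard)
        done += 1
    return done


tracker = Tracker(["a.warc.gz", "b.warc.gz", "c.warc.gz"])
print(run_worker(tracker, shard=7, index_warc=print))
```

Keeping the state in one place is what allows several workers on different machines to build shards in parallel without indexing the same file twice.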

SolrWayback – framework

SolrWayback is a single Java Web application containing both the VUE frontend and Java backend. The backend has two Rest service interfaces written with Jax-Rs. One is responsible for services called by the VUE frontend and the other handles playback logic.

SolrWayback software bundle

SolrWayback comes with an out of the box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server and a workflow for indexing. All products are configured. All that is required is unzipping the zip file and copying the two property files to your home directory. Add some WARC-files yourself and start the indexing job.

Try: SolrWayback Software bundle

Posted in open source, Solr, Visualization, Web

SolrWayback 4.0 release! What’s it all about?

So, it’s finally here! SolrWayback 4.0 was released December 20th, after an intense development period. In this blog post, we’ll give you a nice little overview of the changes we made, some of the improvements and some of the added functionality that we’re very proud of having released. So let’s dig in!

A small intro – What is SolrWayback really?

As the name implies, SolrWayback is a fusion of discovery (Solr) and playback (Wayback) functionality. Besides full-text search, Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers the feature set is continuously expanding with aggregation, visualization and extraction of data.

SolrWayback relies on real time access to WARC files and a Solr index populated by the UKWA webarchive-discovery tool. The basic workflow is:

  • Amass a collection of WARCs (using Heritrix, wget, ArchiveIT…) and put them on live storage
  • Analyze and process the WARCs using webarchive-discovery. Depending on the amount of WARCS, this can be a fairly heavy job: Processing ½ petabyte of WARCs at the Royal Danish Library took 40+ CPU-years
  • Index the result from webarchive-discovery into Solr. For non-small collections, this means SolrCloud and Solid State Drives. A rule of thumb is that the index takes up about 5-10% of the size of the compressed WARCs
  • Connect SolrWayback to the WARC storage and the Solr index
A small visual illustration of the components used for SolrWayback.

Live demo

Try the live demo provided by the National Széchényi Library, Hungary (thanks!).

Helicopter view: What happened to SolrWayback

We decided to give the SolrWayback a complete makeover, making the interface more coherent, the design more stylish, and the information architecture better structured. At first glance, not much has changed apart from an update on the color scheme, but looking closely, we’ve added some new functionality, and grouped some of the existing features in a new, and much improved, way.

The new interface of SolrWayback.

The search page is still the same, and after searching you’ll still see all the results lined in a nice single column. We’ve added some more functionality up front, giving you the opportunity to see the WARC header for a single post, as well as selecting an alternative playback engine for the post. Some of the more noticeable reworks and optimizations are highlighted in the section below.

Faster load times

We’ve done some work under the hood too, to make the application run faster. Many of our calls to the backend have been reworked into individual calls that are only made when needed. This means that facet calls are now made as a separate call to the backend instead of being bundled with the query. So when you’re paging results, we only request the results, giving us a faster response, since the facets stay the same. The same principle has been applied to loading images and individual post data.
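The split between result calls and facet calls can be sketched as two independent parameter sets. This is a minimal illustration, not SolrWayback’s actual frontend code: the function names are hypothetical, and it assumes the backend forwards standard Solr request parameters (q, start, rows, facet, facet.field).

```javascript
// Hypothetical helpers. The parameter names are standard Solr request
// parameters; how SolrWayback's own backend API names them is an assumption.

// Paging only needs documents, so faceting is switched off.
function resultParams(query, start, rows = 20) {
  const params = new URLSearchParams({
    q: query,
    start: String(start),
    rows: String(rows),
    facet: "false",
    wt: "json",
  });
  return params.toString();
}

// Facets are requested once per query, with rows=0 so no documents are returned.
function facetParams(query, facetFields) {
  const params = new URLSearchParams({ q: query, rows: "0", facet: "true", wt: "json" });
  for (const field of facetFields) {
    params.append("facet.field", field);
  }
  return params.toString();
}

console.log(resultParams("facebook.com", 40));
console.log(facetParams("facebook.com", ["domain", "crawl_year"]));
```

With this shape, pressing “Next” only triggers the first kind of request; the facet request is repeated only when the query or filters change.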

GUI polished

As mentioned, we’ve done some cleanup in the interface, making it easier to navigate. The search field has been reworked to serve the many needs. It will expand if the query spans multiple lines (add a line break with SHIFT+Enter), making large and complex queries much easier to manage. We’ve even added context-sensitive help, so if you’re making queries with boolean operators or similar, SolrWayback tells you whether the syntax is correct.
We’ve kept the most used features upfront, with image and URL search readily available from the get-go. The same goes for the option to group the search results to avoid URL duplicates.
Below the line are some of the other features not directly linked to the query field, but nice to have upfront: searching with an uploaded file, searching by GPS, and the toolbox containing many of the tools that can help you gain insight into the archive by generating Wordclouds or link graphs, searching through the Ngram interface and much more.

The nifty helper when making complex queries for SolrWayback.

Image searching by location rethought

We’ve reworked the way to search and look through the results when searching by GPS coordinates. We’ve made it easy to search for a specific location, and we’ve grouped the results so that they are easier to interpret.

The new and improved location search interface. Images intentionally blurred.

Zooming into the map will expand the places where images are clustered. Furthermore, we realize that sometimes the need is to look through all the images regardless of their exact position, so we’ve made a split screen that can expand either way, depending on your needs. It’s still possible to do a new search based on any of the found images in the list.

Elaborated export options

We’ve added more functionality to the export options. It’s possible to export both fields from the full search result and the raw WARC records for the search result, if enabled in the settings. You can even decide the format of your export and we’ve added an option to select exactly which fields in the search result you want exported – so if you want to leave out some stuff, that is now possible!

Quickly move through your archive

The standard search UI is pretty much as you are accustomed to, but we made an effort to keep things simple and clean as well as facilitating in-depth research and tracking of subject interests. In the search results you get a basic outline of metadata on each post. You can narrow your search with the provided facet filters. When expanding a post you get access to all metadata, and every field has a link if you wish to explore a particular angle related to your post. So you can quickly navigate the archive by starting wide, filtering, and afterwards drilling down to find related material.

Visualization of search result by domain

We’ve also made it very easy to quickly get an overview of the results. When clicking the icon in the results headline, you get a complete overview of the different domains in the results, and how large a portion of the search result they account for in each year. This is a very neat way to get an overview of the results, and the relative distribution by year.

The toolbox

With quick access from right under the search box, we have gathered a Toolbox with utilities for further data exploration. In the following we will give you a quick tour of the updates and new functionality in this section.

Linkgraph, domain stats and wordcloud

Linkgraph.
Domain stats.
Wordcloud.

We reworked the Linkgraph, the Wordcloud and the Domain stats components a little, adding some more interaction to the graph and domain stats, and polished the interface for all of them. For the Linkgraph, it is now possible to highlight certain sections within the graph, making it much easier to navigate the sometimes rather large cluster and to look at the connections you find relevant. These tools now provide an easy and quick way to gain a deeper insight into specific domains and what content they hold.

Ngram

We are so pleased to finally be able to supply a classical Ngram search tool complete with graphs and all. In this version you are able to search through the entire HTML content of your archive and see how the results are distributed over time (harvest time). You can even do comparisons by providing several queries sequentially and see how they compare. On every graph the datapoint at each year is clickable and will trigger a search for the underlying results which is a very handy feature for checking the context and further exploring underlying data. Oh and before we forget – if things get a little crowded in the graph area you can always click on the nicely colored labels at the top of the chart and deselect/select each query.

The ngram interface.
The evolution of the blink tag.

If the HTML content isn’t really your thing but your passion lies within the HTML tags themselves, we’ve got you covered. Just flip the radio button under the search box over to HTML-tags in HTML-pages and you will have all the same features listed above, but now the underlying data will be the HTML tags themselves. As easy as that, you will finally be able to get answers to important questions like ‘when did we actually start to frown upon the blink tag?’

The export functionality for Ngram.

Gephi Export

The possibility to export a query in a format that can be used in Gephi is still present in the new version of SolrWayback. This will allow you to create some very nice visual graphs that can help you explore how exactly a collection of results is tied together. If you’re interested in this, feel free to visit the labs website about Gephi graphs, where we’ve showcased some of the possibilities of using Gephi.

Tools for the playback

SolrWayback comes with a built-in playback engine, but can be configured to use another playback engine such as PyWb. The SolrWayback playback viewer shows a small toolbar overlay on the page that can be opened or hidden. When the toolbar is hidden, the page is displayed without any frame/top-toolbar etc. to show the page exactly as it was harvested.

The menu when you access the individual search results.

When you have clicked a specific result, you’re taken to the harvested resource. If it is a website, you will be shown a menu to the right, giving you some more ways to analyse the resource. This menu is hidden in the upper left corner when you enter, but can be expanded by clicking on it.

The harvest calendar will give you a very smooth overview of the harvest times of the resource, so you can easily see when, and how often, the resource has been harvested in the current index. This gives you an excellent opportunity to look at your index over time, and see how a website evolved.

The date harvest calendar module.

The PWID option lets you export the harvested resource metadata, so you can share what’s in that particular resource in a nice and clean way. The PWID standard is an excellent way to keep track of, and share, resources between researchers, so a list of the exact dataset is preserved – along with all the resources to go with it.

View page resources gives you a clean overview of the contents of the harvested page, along with all the resources. We’ve even added a way to quickly see the difference between the first and the last harvested resource on the page, giving you a quick hint of the contents and if they are all from the same period. You can even see a preview of the page here and download the individual resources from the page, if you wish.

Customization of your local SolrWayback instance

We’ve made it possible to customize your installation to fit your needs. The logo can be changed, the about text can be changed, and you can even customize your search guidelines, if you need to. This gives you a chance to make the instance your own, so that people can recognize when they are using your instance of SolrWayback, and it can now reflect your organisation and the people who are contributing to it.

The future of the SolrWayback

This is just the beginning for SolrWayback. Further down the road, we hope to add even more functionality that can help you dig deeper into the archives. One of our main goals is to provide you with the tools necessary to understand and analyse the vast amounts of data that lie in most of the archives that SolrWayback is designed for. We already have a few ideas as to what could be useful, but if you have any suggestions for tools that might be helpful, feel free to reach out to us.

And a teaser: Thomas Egense is currently writing a blog post that focuses on the more technical aspects of SolrWayback. Stay tuned!

Posted in Frontend, Web

Which type bug?

A light tale of bug hunting an Out Of Memory problem with SolrCloud.

The setup and the problem

At the Royal Danish Library we provide full text search for the Danish Netarchive. The heavy lifting is done in a single collection SolrCloud made up of 107 shards (for a total of 94TB / 32 billion documents). All queries are issued to a Solr instance with an empty shard, with the sole responsibility of aggregating responses from the real shards.

One of the frontends is SolrWayback, which is a JavaScript application backed by a middle layer acting as an advanced proxy: issuing searches, rewriting HTML, doing streaming exports and so on.

The problem this time was that the aggregating Solr node occasionally crashed with an Out Of Memory error, where occasionally means that it sometimes took months to crash, sometimes days.

Clues and analysis

  • Access to the Netarchive Search is strictly controlled, so there was no chance of denial of service or similar foul play.
  • Log analysis showed modest activity (a maximum of 9 concurrent searches) around the time of the latest crash.
  • The queries themselves were run-of-the-mill, but the crashing queries themselves were not logged, as Solr only logs the query when it has been completed, not when it starts.
  • The Garbage Collection logs showed that everything was a-ok, right up until the time when everything exploded in progressively longer collections, culminating in a 29 second stop-the-World and no heap space left.
Heap graph with stop-the-world GC as red triangles, courtesy of gceasy.io

Should be simple to pinpoint, right? And (plot twist) for once it was! Of course we chased the usual red herrings, but ultimately “dissect the logs around the problematic time slots” won the day.

Pop quiz: What is wrong with the log entries below? (meticulously unearthed from too many nearly-but-not-fully-similar entries and with timestamps adjusted to match graph-timezone).

1) 2020-06-04 19:29:06.285 INFO (qtp1908316405-618188) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=100&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2015"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=53532331 status=0 QTime=2181
2) 2020-06-04 19:33:32.418 INFO (qtp1908316405-619134) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=10020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2015"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=53527106 status=0 QTime=6958
3) 2020-06-05 20:33:26.204 INFO (qtp1908316405-639768) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=10020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2017"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=3785666 status=0 QTime=3650
4) 2020-06-05 20:34:36.078 INFO (qtp1908316405-641342) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=1002020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2017"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=3781705 status=0 QTime=39489
5) 2020-06-05 20:43:25.303 INFO (qtp1908316405-639769) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=facebook.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=1002020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"facebook.com"+AND+crawl_year:"2018"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=15355247 status=0 QTime=166414

If your answer was “Hey, what’s up with start!?” then you are now officially a Big Search Analyst. Your badge will arrive shortly.

For those not catching it (that included me for a long time):

  1. A search is issued with a query for facebook material from 2015 with the parameters start=100&rows=20 (corresponding to page 6 in a UI which shows 20 results/page). Response time is 2 seconds.
  2. The same query is repeated, this time with start=10020&rows=20. If the intent was to go to page 7 in the UI, we would expect start=120&rows=20. Response time is 7 seconds.
  3. The query is changed to facebook material from 2017, still with start=10020&rows=20. Seems like someone’s URL hacking. Response time is 3½ seconds.
  4. Same query as in #3, but now with start=1002020&rows=20. Response time jumps to 39 seconds.
  5. The query is changed to facebook material from 2018, with the previous start=1002020&rows=20 intact. Response time jumps to 166 seconds.

Locating the error

Time to inspect the code responsible for the paging:

if (this.start + 20 < this.totalHits) {
    this.start = this.start + 20;
}

Seems innocent enough and when we tested by pressing “Next” a few times in SolrWayback, it did indeed behave in exemplary fashion: start=0, start=20, start=40 and so on.

Looking further down we encounter this nugget:

/* Method used on creation, reload and route change to get query parameters */
getQueryparams:function(){
    this.myQuery = this.$route.query.query;
    this.start= this.$route.query.start;
    this.filters = this.$route.query.filter;
    ...

A quick application of console.log(typeof this.start) in the right place tells us that when the UI page is reloaded, which happens when the URL is changed by hand, the type of this.start becomes a string!

Loosely typed languages are a taste not acquired by your humble author.

Back to the code for skipping to the next page: this.start = this.start + 20;
If this.start is 100 to begin with and it is a string, we suddenly have "100" + 20, which JavaScript handles by converting the number 20 to the string "20": "100" + "20" = "10020". That translates to page 502 instead of page 7, which of course is not what the user wants, but how does it become a memory problem?
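The concatenation is easy to reproduce in isolation. The snippet below is a minimal sketch with made-up variable names; it only assumes what the post states, namely that a value read from the URL arrives as a string.

```javascript
// A value read from the route/query string is a string, not a number.
const fromUrl = "100";

// With a string on the left, "+" concatenates instead of adding.
const buggy = fromUrl + 20;   // "10020"
const again = buggy + 20;     // "1002020"

// Parsing to an integer first restores the intended arithmetic.
const fixed = Number.parseInt(fromUrl, 10) + 20;   // 120

console.log(typeof buggy, buggy, again, fixed);
```

Each further press of “Next” appends another "20", which is why start grows by a factor of 100 per click rather than by 20.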

SolrCloud internals and the smoking gun

The SolrCloud for Netarchive Search is a distributed one (remember the 107 shards?), so when 20 documents starting at position 10020 are needed, the master must request start=0&rows=10040 document representations from each shard, sort them and deliver documents 10020-10039. For our setup that means holding up to 10040*107 = 1 million document representations in memory.
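The arithmetic in the paragraph above can be written out as a small function. This is a back-of-the-envelope sketch: the function name is hypothetical, and the result is an upper bound, since a shard cannot return more candidates than it has hits.

```javascript
// Upper bound on the document representations the aggregating node holds for
// one request: every shard returns its own top (start + rows) candidates.
function candidatesHeld(start, rows, shards) {
  return (start + rows) * shards;
}

const SHARDS = 107; // shard count given in the post

for (const start of [100, 10020, 1002020]) {
  console.log(`start=${start}: up to ${candidatesHeld(start, 20, SHARDS)} candidates`);
}
```

For start=10020 this gives 1,074,280, the “1 million” mentioned above. Moving start two decimal places to the left multiplies the bound by roughly 100.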

The master node has one job and this is it, so it handles the load. Yes, it bumps heap requirements temporarily with a gigabyte or two, but that’s okay. It still delivers the result in 7 seconds.

So what happens when the user presses Next again? Yes, "10020" + 20 = "1002020". That’s a factor of 100 right there, as we move 2 decimal places. And the master has -Xmx8g… Fortunately the logged request only matched 15 million documents, so the master Solr got by with a 4GB bump to the heap (the first spike in the graph) at that time.

Knowing what to look for (start=xxxx, where xxxx is at least 4 digits), it is simple to find the last relevant log entry before the crash:
grep "start=[1-9][0-9][0-9][0-9]" solr.log.1

2020-06-08 08:29:43.898 INFO (qtp1908316405-709360) [c:ns0 s:shard1 r:core_node2 x:ns0_shard1_replica_n1] o.a.s.c.S.Request [ns0_shard1_replica_n1] webapp=/solr path=/select params={q=twitter.com&facet.field=domain&facet.field=content_type_norm&facet.field=type&facet.field=crawl_year&facet.field=status_code&facet.field=public_suffix&hl=on&indent=true&fl=id,score,title,hash,source_file_path,source_file_offset,url,url_norm,wayback_date,domain,content_type,crawl_date,content_type_norm,type&start=4020&q.op=AND&fq=record_type:response+OR+record_type:arc&fq=domain:"twitter.com"+AND+content_type_norm:"html"+AND+crawl_year:"2015"&rows=20&wt=json&facet=true&f.crawl_year.facet.limit=100} hits=53598240 status=0 QTime=3835

Here we have start=4020 and 54 million hits. The aggregating Solr died 10 minutes later. $10 says that the request that crashed the master Solr was for the same query, but with start=402020.
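The same log search can be done programmatically when the hit count and response time are wanted alongside start. The following is a rough sketch: the function name and the threshold of 1000 are assumptions, and the regular expressions rely only on the start=, hits= and QTime= fields visible in the log entries quoted in this post.

```javascript
// Scan Solr request log lines and report requests with a suspiciously deep start.
// minStart = 1000 mirrors the "at least 4 digits" rule of the grep above.
function findDeepPaging(logLines, minStart = 1000) {
  const found = [];
  for (const line of logLines) {
    const start = /[?&{]start=(\d+)/.exec(line);
    if (!start || Number(start[1]) < minStart) continue;
    const hits = /hits=(\d+)/.exec(line);
    const qtime = /QTime=(\d+)/.exec(line);
    found.push({
      start: Number(start[1]),
      hits: hits ? Number(hits[1]) : null,
      qtimeMs: qtime ? Number(qtime[1]) : null,
    });
  }
  return found;
}

// Shortened stand-ins for the log entries shown in this post.
const sample = [
  "params={q=facebook.com&start=100&rows=20} hits=53532331 status=0 QTime=2181",
  "params={q=facebook.com&start=1002020&rows=20} hits=15355247 status=0 QTime=166414",
];
console.log(findDeepPaging(sample));
```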

As 402020 document representations * 107 shards equals 43 million document representations, the master JVM might have survived with -Xmx12g, if not for the huge number of tiny objects overloading the garbage collector.

Fixes and takeaways

Easy fix of course: Cast this.start in the JavaScript layer to integer and enforce an upper limit for start & rows in the middle layer for good measure.
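Both halves of the fix can be sketched in a few lines. This is an illustration under stated assumptions, not the code that went into SolrWayback: the function names and the limits MAX_START and MAX_ROWS are hypothetical, and the middle-layer guard is shown in JavaScript purely for consistency with the frontend snippet.

```javascript
const MAX_START = 10000; // assumed upper limit, not taken from SolrWayback
const MAX_ROWS = 100;    // assumed upper limit, not taken from SolrWayback

// Turn whatever arrives from the URL into a non-negative integer (0 on garbage).
function toNonNegativeInt(raw) {
  const n = Number.parseInt(raw, 10);
  return Number.isInteger(n) && n >= 0 ? n : 0;
}

// Frontend: parse start before doing arithmetic on it.
function nextStart(start, totalHits, pageSize = 20) {
  const current = toNonNegativeInt(start);
  return current + pageSize < totalHits ? current + pageSize : current;
}

// Middle layer: clamp start and rows no matter what the client sent.
function clampPaging(start, rows) {
  return {
    start: Math.min(toNonNegativeInt(start), MAX_START),
    rows: Math.min(Math.max(toNonNegativeInt(rows), 1), MAX_ROWS),
  };
}

console.log(nextStart("100", 53532331));
console.log(clampPaging("1002020", "20"));
```

The server-side clamp is the important part: it protects the aggregating Solr node even if some other client sends a deep start by accident or on purpose.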

For next time we’ve learned to

  • Closely examine slow queries (Captain Obvious says hello)
  • Keep GC-logs a few restarts back (we only had the current and the previous one to start with)
  • Plot the GC pauses to see if there are spikes that did not crash the JVM (trivial to do with gceasy.io), then inspect the request logs around the same time as the spikes

Posted in eskildsen, Solr

Touching encouraged (an ongoing story)

A recurring theme at KB Labs is to show a lot of pixels. By chance we got our hands on a 4K touch-sensitive display, capable of showing a non-trivial amount of said pixels on a non-trivial surface area. Our cunning plan is to

  • Adapt some of the labs products to work on the display
  • Put the display somewhere in the public area of the library
  • Watch people swoon when they delve into beautiful cultural heritage data

This post is intended to be a diary of sorts, journaling what we learn.

Coincidental activation (2019-10-10)

We have talked about experimenting with interactive large displays for years. With emphasis on talked.

It took someone with youthful initiative to actually do something: Max Odsbjerg Pedersen discovered a usable & unused display and promptly sent us a video showing him using a labs product on the display. 4 days later he brokered a loan agreement and 10 days later we verified that no one questions two people removing a large display, as long as it is transported in a cardboard box.

Adding heavy box moving to résumé

Fair warning (2019-10-23)

The software development department has a – not entirely undeserved – reputation of being loose cannons that tend to muck about in ways that unexpectedly affect other departments.

To atone for blunders past and primarily because it really is the most constructive practice, representatives of the Cultural Heritage and the Communications departments were duly informed about the initiated process and invited to participate in discussion hereof. In other words: We met them at lunch as usual and said “Hey, we’ve got this nifty idea …”, to which they answered “Sounds good!”.

What have we got? (2019-10-24)

The display is a 55″ Samsung Flip. Its internal software seems focused on providing a digital flip-over with some connectivity possibilities? It does not have a built-in web browser, but connecting it to a Windows 10 machine is exceedingly easy. We will just have to duct tape a laptop to its back or something to that effect.

It was a pleasant surprise to discover that the excellent tool OpenSeadragon works perfectly out of the box with multi touch on desktop browsers: Tap, double tap, drag, pinch & spread. Well, as long as you are not a lazy developer that still uses a pre-2017 version of OpenSeadragon where pinching is wonky *cough*.

Adapt, by which we mean “remove stuff” (2019-10-25)

Three KB Labs products, which would benefit from a large display, were selected:

At the core they are all web pages using OpenSeadragon and as such, adaptation mostly meant removing features and interface elements. A simple navigational area was added to switch between the products and the PoC Mark I alpha was born: Best viewed on a 55″ tablet or larger.

Secure, by which we mean “fail” (2019-10-25)

Since the whole thing is intended for public display & interaction, we want to make sure the users stay on the designated pages.

A developer navigating one of the designated pages

Pressing F11 switches to full screen with no navigational bar in Chrome and the end users do not have access to the keyboard, so problem solved? Our boss Bjarne Andersen passed by, stopped and played with the presentations. It took him 2 minutes to somehow activate right-click and presto, the box was cracked. Thanks boss!

Well, Chrome has a designated kiosk mode: Simply add -kiosk as an argument and all is taken care of. At least until co-worker Kim Christensen discovers that there is a handy swipe-from-a-vertical-edge gesture that opens the Windows menu and other elements. Cracked again. Thanks Kim!

Disabling swipe gestures did not seem possible without admin rights, which we do not have on the current computer. There seems to be a Windows kiosk mode that also requires admin rights. Oh well, maybe Monday? Weekend calls.

Broken Windows and tweaks (2019-10-28)

Colleague Thomas Egense brought a private Windows laptop to work (no worries, we only connect those to the eduroam network).

  • It would not connect to the large display. Reboot.
  • It did connect to the display in 4K, but not to WiFi. Reboot.
  • It did connect to WiFi, but would no longer connect to the display. Reboot.
  • Same. Reboot.
  • Same. Give up.
  • Actively avoid defenestrating the laptop. Drink coffee.

At least it went a little better when colleague Gry Vindelev Elstrøm stopped by. She suggested adding some sort of map overlay, so that the users would not get lost in the big collages? And of course OpenSeadragon delivers again: 90 seconds and a reload later the wish was granted:

OpenSeadragon with Navigator overlay

Gry’s other wish, to have visual-similarity spatial grouping of the maps collection, is both valid and entirely possible to fulfill. Buuut it is not a 90 second job and the touch screen project is a side project, so that idea was filed under When We Find The Time.

And then they were two (2019-10-29)

Heroic display digger Max Odsbjerg Pedersen phoned in and said he had found a twin display lying around. He’ll put it up somewhere at AU Library, Nobel Park, mirroring the display we’re working with at the Royal Danish Library, Aarhus. Thank you, Max. You do realize we’re at the early Proof of Concept stage, right?

Go ahead is a given (2019-10-30)

Gitte Bomholt Christensen deals with the public space at the library. She visited to take a look at the project. Her first question was if we should put the display on a movable stand or if a wall mount were better. We’ll take that as a “Yes, we’ll go forward with this experiment”.

Soon they will be five (2019-10-31)

Early in the day miniboss Katrine Hofmann Gasser asked for requirements for 3 extra touch displays. Later in the day, miniboss Katrine Hofmann Gasser had ordered 3 extra touch displays.

Damn, people! What happened to the concept of testing a minimum viable product followed by iterative improvement?

The hunt for 4K (2019-11-05)

The afternoon was not free (they never are), but at least it was not bogged down with other stuff. So what about upping the resolution from HD to 4K? How hard can that be? Yeah, 4 trips to Operations and 3 different computers produced the new knowledge that passive DisplayPort → HDMI cables have trouble delivering the goods. Native HDMI 1.4 handles 4k though: Admittedly at 30Hz only, but that works well enough when the interface reflects finger movements directly. The only situation where the 30Hz is noticeable is when the user pans by flinging.

Gridified tSNE (2019-11-06)

Running image collections through a trained network and positioning them according to their visual similarity is one of those “the future is now”-things. One favourite is Pix-Plot which produces an interactive 3D-visualization. But the touch screen is meant for large images and Pix-Plot is not made to display those. Plotting directly to 2D does not solve the problem:

A bit hard to enjoy all the images when they cover each other

A marriage between Pix-Plot and the existing zoomable grid-based layout was proposed. Some hacking later with the tools ml4a & RasterFairy and… Yeah, kinda-sorta? As can be seen on the screenshot below, there are definitely areas of similar images, but there are also multiple areas that look like they should be collapsed into one. Something’s off, but that will have to be solved later.

There’s definitely some grouping there. And groups that look suspiciously similar
There are no image duplicates – we checked!

Frontpage material (2019-11-14)

Thomas Egense wanted something else on the large touch screen, so he extracted all frontpages from the Danish newspaper Berlingske Tidende, available from Mediestream (of course he cheated and took an internal shortcut). It is quite an old newspaper, so “all” means 68,367 pages.

Rendering 68K fairly-high-res images is no technical problem, but as our scans are greyscale the result was somewhat bland and extremely cumbersome to navigate with intention.

A sea of grey

Thankfully newspaper frontpages do possess one singular reliable piece of metadata: The date of the paper. Adding an ugly date picker was easy enough and presto! Intuitive navigation of the primary navigational axis for the material.

Proper tSNE (2019-11-25)

A breakthrough discovery was made today: If you clumsily swap the x and y axis for the coordinates, but keep the calculated width and height, when you plot gridified data, the result is … less good. Corollary: If you un-swap said axes the result looks much better! As demonstrated by these before and after images:

Sorry about doing this on a not-fully-public-yet dataset (the awesome “anskuelsesbilleder” at AU Library, Emdrup): We can only show the scaled-down versions of the properly gridified tSNE layout, but they should convey the gist.

Maybe machines can label our stuff? (2019-11-27)

Since machine learning was great at positioning images according to visual similarity (or rather a mix of visual similarity and content similarity), maybe use it to automatically label our material? Well, not so much with the collection of anskuelsesbilleder: The network (imagenet) is great for labelling photographs but poor for drawings: “Binder”, “Web site”, “Envelope” and “Jigsaw puzzle” were recurring and absolutely wrong guesses. Again, sorry for not being allowed to show the images. Hopefully later, when the rights have been cleared, ok?

Ideas aplenty (2019-11-27)

Karen Williams was the nearest innocent bystander to show the latest experiment with the large touch screen and she upped the ante, asking for drive-by crowd-sourcing capabilities and visualization of sound. So much untapped potential in this!

Organisations gotta organise (2019-11-28)

One does not simply walk down and put a touch screen on the wall. It is hard to have patience with a new toy in hand, but it is understandable that a mix of people from different departments must participate in something that involves display of cultural heritage data in the public areas of the library. Unfortunately it will be nearly 2 weeks before said mix of people are able to meet and determine how to go about the project. Deep breath. Inner peace. Tempus fugit.

Pong detour (2019-12-06)

Annual Christmas party time! And Jesper Lauridsen did not miss the opportunity that a big touch screen presented. He whipped up a multi-ball Pong game where the balls were the faces of the people at the party. Will you be hitting your colleague with a bat or let said colleague fall into oblivion? Great success and nobody spilled beer on the touch table! And no, sorry, not allowed to show it due to the face thing. Privacy and all.

Last details finished in the Real Hardware department by leet hacker Jesper

Proper public tSNE (2019-12-09)

The image classification → tSNE → RasterFairy → juxta chain is our new golden hammer, and the next collection to be hit was our Maps & Atlases collection. Given that the network was never trained explicitly for the minute differences in maps, it went surprisingly well. And this time we’re allowed to show the result:

Don’t just sit there! Try it yourself

Secure, by which we mean “nearly succeed” (2019-12-10)

There was a meeting with The Right People and it took all of 3½ minutes to decide that yes, the large tablet should definitely be displayed in the public areas. Then it took 10 minutes to hammer out the details: The plan is to mount it on wheels and move it around to see where it works best. Progress!

The afternoon was spend trying to make the big screen less easy to hack. It is driven by an Ubuntu 19.10 box using Google Chrome as the browser. As discovered earlier, Chrome has a “kiosk” mode, which disables right click, the address bar and more. Easy!

The real problem was Ubuntu itself: It has tablet support, meaning clever swipe gestures that activates program selection, unmaximizes windows, shows an on screen keyboard and other goodies. Goodies meaning baddies, when trying to build a display kiosk! Most of the solution was to use the Disable Gestures extension (and reboot to get the full disablement), but the on screen keyboard (activated by swiping in from the bottom of the screen) is apparently hard baked into the system (Block Caribou did not help us). We might have to uninstall it completely.

To be continued

DocValues jump tables in Lucene/Solr 8

Lucene/Solr 8 is about to be released. Among a lot of other things it brings LUCENE-8585, written by yours truly with a heap of help from Adrien Grand. LUCENE-8585 introduces jump-tables for DocValues. It is all about performance and brings speed-ups ranging from worse than baseline to 1000x, extremely dependent on index and access pattern.

This is a follow-up post to Faster DocValues in Lucene/Solr 7+. The previous post contains an in-depth technical explanation of the DocValues mechanisms, while this post focuses on the final implementation.

DocValues?

Whenever the content of a field is to be used for grouping, faceting, sorting, stats or streaming in Solr (or Elasticsearch or Lucene, where applicable), it is advisable to store it using DocValues. It is also used for document retrieval, depending on setup.

DocValues in Lucene 7: Linked lists

Lucene 7 shifted the API for DocValues from random access to sequential. This meant smaller storage footprint and cleaner code, but also caused the worst case single value lookup to scale linearly with document count: Getting the value for a DocValued field from the last document in a segment required a visit to all other value blocks.

The linear access time was not a problem for small indexes or requests for values for a lot of documents, where most blocks need to be visited anyway. Thus the downside of the change was largely unnoticeable or at least unnoticed. For some setups with larger indexes, it was very noticeable and for some of them it was also noticed. For our netarchive search setup, where each segment has 300M documents, there was a severe performance regression: 5-10x for common interactive use.

Text book optimization: Jump-tables

The Lucene 7 DocValues structure behaves as a linked list of data-nodes, with the specializations that it is built sequentially and that it is never updated after the build has finished. This makes it possible to collect the node offsets in the underlying index data during build and to store an array of these offsets along with the index data.

With the node offsets quickly accessible, worst-case access time for a DocValue entry becomes independent of document count. Of course, there is a lot more to this: See the previously mentioned Faster DocValues in Lucene/Solr 7+ for details.
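To make the principle concrete, here is a minimal sketch in Python. It is not Lucene's on-disk format: the length-prefixed byte layout and the names build, get_linear and get_jump are made up for the illustration. Nodes are written sequentially, their offsets are collected during the build, and a lookup either hops from node to node or uses the collected offsets directly.

```python
import struct

def build(payloads):
    """Write nodes sequentially and record each node's offset while building."""
    data = bytearray()
    offsets = []                          # the jump table
    for payload in payloads:
        offsets.append(len(data))
        data += struct.pack(">I", len(payload)) + payload
    return bytes(data), offsets

def get_linear(data, index):
    """Linked-list style: hop from node to node until the wanted one is reached."""
    pos, hops = 0, 0
    for _ in range(index):
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4 + length                 # skip length prefix and payload
        hops += 1
    (length,) = struct.unpack_from(">I", data, pos)
    return data[pos + 4:pos + 4 + length], hops

def get_jump(data, offsets, index):
    """Jump-table style: seek directly to the node, no intermediate hops."""
    pos = offsets[index]
    (length,) = struct.unpack_from(">I", data, pos)
    return data[pos + 4:pos + 4 + length], 0

data, offsets = build([b"a", b"bb", b"ccc", b"dddd"])
print(get_linear(data, 3))                # (b'dddd', 3)
print(get_jump(data, offsets, 3))         # (b'dddd', 0)
```

The number of hops in get_linear grows with the index of the wanted node, while get_jump does the same amount of work for any node.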

One interesting detail for jump-tables is that they can be built both as a cache on first access (see LUCENE-8374) and baked into the index-data (see LUCENE-8585). I would have much preferred having both options available in Lucene, to get instant speed-up with existing indexes and a technically superior implementation for future indexes. Alas, only LUCENE-8585 was deemed acceptable.

Best case test case

Our netarchive search contains 89 Solr collections, each holding 300M documents in 900GB of index data. Each collection is 1 shard, merged down to 1 segment and never updated. Most fields are DocValues and they are heavily used for faceting, grouping, statistics, streaming exports and document retrieval. The impact of LUCENE-8585 should be significant.

In netarchive search, all collections are searched together using an alias. For the tests below only a single collection was used for practical reasons. There are three contenders:

  1. Unmodified Solr 7 collection, using Solr 8.0.0 RC1. Codename Solr 7.
    In this setup, jump-tables are not active as Solr 8.0.0 RC1, which includes LUCENE-8585, only supports index-time jump-tables. This is the same as Solr 7 behaviour.
  2. Solr 7 collection upgraded to Solr 8, using Solr 8.0.0 RC1. Codename Solr 8r1.
    In this setup, jump-tables are active and baked into the index data. This is the expected future behaviour when Solr 8 is released.
  3. Solr 7 collection, using Lucene/Solr at git commit point 05d728f57a28b9ab83208eda9e98c3b6a51830fc. Codename Solr 7 L8374.
    During LUCENE-8374 (search time jump tables) development, the implementation was committed to master. This was later reverted, but the checkout allows us to see what the performance would have been if this path had been chosen.

Test hardware is a puny 4-core i5 desktop with 16GB of RAM, a 6TB 7200RPM drive and a 1TB SSD. About 9GB of RAM free for disk cache. Due to time constraints only the streaming export test has been done on the spinning drive, the rest is SSD only.

Streaming exports

  • Premise: Solr’s export function is used by us to extract selected fields from the collection, typically to deliver a CSV-file with URLs, MIME types, file sizes etc for a corpus defined by a given filter. It requires DocValues to work.
  • DV-Problem: The current implementation of streaming export in Solr does not retrieve the field values in document order, making the access pattern extremely random. This is absolute worst case for sequential DocValues. Note that SOLR-13013 will hopefully remedy this at some point.

The test performs a streaming export of 4 fields for 52,653 documents in the 300M index. The same export is done 4 times, to measure the impact of caching.

curl '<solr>/export?q=text:hestevogn&sort=id+desc&fl=content_type_ext,content_type_served,crawl_date,content_length'

                      run 1    run 2    run 3    run 4
                    seconds  seconds  seconds  seconds
Solr 7 spin            1705     1297     1352     1314
Solr 8r1 spin           834        3        2        1
Solr 7 L8374 spin       935        1        1        1
Solr 7 SSD             1276     1258     1262     1262
Solr 8r1 SSD             16        1        2        1
Solr 7 L8374 SSD         15        1        1        1

Observation: Both Solr 8r1 and Solr 7 L8374 vastly outperform Solr 7. On a spinning drive there is a multi-minute penalty for run 1, after which the cache has been warmed. This is a well known phenomenon.

Faceting

  • Premise: Faceting is used everywhere and it is a hard recommendation to use DocValues for the requested fields.
  • DV-Problem: Filling the counters used when faceting is done in document order, which works well with sequential access as long as the jumps aren’t too long: Small result sets are penalized relatively harder than large result sets.

The test performs simple term-based searches with top-20 faceting on 6 fields of varying type and cardinality: domain, crawl_year, public_suffix, content_type_norm, status_code and host.

Reading the graphs: All charts in this blog post follow the same recipe:

  • X-axis is hit count (aka result set size), y-axis is response time (lower is better)
  • Hit counts are bucketed by order of magnitude and for each magnitude, boxes are shown for the three contenders: Blue boxes are Solr 7, pink are Solr 8r1 and green are Solr 7 L8374
  • The bottom of a box is the 25th percentile, the top is the 75th percentile. The black line in the middle is the median. Minimum response time for the bucket is the bottom spike, while the top spike is the 95th percentile
  • Maximum response times are not shown as they tend to jitter severely due to garbage collection

Observation: Modest gains from jump-tables with both Solr 8r1 and Solr 7 L8374.
Surprisingly the gains scale with hit count, which should be investigated further.

Grouping

  • Premise: Grouping is used in netarchive search to collapse multiple harvests of the same URL. As with faceting, using DocValues for grouping fields is highly recommended.
  • DV-Problem: As with faceting, group values are retrieved in document order and follow the same performance/scale logic.

The test performs simple term-based searches with grouping on the high-cardinality (~250M unique values) field url_norm.

Observations: Modest gains from jump-tables, similar to faceting.

Sorting

  • Premise: Sorting is a basic functionality.
  • DV-Problem: As with faceting and grouping, the values used for sorting are retrieved in document order and follow the same performance/scale logic.

This test performs simple term-based searches with sorting on the high-cardinality field content_length.

Observations: Modest gains from jump-tables.
Contrary to faceting and grouping, performance for high hit counts is the same for all 3 setups, which fits with the theoretical model.
Positively surprising is that the theoretical overhead of the jump-tables does not show for higher hit counts.

Document retrieval

  • Premise: Content intended for later retrieval can either be stored explicitly or as docValues. Doing both means extra storage, but also means that everything is retrieved from the same stored (and compressed) blocks, minimizing random access to the data. For the netarchive search at the Royal Danish Library we don’t double-store field data and nearly all of the 70 retrievable fields are docValues.
  • DV-Problem: Getting a search result is a multi-step process. Early on, the top-X documents matching the query are calculated and their IDs are collected. After that the IDs are used for retrieving document representations. If this is done from DocValues, it means random access linear to the number of documents requested.

The test performs simple term-based relevance ranked searches for the top-20 matching documents with 9 core fields: id, source_file_s, url_norm, host, domain, content_type_served, content_length, crawl_date and content_language.

Observations: Solid performance improvement with jump-tables.

Production request

  • Premise: The different functionalities are usually requested in combination. At netarchive search a typical request uses grouping, faceting, cardinality counting and top-20 document retrieval.
  • DV-Problem: Combining functionality often means that separate parts of the index data are accessed. This can cause cache thrashing if there is not enough free memory for disk cache. With sequential DocValues, all intermediate blocks need to be visited, increasing the need for disk cache. Jump-tables lower the number of storage requests and are thus less reliant on cache size.

The test performs simple term-based relevance ranked searches for the top-20 matching documents, doing grouping, faceting, cardinality and document retrieval as described in the tests above.

Observations: Solid performance improvement with jump tables.
As with the previous analysis of search-time jump tables, utilizing multiple DocValues-using functionality has a cocktail effect where the combined impact is larger than the sum of the parts. This might be due to disk cache thrashing.

Overall observations & conclusions

  • The effect of jump tables, both with Solr 8.0.0 RC1 and LUCENE-8374, is fairly limited, except for export and document retrieval, where the gains are solid.
  • The two different implementations of jump tables perform very similarly. Do remember that these tests do not involve index updates at all: As LUCENE-8374 is search-time, it does have a startup penalty when indexes are updated.

For the large segment index tested above, the positive impact of jump tables is clear. Furthermore there is no significant slowdown for higher hit counts with faceting/grouping/statistics, where the jump tables have no positive impact.

Before running these tests, it was my suspicion that the search-time jump tables in LUCENE-8374 would perform better than the baked-in version. This turned out not to be the case. As such, my idea of combining the approaches by creating in-memory copies of some of the on-storage jump tables has been shelved.

Missing

Performance testing is never complete, it just stops. Some interesting things to explore could be

  • Spinning drives
  • Concurrent requests
  • Raw search speed with rows=0
  • Smaller corpus
  • Variations of rows, facet.limit and group.limit
  • Kibana and similar data-visualization tools

Faster DocValues in Lucene/Solr 7+

This is a fairly technical post explaining LUCENE-8374 and its implications on Lucene, Solr and (qualified guess) Elasticsearch search and retrieval speed. It is primarily relevant for people with indexes of 100M+ documents.

Teaser

We have a Solr setup for Netarchive Search at the Royal Danish Library. Below are response times grouped by the magnitude of the hitCount with and without the Lucene patch.

Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

As can be seen, the median response time with the patch is about half that of vanilla Solr. The 95th percentile shows that the outliers have also been markedly reduced.

Long explanation follows as to what the patch does and why indexes with less than 100M documents are not likely to see the same performance boost.

Lucene/Solr (birds eye)

Lucene is a framework for building search engines. Solr is a search engine built using Lucene. Lucene, and thereby Solr, is known as an inverted index, referring to the terms→documents structure that ensures fast searches in large amounts of material.

As with most things, the truth is a little more complicated. Fast searches are not enough: Quite obviously it also helps to deliver a rich document representation as part of the search. More advanced features are grouping, faceting, statistics, mass exports etc. All of these have in common that they at some point need to map documents→terms.

Lucene indexes are logically made up of segments containing documents made up of fields containing terms (or numbers/booleans/raws…). Fields can be

  • indexed for searching, which means terms→documents lookup
  • stored for document retrieval
  • docValues for documents→terms lookup

stored and docValues representations can both be used for building a document representation as part of common search. stored cannot be used for grouping, faceting and similar purposes. The two strengths of stored are

  • Compression, which is most effective for “large” content.
  • Locality, meaning that all the terms for stored fields for a given document are stored together, making it low-cost to retrieve the content for multiple fields.

Whenever grouping, faceting etc. need the documents→terms mapping, it can either be resolved from docValues, which are built for this exact purpose, or by un-inverting the indexed terms. Un-inversion costs time & memory, so the strong recommendation is to enable docValues for grouping, faceting etc.
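The two directions of lookup can be illustrated with a few lines of Python. This is a toy model using plain dictionaries, not how Lucene stores anything, and the mini-corpus and the function names invert and uninvert are invented for the example.

```python
from collections import defaultdict

# Hypothetical mini-corpus: document ID -> terms in a single field
docs = {0: ["solr", "lucene"], 1: ["solr"], 2: ["lucene", "index"]}

def invert(docs):
    """terms -> documents: the lookup an indexed field provides."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            inverted[term].append(doc_id)
    return dict(inverted)

def uninvert(inverted):
    """documents -> terms, rebuilt from the inverted structure.

    Every posting list has to be visited and the result held in memory,
    which is the cost that docValues avoid by storing this direction up front.
    """
    forward = defaultdict(list)
    for term in sorted(inverted):
        for doc_id in inverted[term]:
            forward[doc_id].append(term)
    return dict(forward)

inverted = invert(docs)
print(inverted["solr"])            # [0, 1]
print(uninvert(inverted)[2])       # ['index', 'lucene']
```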

DocValues in Lucene/Solr 7+ (technical)

So the mission is to provide a documents→terms (and numbers/booleans/etc) lookup mechanism. In Lucene/Solr 4, 5 & 6 this mechanism had a random access API, meaning that terms could be requested for documents in no particular order. The implementation presented some challenges and from Lucene/Solr 7 this was changed to an iterator API (see LUCENE-7407), meaning that terms must be resolved in increasing document ID order. If the terms are needed for a document with a lower ID than previously requested, a new iterator must be created and the iteration starts from the beginning.
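The consequence of a forward-only API can be sketched like this. The class ForwardOnlyValues and the helper lookup are hypothetical stand-ins written for this post; they only mimic the rule that lookups must move forward, not the real Lucene classes.

```python
class ForwardOnlyValues:
    """Toy stand-in for an iterator-style docValues API."""

    def __init__(self, values):
        self.values = values          # values[docID], None when the doc has no value
        self.doc_id = -1

    def advance_exact(self, target):
        if target < self.doc_id:
            raise ValueError("cannot go backwards; create a new iterator")
        self.doc_id = target
        return self.values[target]

def lookup(values, doc_ids):
    """Resolve values for doc IDs in the given order, counting iterators created."""
    iterator, created, out = ForwardOnlyValues(values), 1, []
    for doc_id in doc_ids:
        if doc_id < iterator.doc_id:  # going backwards means starting over
            iterator, created = ForwardOnlyValues(values), created + 1
        out.append(iterator.advance_exact(doc_id))
    return out, created

values = ["a", None, "c", "d"]
print(lookup(values, [0, 2, 3]))      # (['a', 'c', 'd'], 1)
print(lookup(values, [3, 0, 2]))      # (['d', 'a', 'c'], 2)
```

Requests in increasing document ID order get by with a single iterator, while out-of-order requests pay for a restart each time the ID decreases.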

Most of the code for this is available in Lucene70DocValuesProducer and IndexedDISI. Digging into it, the gains from the iterative approach become apparent: Besides a very clean implementation with lower risk of errors, the representation is very compact and requires very little heap to access. Indeed, the heap requirement for the search nodes in Netarchive Search at the Royal Danish Library was nearly halved when upgrading from Solr 4 to Solr 7. The compact representation is primarily the work of Adrien Grand in LUCENE-7489 and LUCENE-7589.

When reading the wall of text below, it helps to mentally view the structures as linked lists: To get to a certain point in the list, all the entries between the current entry and the destination entry need to be visited.

DocValues sparseness and packing

It is often the case that not all documents contain terms for a given field. When this is the case, the field is called sparse.

A trivial representation for mapping documents→terms for a field with 0 or 1 long values per document would be an array of long[#documents_in_segment], but this takes up 8 bytes/document, whether the document has a value defined or not.

LUCENE-7489 optimizes sparse values by using indirection: First step is to determine whether a document has a value or not. If it has a value, an index into a value-structure is derived. The second step is to retrieve the value from the value-structure. IndexedDISI takes care of the first step:

For each DocValues field, documents are grouped in blocks of 65536 documents. Each block starts with meta-data stating the block-ID and the number of documents in the block that have a value for the field. There are 4 types of blocks:

  • EMPTY: 0 documents in the block have a term.
  • SPARSE: 1-4095 documents in the block have a term.
  • DENSE: 4096-65535 documents in the block have a term.
  • ALL: 65536 documents in the block have a term.
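As a small illustration of the four block types, the Python sketch below counts values per block and names the type from the count. The function names are invented for the example, and it ignores the detail that the last block of a segment may hold fewer than 65536 documents.

```python
BLOCK_SIZE = 65536

def block_type(docs_with_value):
    """Name the block type from the number of documents with a value."""
    if docs_with_value == 0:
        return "EMPTY"
    if docs_with_value <= 4095:
        return "SPARSE"
    if docs_with_value < BLOCK_SIZE:
        return "DENSE"
    return "ALL"

def block_types(doc_ids_with_value, doc_count):
    """Count values per block of 65536 documents and classify each block."""
    counts = [0] * -(-doc_count // BLOCK_SIZE)     # ceiling division
    for doc_id in doc_ids_with_value:
        counts[doc_id // BLOCK_SIZE] += 1
    return [block_type(count) for count in counts]

ids = (list(range(65536, 65546))          # 10 values in block 1
       + list(range(131072, 136072))      # 5000 values in block 2
       + list(range(196608, 262144)))     # every document in block 3
print(block_types(ids, 4 * BLOCK_SIZE))   # ['EMPTY', 'SPARSE', 'DENSE', 'ALL']
```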

Step 1.1: Block skipping

To determine if a document has a value and what the index of the value is, the following pseudo-code is used:

while (blockIndex < docID/65536) {
  valueIndex += block.documents_with_values_count
  block = seekToBlock(block.nextBlockOffset)
  blockIndex++
}
if (!block.hasValue(docID%65536)) {  // No value for docID
  return
}
valueIndex += block.valueIndex(docID%65536)

Unfortunately it does not scale with index size: At the Netarchive at the Royal Danish Library, we use segments with 300M values (not a common use case), which means that 4,500 blocks must be iterated in the worst case.

Introducing an indexValue cache solves this and the code becomes

valueIndex = valueCache[docID/65536]
block = seekToBlock(offsetCache[docID/65536])
if (!block.hasValue(docID%65536)) {  // No value for docID
  return
}
valueIndex += block.valueIndex(docID%65536)

The while-loop has been removed and getting to the needed block is constant-time.
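A runnable sketch of the two approaches, limited to the valueIndex part: the blocks are modelled as a plain Python list of per-block value counts, and the cache is simply the prefix sums of that list. The 300M documents with a value for every document is a hypothetical setup chosen to resemble the numbers above; the function names are made up.

```python
from itertools import accumulate

BLOCK_SIZE = 65536

def base_linear(block_counts, doc_id):
    """Sum the value counts of every block before the one holding doc_id."""
    value_index, visited = 0, 0
    for block_index in range(doc_id // BLOCK_SIZE):
        value_index += block_counts[block_index]
        visited += 1
    return value_index, visited

def build_value_cache(block_counts):
    """Prefix sums: value_cache[i] = number of values in the blocks before block i."""
    return [0] + list(accumulate(block_counts))[:-1]

def base_cached(value_cache, doc_id):
    """Constant-time replacement for the loop in base_linear."""
    return value_cache[doc_id // BLOCK_SIZE]

doc_count = 300_000_000
blocks = -(-doc_count // BLOCK_SIZE)               # ceiling division
block_counts = [BLOCK_SIZE] * (blocks - 1) + [doc_count - BLOCK_SIZE * (blocks - 1)]
value_cache = build_value_cache(block_counts)

last_doc = doc_count - 1
print(blocks)                                      # 4578
print(base_linear(block_counts, last_doc))         # (299958272, 4577)
print(base_cached(value_cache, last_doc))          # 299958272
```

The cache costs one number per block, which for a 300M document segment is a few thousand entries per field.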

Step 1.2: Block internals

Determining the value index inside of the block is trivial for EMPTY and ALL blocks. SPARSE is a list of the documentIDs with values that is simply iterated (this could be a binary search). This leaves DENSE, which is the interesting one.

DENSE blocks contain a bit for each of their 65536 documents, represented as a bitmap = long[1024]. Getting the value index is a matter of counting the set bits up to the wanted document ID:

inBlockID = docID%65536
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
}
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

This is not as bad as it seems as counting bits in a long is a single processor instruction on modern CPUs. Still, doing 1024 of anything to get a value is a bit much and this worst-case is valid for even small indexes.

This is solved by introducing another cache: rank = char[128], holding the number of set bits before every 8th long (a char is 16 bits):

inBlockID = docID%65536
valueIndex = rank[inBlockID/512]
inBlockIndex = inBlockID/512*8
while (inBlockIndex < inBlockID/64) {
  valueIndex += total_set_bits(bitmap[inBlockIndex++])
}
valueIndex += set_bits_up_to(bitmap[inBlockIndex], inBlockID%64)

Worst case is reduced to a rank-cache lookup and summing of the bits from 8 longs.
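The rank idea can be checked with a short Python sketch. It assumes one rank entry per 8 longs (512 documents), derived from the "8 longs" worst case stated above, and it counts the set bits before the wanted document, so the result is a 0-based value index. The function names and the random bitmap are made up for the test.

```python
import random

WORD_BITS = 64
WORDS_PER_RANK = 8        # assumption: one rank entry per 8 longs (512 documents)

def popcount(word):
    return bin(word).count("1")

def build_rank(bitmap):
    """rank[i] = number of set bits before word i * WORDS_PER_RANK."""
    rank, total = [], 0
    for i, word in enumerate(bitmap):
        if i % WORDS_PER_RANK == 0:
            rank.append(total)
        total += popcount(word)
    return rank

def value_index_linear(bitmap, in_block_id):
    """Count set bits from the start of the block: up to 1024 words."""
    word, bit = divmod(in_block_id, WORD_BITS)
    total = sum(popcount(w) for w in bitmap[:word])
    return total + popcount(bitmap[word] & ((1 << bit) - 1))

def value_index_ranked(bitmap, rank, in_block_id):
    """Start from the rank entry, then count at most 8 words."""
    word, bit = divmod(in_block_id, WORD_BITS)
    start = word // WORDS_PER_RANK * WORDS_PER_RANK
    total = rank[word // WORDS_PER_RANK]
    total += sum(popcount(w) for w in bitmap[start:word])
    return total + popcount(bitmap[word] & ((1 << bit) - 1))

rng = random.Random(87)
bitmap = [rng.getrandbits(64) for _ in range(1024)]    # one DENSE-like block
rank = build_rank(bitmap)
print(len(rank))                                       # 128
print(all(value_index_linear(bitmap, i) == value_index_ranked(bitmap, rank, i)
          for i in range(0, 65536, 97)))               # True
```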

Now that step 1: Value existence and value index has been taken care of, the value itself needs to be resolved.

Step 2: Whole numbers representation

There are different types of values in Lucene/Solr: Strings, whole numbers, floating point numbers, booleans and binaries. On top of that a field can be single- or multi-valued. Most of these values are represented in a way that provides direct lookup in Lucene/Solr 7, but whole numbers are special.

In Java whole numbers are represented in a fixed amount of bytes, depending on type: 1 byte for byte, 2 bytes for short or char, 4 bytes for integer and 8 bytes for long. This is often wasteful: The sequence [0, 3, 2, 1] could be represented using only 2 bits/value. The sequence [0, 30000000, 20000000, 10000000] could also be represented using only 2 bits/value if it is known that the greatest common divisor is 10⁷. The list of tricks goes on.

For whole numbers, Lucene/Solr uses both the smallest amount of bits required by PackedInts for a given sequence as well as greatest common divisor and constant offset. These compression techniques work poorly both for very short sequences and for very long ones; LUCENE-7589 splits whole numbers into sequences of 16384 numbers.
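The tricks can be tried out with a few lines of Python. This is only a sketch of the arithmetic (constant offset, greatest common divisor, bits per value), not the PackedInts implementation, and the names pack and unpack are invented for the example.

```python
from functools import reduce
from math import gcd

def pack(values):
    """Return (offset, divisor, bits_per_value, packed) for a block of whole numbers."""
    offset = min(values)
    shifted = [value - offset for value in values]
    divisor = reduce(gcd, shifted) or 1      # all-equal values give gcd 0
    packed = [value // divisor for value in shifted]
    bits = max(packed).bit_length()
    return offset, divisor, bits, packed

def unpack(offset, divisor, packed):
    """Reverse of pack: restore the original values."""
    return [offset + p * divisor for p in packed]

print(pack([0, 3, 2, 1]))                         # (0, 1, 2, [0, 3, 2, 1])
print(pack([0, 30000000, 20000000, 10000000]))    # (0, 10000000, 2, [0, 3, 2, 1])
print(pack([1000, 1016, 1008])[:3])               # (1000, 8, 2)
```

A single outlier in a long sequence raises the bits per value for all numbers in that sequence, which is one way to see why very long sequences compress poorly.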

Getting the value for a given index is a matter of locating the right block and extracting the value from that block:

while (longBlockIndex < valueIndex/16384) {
  longBlock = seekToLongBlock(longBlock.nextBlockOffset)
  longBlockIndex++
}
value = longBlock.getValue(valueIndex%16384)

This uses the same principle as for value existence and the penalty for iteration is also there: In our 300M documents/segment index, we have 2 numeric fields where most values are present. They have 28,000 blocks each, which must all be visited in the worst case.

The optimization is the same as for value existence: Introduce a jump table.

longBlock = seekToLongBlock(longJumps[valueIndex/16384])
value = longBlock.getValue(valueIndex%16384)

Value retrieval becomes constant time.

Theoretical observations

  • With a pure iterative approach, performance goes down when segment size goes up and the amount of data to retrieve goes up slower than index size. The performance slowdown only happens after a certain point! As long as the gap between the docIDs is small enough to be within the current or the subsequent data chunk, pure iteration is fine.
    Consequently, the requests that involve lots of monotonically increasing docID lookups (faceting, sorting & grouping for large result sets) fit the iterative API well as they need data from most data blocks.
    Requests that involve fewer monotonically increasing docID lookups (export & document retrieval for all requests, faceting, sorting & grouping for small result sets) fit poorly as they result in iteration over data blocks that do not provide any other information than a link to the next data block.
  • As all the structures are storage-backed, iterating all data blocks – even when it is just to get a pointer to the next block – means a read request. This is problematic, unless there is plenty of RAM for caching: Besides the direct read-time impact, the docValues structures will hog the disk cache.
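The difference between large and small result sets can be put into numbers with a small Python sketch. It assumes a single forward iteration that starts at block 0 and uses the 65536-document blocks from the value existence structure; the function name blocks_visited is made up.

```python
BLOCK_SIZE = 65536

def blocks_visited(doc_ids, block_size=BLOCK_SIZE):
    """Return (useful, pass_through) block counts for one forward iteration.

    useful: blocks holding at least one requested document.
    pass_through: blocks read only to find the link to the next block.
    Assumes iteration starts at block 0 and that doc_ids is non-empty.
    """
    needed = {doc_id // block_size for doc_id in doc_ids}
    last = max(needed)
    return len(needed), last + 1 - len(needed)

# Large result set: hits spread over every block of an 11-block segment
print(blocks_visited(range(0, 11 * BLOCK_SIZE, 1000)))    # (11, 0)
# Small result set: two hits, one in the first and one in the last block
print(blocks_visited([17, 10 * BLOCK_SIZE + 5]))          # (2, 9)
```

With a jump table the pass-through blocks are skipped, so the small result set reads 2 blocks instead of 11.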

With this in mind, it makes sense to check the patch itself for performance regressions with requests for a lot of values as well as test with the disk cache fully warmed and containing the structures that are used. Alas, this has to go on the to-do for now.

Tests

Hardware & index

Testing was done against our production Netarchive Search. It consists of 84 collections, accessed as a single collection using Solr’s alias mechanism. Each collection is roughly 300M documents / 900GB of index data optimized to 1 segment, each segment on a separate SSD. Each machine has 384GB of RAM with about 220GB free for disk cache. There are 4 machines, each serving 25 collections (except the last one that only serves 9 at the moment). This means that ~1% of total index size is disk cached.

Methodology

  • Queries were constructed by extracting terms of varying use from the index and permutating them for simple 1-4 term queries
  • All tests were against the full production index, issued at times when it was not heavily used
  • Queries were issued single-threaded, with no repeat queries
  • All test setups were executed 3 times, with a new set of queries each time
  • The order of patch vs. sans-patch tests was always patch first, to ensure that any difference in patch favour was not due to better disk caching

How to read the charts

All charts are plotted with number of hits on the x-axis and time on the y-axis. The x-axis is logarithmic with the number of hits bucketed by magnitude: First bucket holds all measurements with 1-9 hits, second bucket holds those with 10-99 hits, the third holds those with 100-999 hits and so forth.

The response times are displayed as box plots where

  • Upper whisker is the 95th percentile
  • Top of the box is the 75th percentile
  • Black bar is the 50th percentile (the median)
  • Bottom of the box is the 25th percentile
  • Lower whisker is minimum measured time
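For readers who want to produce similar charts from their own logs, the bucketing and the box values can be computed along these lines. This is a sketch, not the script used for the charts in this post: it uses the nearest-rank percentile definition (plotting tools may interpolate and give slightly different numbers), and the sample measurements are invented.

```python
import math
from collections import defaultdict

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def box_stats(measurements):
    """measurements: iterable of (hit_count, response_ms) with hit_count >= 1."""
    buckets = defaultdict(list)
    for hits, ms in measurements:
        buckets[len(str(hits)) - 1].append(ms)    # 1-9 -> 0, 10-99 -> 1, ...
    stats = {}
    for magnitude, times in sorted(buckets.items()):
        times.sort()
        stats[magnitude] = {
            "min": times[0],
            "p25": percentile(times, 25),
            "median": percentile(times, 50),
            "p75": percentile(times, 75),
            "p95": percentile(times, 95),
        }
    return stats

sample = [(3, 12), (7, 15), (42, 30), (99, 50), (10, 20), (55, 40)]
print(box_stats(sample)[1])
# {'min': 20, 'p25': 20, 'median': 30, 'p75': 40, 'p95': 50}
```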

Each bucket holds 4 boxes

  • Test run 2, patch enabled
  • Test run 2, vanilla Solr
  • Test run 3, patch enabled
  • Test run 3, vanilla Solr

Test run 1 is discarded to avoid jitter from cache warming. Ideally the boxes from run 3 should be the same as for run 2. However, as the queries are always new and unique, an amount of variation is to be expected.

Important note: The Y-axis max-value changes between some of the charts.

Document retrieval

There seems to be some disagreement as to whether the docValues mechanism should ever be used to populate documents, as opposed to using stored. This blog post will only note that docValues are indeed used for this purpose at the Royal Danish Library and let it be up to the reader to seek more information on the matter.

There are about 70 fields in total in Netarchive Search, with the vast majority being docValued String fields. There are 6 numeric DocValued fields.

Retrieval of top-20 documents with all field values

Observation: Response times for patched (blue & green) are markedly lower than vanilla (pink & orange). The difference is fairly independent of hit count, which matches well with the premise that the result set size is constant at 20 documents.

Grouping

Grouping on the String field url_norm is used in Netarchive Search to avoid seeing too many duplicates. To remove the pronounced difference caused by document retrieval, only the single field url_norm is requested for only 1 group with 1 document.

Grouping on url_norm

Observation: The medians for patched vs. vanilla are about the same, with a slight edge to patched. The outliers (the top T of the boxes) are higher for vanilla.

Faceting

Faceting is done for 6 fields of varying cardinality. As with grouping, document retrieval is kept to a minimum.

Faceting on fields domain, crawl_year, public_suffix, content_type_norm, status_code, host

Observation: Patched is an improvement over vanilla up to 10M+ hits.

Sorting

In this test, sorting is done descending on content_length, to locate the largest documents in the index. As with grouping, document retrieval is kept to a minimum.

Sorting on content_length

Observation: Patched is a slight improvement over vanilla.

Cardinality

In order to provide an approximate hitCount with grouping, the cardinality of the url_norm field is requested. As with grouping, document retrieval is kept to a minimum.

HyperLogLog cardinality on url_norm

Observation: Too much jitter to say if patch helps here.

Numeric statistics

Statistics (min, max, average…) on content_length is a common use case in Netarchive Search. As with grouping, document retrieval is kept to a minimum.

Numeric statistics on content_length

Observation: Patched is a slight improvement over vanilla.

Cocktail effect, sans document

Combining faceting, grouping, stats and sorting while still minimizing the effect of document retrieval.

Faceting on 6 fields, grouping on url_norm, stats on content_length and sorting on content_length

Observation: Patched is a clear improvement over vanilla.

Production request combination

The SolrWayback front end for Netarchive Search commonly uses document retrieval for top-10 results, grouping, cardinality and faceting. This is the same chart as the teaser at the top, with the addition of test run 2.

Grouping on url_norm, cardinality stats on url_norm, faceting on 6 fields and retrieval of all stored & docValued fields for the top-10 documents in our search result.

Observation: Patched is a pronounced improvement over vanilla.

The combination of multiple docValues-using request parameters is interesting, as the effect of the patch on the whole seems greater than the sum of the individual parts. This could be explained by cache/IO saturation when using vanilla Solr. Whatever the cause, this shows that it is important to try to simulate real-world workflows as closely as possible.

Overall observations

  • For most of the performance tests, the effect of the LUCENE-8374 patch vs. vanilla is pronounced, but limited in magnitude
  • Besides lowering the median, there seems to be a tendency for the patch to reduce outliers, notably for grouping
  • For document retrieval, the patch improved performance significantly. Separate experiments show that export gets a similar speed boost
  • For all the single-feature tests, the active parts of the index data are so small that they are probably cached. Coupled with the limited improvement that the patch gives for these tests, it indicates that the patch will in general have little effect on systems where the full index is in disk cache
  • The performance gains with the “Production request combination” aka the standard requests from our researchers, are very high

Future testing

  • Potential regression for large hit counts
  • Max response times (not just percentile 95)
  • Concurrent requests
  • IO-load during tests
  • Smaller corpus
  • Export/streaming
  • Disk cache influence

Want to try?

There is a patch for Solr trunk at LUCENE-8374 and it needs third party validation from people with large indexes. I’ll port it to any Solr 7.x-version requested and if building Solr is a problem, I can also compile it and put it somewhere.

Hopefully it will be part of Solr at some point.

Update 20181003: Patch overhead and status

Currently the patch is search-time only. Technically it could also be index-time by modifying the codec.

For a single index in the Netarchive Search setup, the patch adds 13 seconds to first search-time and 31MB of heap out of 8GB allocated for the whole Solr. The 13 seconds is in the same ballpark (this is hard to measure) as a single unwarmed search with top-1000 document retrieval.

The patch is ported to Solr 7.3.0 and used in production at the Royal Danish Library. It is a debug-patch, meaning that the individual optimizations can be enabled selectively for easy performance comparison.

See the LUCENE-8374 JIRA-issue for details.

Prebuild Big Data Word2Vec dictionaries

Prebuilt and trained Word2Vec dictionaries ready for use

Two different prebuilt big data Word2Vec dictionaries have been added to LOAR (Library Open Access Repository) for download. These dictionaries are built from the text of 55,000 e-books from Project Gutenberg and 32,000,000 Danish newspaper pages.

35,000 of the Gutenberg e-books are English, but over 50 different languages are present in the dictionaries. Even though the languages are mixed, the Word2Vec algorithm did a good job of separating them, so it is almost like having 50 different Word2Vec dictionaries.

The text from the Danish newspapers is not publicly available so you would not be able to build this dictionary yourself. A total of 300GB of raw text went into building the dictionary, so it is probably the largest Word2Vec dictionary built on a Danish corpus. Since the Danish newspapers suffer from low quality OCR, many of the words in the dictionary are misspellings. Using this dictionary it was possible to fix many of the OCR errors due to the nature of the Word2Vec algorithm, since a given word appears in similar contexts despite its misspellings and is identified by its context. (see https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/)

Download and more information about the Word2Vec dictionaries:

Download

Online demo of the two corpora: Word2Vec demo


SolrWayback software bundle has been released

The SolrWayback software bundle can be used to search and play back archived web pages in WARC format. It is an out-of-the-box solution comprising an indexing workflow, Solr, a Tomcat web server and a free-text search interface with playback functionality. Just add your WARC files to a folder and start the index job.

The search interface has additional features besides free-text search. These include:

  • Image search similar to Google Images.
  • Search by uploading a file (image, PDF etc.) to see whether the resource has been harvested and from where.
  • Link graph showing ingoing and outgoing links for domains, using the D3 JavaScript framework.
  • Raw download of any harvested resource from the binary ARC/WARC file.
  • Export of a search result set to a WARC file. The download is streamed, so there is no limit on the size of the result set.
  • An optional built-in SOCKS proxy that can be used to view historical web pages without the browser leaking resources from the live web.

See the GitHub page for screenshots of SolrWayback and scroll down to the install guide to try it out.

Link: SolrWayback

 

[Screenshots: solrwayback_search.png, solrwayback_linkgraph.png, gps_exif_search.png]

 


Visualising Netarchive Harvests

 

An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a frontend visualisation widget we have recently made.

From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days, which enables us to visualise the data. Furthermore, the data is assigned an activity level from 0 to 4.
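The grouping step can be sketched as follows. The widget itself does this in the browser with VueJS and date-fns; the version here is written in Python purely for illustration. It assumes the harvest dates arrive as ISO 8601 strings, and `summarise_harvests` and the sample timestamps are made up for the example.

```python
from collections import Counter
from datetime import datetime

def summarise_harvests(timestamps):
    """Count harvests per year, month, ISO week and day."""
    dates = [datetime.fromisoformat(ts) for ts in timestamps]
    return {
        "years": Counter(d.year for d in dates),
        "months": Counter((d.year, d.month) for d in dates),
        # isocalendar() yields (ISO year, ISO week, weekday); keep the first two.
        "weeks": Counter(tuple(d.isocalendar())[:2] for d in dates),
        "days": Counter(d.date() for d in dates),
    }

# Made-up sample input, not real harvest data.
harvests = [
    "2009-01-05T10:15:00",
    "2009-01-05T22:40:00",
    "2009-01-19T08:00:00",
    "2009-03-02T12:30:00",
]
summary = summarise_harvests(harvests)
```

Keeping the counts per granularity in separate maps means each view (year-month, week-day, all days) can read its own totals without re-scanning the date array.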

The high-level overview seen below is the year-month graph. Each cell is coloured by its activity level, which is relative to the number of harvests in the most active month. For now we use a linear scale: gray means no activity, level 1 is 0-25% of the most active month, and so on up to level 4, which is 75-100% of the most active month. As GitHub fans, we have borrowed the activity-level colouring from the user commit heat map.

[Figure: year-month overview graph]
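The linear activity levels used for the colouring can be sketched like this. The function name `activity_level` is hypothetical, and the handling of exact boundaries (a month at exactly 25% falls in level 1) is an assumption, since the post does not say which side the boundaries belong to.

```python
def activity_level(count, max_count, levels=4):
    """Map a harvest count to 0..levels, linear in the share of max_count."""
    if count <= 0 or max_count <= 0:
        return 0  # no activity: the gray cell
    # Integer ceiling of levels * count / max_count, capped at the top level.
    level = (levels * count + max_count - 1) // max_count
    return min(levels, level)

# Example: the most active month has 40 harvests.
levels_for_counts = [activity_level(c, 40) for c in (0, 1, 10, 11, 30, 31, 40)]
```

Integer arithmetic is used for the ceiling so that counts exactly on a boundary are not pushed into the next level by floating-point rounding.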

 

We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.

[Figure: week-day view of a single year]

 

In the graph above we see the days of all weeks of 2009, with each week shown as a column of days. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011 for this blog post).

[Figure: all days since the first harvest]

 

There are both advantages and disadvantages to doing the data calculation in the browser. One of the main advantages is a very portable frontend application: it can be used with any backend application that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. The drawback of this approach is, of course, scalability. Currently the application processes 25,000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad-core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and is included through a single script tag, styles included.

Ideas for further development.

We would like to expand this to also include both:

  1. multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time, e.g. the URLs http://pol.dk, http://www.pol.dk and https://politiken.dk could all be used for the Danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.
Posted in Blogging, Solr, Web