Prebuilt Big Data Word2Vec dictionaries

                   Prebuilt and trained Word2Vec dictionaries, ready for use

Two prebuilt big data Word2Vec dictionaries have been added to LOAR (Library Open Access Repository) for download. These dictionaries are built from the text of 55,000 e-books from Project Gutenberg and 32,000,000 Danish newspaper pages.

35,000 of the Gutenberg e-books are in English, but over 50 different languages are represented in the dictionary. Even so, the Word2Vec algorithm did a good job of separating the languages, so the result behaves almost like 50 separate Word2Vec dictionaries.

The text from the Danish newspapers is not publicly available, so you would not be able to build this dictionary yourself. A total of 300GB of raw text went into building it, making it probably the largest Word2Vec dictionary built on a Danish corpus. Since the Danish newspapers suffer from low-quality OCR, many of the words in the dictionary are misspellings. Using this dictionary it was possible to fix many of the OCR errors thanks to the nature of the Word2Vec algorithm: a given word appears in similar contexts despite its misspellings and is identified by its context. (See https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/)
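
If the downloaded dictionary is in the standard word2vec format, loading and querying it takes just a few lines with, for example, the gensim library. A minimal sketch, assuming the binary format; the file name is a placeholder for whichever file you fetch from LOAR:

```python
from gensim.models import KeyedVectors

# "danish_newspaper_vectors.bin" is a placeholder name - use the file
# downloaded from LOAR (assumed to be in word2vec binary format).
vectors = KeyedVectors.load_word2vec_format("danish_newspaper_vectors.bin", binary=True)

# Words appearing in similar contexts; for the noisy newspaper corpus this
# will typically include OCR-misspelled variants of the query word.
for word, similarity in vectors.most_similar("statsminister", topn=10):
    print(f"{word}\t{similarity:.3f}")
```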

Download and more information about the Word2Vec dictionaries:

Download

 

Online demo of the two corpora: Word2Vec demo

SolrWayback software bundle has been released

The SolrWayback software bundle can be used to search and play back archived webpages in Warc format. It is an out-of-the-box solution with an index workflow, Solr, a Tomcat webserver and a free-text search interface with playback functionality. Just add your Warc files to a folder and start the index job.

The search interface has additional features besides free-text search, including:

  • Image search similar to Google Images
  • Search by uploading a file (image, PDF, etc.) to see if the resource has been harvested and from where.
  • Link graph showing links (ingoing/outgoing) for domains, using the D3 JavaScript framework.
  • Raw download of any harvested resource from the binary Arc/Warc file.
  • Export of a search result set to a Warc file: streaming download, with no limit on the size of the result set.
  • An optional built-in SOCKS proxy for viewing historical webpages without the browser leaking requests to the live web.

See the GitHub page for screenshots of SolrWayback, and scroll down to the install guide to try it out.

Link: SolrWayback

 

[Image: solrwayback_search.png]
[Image: solrwayback_linkgraph.png]
[Image: gps_exif_search.png]

 


Visualising Netarchive Harvests

 

An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a recent frontend visualisation widget we have made.

From the SolrWayback Machine we can extract an array of dates for all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days, which lets us visualise the data. Furthermore, each period is assigned an activity level from 0 to 4.

The high-level overview seen below is the year-month graph. Each cell is coloured based on the activity level relative to the number of harvests in the most active month. For now we use a linear calculation: grey means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100%. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.
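
The widget itself is written in VueJS, but the bucketing and level calculation is simple enough to sketch in a few lines of Python (illustrative only; the real code lives in the frontend):

```python
from collections import Counter

def activity_levels(harvest_dates):
    """Map (year, month) -> activity level 1-4, relative to the busiest month.
    Months absent from the result have no harvests (level 0, grey).
    Assumes at least one harvest date."""
    per_month = Counter((d.year, d.month) for d in harvest_dates)
    busiest = max(per_month.values())
    # Linear scale: 0-25% of the busiest month -> 1, ..., 75-100% -> 4.
    return {month: 1 + min(3, int(4 * count / busiest))
            for month, count in per_month.items()}
```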

[Image: 1-overview-no-url]

 

We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.

[Image: 2-year-week-no-url]

 

In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011, for this blog post).

[Image: 3-all-years-no-url]

 

There are both advantages and disadvantages to doing the data calculation in the browser. One of the main advantages is a very portable frontend application: it can be used with any backend that outputs an array of dates, and the initial idea was to make the application usable across several different in-house projects. The main drawback is, of course, scalability. Currently the application processes 25,000 dates in about 3-5 seconds on the computer used to develop it (a 2016 quad-core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and is included with a single script tag, styles included.

Ideas for further development

We would like to expand this to also include both:

  1. Multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time; e.g. the URLs http://pol.dk, http://www.pol.dk and https://politiken.dk could all be used for the Danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.

SolrWayback Machine

Another ‘google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive's Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (Warc files). The Danish Internet Archive has over 20 billion harvested web objects and takes up 1 petabyte of storage.

SolrWayback requires that you have indexed the Warc files using the warc-indexer tool from the British Library (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer).

It is quite fast and comes with some additional features as well:

  • Image search similar to Google Images
  • Link graphs showing links (ingoing/outgoing) for domains, using the D3 JavaScript framework.
  • Raw download of any harvested resource from the binary Arc/Warc file.

Unfortunately the collection is not available to the public, so I cannot show you a live demo. But here are a few pictures from the SolrWayback Machine.

[Image: solrwayback_demo]

[Image: solrwayback_linkgraph.png]

SolrWayback at GitHub: https://github.com/netarchivesuite/solrwayback/


juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives:

Combining those two ideas seemed like a logical next step, and juxta was born: a fairly small bash script for creating million-scale collages of images, with no special server-side requirements. There is a small (just 1,000 images) demo at SBLabs.

Presentation principle

The goal is to provide a seamless transition from the full collection to individual items, making it possible to compare nearby items with each other and locate interesting ones. Contextual metadata should be provided for general information and provenance.

Concretely, the user is presented with all images at once and can zoom in to individual images in full size. Beyond a given zoom threshold, metadata are shown for the image currently under the cursor (or finger, if a mobile device is used). An image description is displayed just below the focused image, to avoid disturbing the view, and a link to the source of the image is provided on top.

[Image: kob1_crop]
Overview of historical maps

[Image: kob2_crop]
Metadata for a specific map

Technical notes, mostly on scaling

On the display side, OpenSeadragon takes care of the nice zooming. When the user moves the focus, a tiny bit of JavaScript spatial math resolves image identity and visual boundaries.

OpenSeadragon uses pyramid tiles for display and supports the Deep Zoom protocol, which can be implemented using only static files. The image to display is made up of tiles of (typically) 256×256 pixels. When the view is fully zoomed in, only the tiles within the viewport are requested. When the user zooms out, the tiles from the level above are used. The level above is half the width and half the height and is thus represented by ¼ the number of tiles. And so forth.
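
As a rough illustration of the pyramid arithmetic (not code from juxta or OpenSeadragon, just a sketch of the scheme described above):

```python
import math

def deep_zoom_levels(width, height, tile=256):
    """Tile counts per level, from full resolution down to 1x1 pixel.
    Each level above is half the width and half the height of the one below."""
    levels = []
    while True:
        levels.append((width, height, math.ceil(width / tile) * math.ceil(height / tile)))
        if width <= 1 and height <= 1:
            break
        width, height = max(1, math.ceil(width / 2)), max(1, math.ceil(height / 2))
    return levels

# A collage of 1 million 512x512-pixel images laid out as a 1000x1000 grid:
# the deepest level alone holds 2000 * 2000 = 4,000,000 tiles.
print(deep_zoom_levels(512_000, 512_000)[0])  # (512000, 512000, 4000000)
```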

Generating tiles is heavy

A direct way of creating the tiles is

  1. Create one large image of the full collage (ImageMagick’s montage is good for this)
  2. Generate tiles for the image
  3. Scale the image down to 50%×50%
  4. If the image is larger than 1×1 pixel then goto 2

Unfortunately this does not scale particularly well. Depending on size and tools, it can take up terabytes of temporary disk space to create the full collage image.

By introducing a size constraint, juxta skips the full collage image entirely: all individual source images are scaled and padded to exactly the same size, with width and height that are exact multiples of 256. The tiles can then be created by

  1. For each individual source image, scale, pad and split the image directly into tiles
  2. Create each tile at the level above by joining the corresponding 4 tiles below and scaling the result to 50%×50% size (sketched in code below)
  3. If there is more than 1 tile, or that tile is larger than 1×1 pixel, then go to 2
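
A sketch of step 2, using Pillow rather than the ImageMagick calls the bash script actually makes:

```python
from PIL import Image

def merge_up(tl, tr, bl, br, tile=256):
    """Create one tile of the level above from its four children:
    join the 2x2 block and scale the result to 50% x 50%.
    Children may be None at the right/bottom edge of the collage."""
    canvas = Image.new("RGB", (2 * tile, 2 * tile))
    for child, pos in ((tl, (0, 0)), (tr, (tile, 0)),
                       (bl, (0, tile)), (br, (tile, tile))):
        if child is not None:
            canvas.paste(child, pos)
    return canvas.resize((tile, tile), Image.LANCZOS)
```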

As the tiles are generated directly from either source images or other tiles, there is no temporary storage overhead. And as each source image and each tile is processed individually, parallel processing is simple.

Metadata takes up space too

Displaying image-specific metadata is simple when there are just a few thousand images: Use an in-memory array of Strings to hold the metadata and fetch it directly from there. But when the number of images goes into the millions, this quickly becomes unwieldy.

juxta groups the images spatially in buckets of 50×50 images. The metadata for all the images in a bucket are stored in the same file. When the user moves the focus to a new image, the relevant bucket is fetched from the server and the metadata are extracted. A bucket cache is used to minimize repeat calls.
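
In essence, the lookup maps the focused image's grid position to a bucket file. A minimal sketch; the naming scheme below is made up for the example, and juxta's actual layout may differ:

```python
BUCKET = 50  # bucket side length, in images

def bucket_url(col, row, bucket=BUCKET):
    """Metadata file covering the image at grid position (col, row)."""
    return f"meta/{col // bucket}_{row // bucket}.json"

# The image at column 1234, row 987 lands in "meta/24_19.json".
# A small client-side cache of recently fetched buckets avoids repeat requests.
```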

Most file systems don’t like to hold a lot of files in the same folder

While the limits differ, common file systems such as ext, HFS & NTFS all experience performance degradation with high numbers of files in the same folder.

The Deep Zoom protocol in conjunction with file-based tiles means that the number of files at the deepest zoom level is linear in the number of source images. If there are 1 million source images with a full-zoom size of 512×512 pixels (2×2 tiles), the number of files in a single folder will be 2×2×1M = 4 million, far beyond the comfort zone of the mentioned file systems (see the juxta readme for tests of the performance degradation).

juxta mitigates this by bucketing tiles in sub-folders. This ensures linear scaling of build time at least up to 5-10 million images. 100 million+ images would likely deteriorate build performance markedly, but at that point we are also entering “are there enough free inodes on the file system?” territory.

Unfortunately the bucketing of the tile files is not in the Deep Zoom standard. With OpenSeadragon, it is very easy to change the mapping, but it might be more difficult for other Deep Zoom-expecting tools.

Some numbers

Using a fairly modern i5 desktop and 3 threads, generating a collage of 280 5-megapixel images, scaled down to 1024×768 pixels (4×3 tiles), took 88 seconds, or about 3 images/second. Repeating the experiment with a down-scale to 256×256 pixels (the smallest possible size) raised the speed to about 7½ images/second.

juxta comes with a scale-testing script that generates sample images that are close (but not equal) to the wanted size and repeats them for the collage. With this near-ideal match, processing speed was 5½ images/second for 4×3 tiles and 33 images/second for 1×1 tiles.

The scale-test script has been used up to 5 million images, with processing time practically linear to the number of images. At 33 images/second that is 42 hours.


Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for Digital Humanities in the Nordic Countries 2nd Conference, http://dhn2017.eu/

In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] works on OCR (optical character recognition) text from scanned pages, but the software converting the scanned images to text often makes reading errors. As a result the search engine will miss matching words due to OCR errors. Since many of our newspapers are old and the scans/microfilms are also of low quality, the resulting OCR constitutes a substantial problem. In addition, the OCR converter performs poorly with old font types such as fraktur.

One way to find OCR errors is by using the unsupervised Word2Vec[3] learning algorithm. This algorithm identifies words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words: synonyms, conjugations, declensions, etc. In the case of a corpus with OCR errors, the Word2Vec algorithm will find the misspellings of a given word, whether from bad OCR or, in some cases, from journalists. A given word appears in similar contexts despite its misspellings and is identified by its context. For this to work the Word2Vec algorithm requires a huge corpus; for the newspapers we had 140GB of raw text.

Given the words returned by Word2Vec, we use a Danish dictionary to remove the same word in different grammatical forms. The remaining words are filtered by a similarity measure using an extended version of the Levenshtein distance that takes the length of the word into account, together with an idempotent normalization of frequent one- and two-character OCR errors.

Example: let’s say you use Word2Vec to find words similar to banana and it returns: hanana, bananas, apple, orange. Bananas is removed using the (English) dictionary, since it is not an OCR error. Of the three remaining words, only hanana is close to banana, so it is the only misspelling of banana found in this example. The Word2Vec algorithm does not know how a word is spelled or misspelled; it only uses the semantic and syntactic context.
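
A much-simplified sketch of that filtering step, using gensim for the Word2Vec lookup and a plain edit-distance threshold in place of the extended Levenshtein measure and OCR normalization described above (library choices, thresholds and the vector file name are illustrative):

```python
from gensim.models import KeyedVectors
from Levenshtein import distance  # python-Levenshtein package

def ocr_variants(vectors, word, inflections, topn=20, max_dist=2):
    """Likely OCR misspellings of `word` among its Word2Vec neighbours.
    `inflections` holds the word's dictionary forms, e.g. {"banana", "bananas"}."""
    neighbours = [w for w, _ in vectors.most_similar(word, topn=topn)]
    return [w for w in neighbours
            if w not in inflections              # drop grammatical variants
            and distance(w, word) <= max_dist]   # drop unrelated words like "apple"

# vectors = KeyedVectors.load_word2vec_format("newspaper_vectors.bin", binary=True)
# ocr_variants(vectors, "banana", {"banana", "bananas"})  ->  ["hanana"]
```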

This method is not an automatic OCR error corrector and cannot output corrected OCR. But when searching, it will appear as if you are searching in an OCR-corrected text corpus. Single-word searches on the full corpus give an increase from 3% to 20% in the number of results returned. Preliminary tests on the full corpus show only relatively few false positives among the additional results returned, thus increasing recall substantially without a decline in precision.

The advantage of this approach is a quick win with minimal impact on a search engine[2] based on low-quality OCR. The algorithm generates a text file with synonyms that can be used by the search engine. Not only single-word searches but also phrase searches with highlighting work out of the box. An OCR correction demo[4] using Word2Vec on the Danish newspaper corpus is available on the Labs[5] pages of The State and University Library, Denmark.

[1] Mediestream, The Danish digitized newspaper archive.
http://www2.statsbiblioteket.dk/mediestream/avis

[2] SOLR or Elasticsearch etc.

[3] Mikolov et al., Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/abs/1301.3781

[4] OCR error detection demo (change word parameter in URL)
http://labs.statsbiblioteket.dk/dsc/ocr_fixer.jsp?word=statsminister

[5] Labs for State And University Library, Denmark
http://www.statsbiblioteket.dk/sblabs/

 


70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, not net archive specific features.

Hardware & Solr setup

Search hardware is 4 machines, each with the same specifications & setup:

  • 16 true CPU-cores (32 if we count Hyper Threading)
  • 384GB RAM
  • 25 SSDs @ 1TB (930GB really)

Each machine runs 25 Solr 4.10 instances with an 8GB heap each, every instance handling a single shard on a dedicated SSD. The exception is machine 4, which only has 5 shards because it is still being filled. Everything is coordinated by SolrCloud as a single collection.

[Image: netarchive_search_overview_20161129]

Netarchive SolrCloud search setup

As the Solrs are the only thing running on the machines, it follows that there are at least 384GB-25*8GB = 184GB free RAM for disk cache on each machine. As we do not specify Xms, this varies somewhat, with 200GB free RAM on last inspection. As each machine handles 25*900GB = 22TB index data, the amount of disk cache is 200GB/22TB = 1% of index size.

Besides the total size of 70TB / 16 billion documents, the collection has some notable high-cardinality fields, used for grouping and faceting:

  • domain & host: 16 billion values / 2-3 million unique
  • url_norm: 16 billion values / ~16 billion unique
  • links: ~50 billion values / 20 billion+ unique

Workload and performance

The archive is not open to the public, so the number of concurrent users is low, normally just 1 or 2 at a time. There are three dominant access patterns:

  1. Interactive use: The typical request is a keyword query with faceting on 4-6 fields (domain, year, mime-type…), sometimes grouped on url_norm and often filtered on one or more of the facets (a sketch of such a request follows this list).
  2. Corpus probing: Batch extraction jobs, such as using the stats component for calculating the size of all harvested material, for a given year, for all harvested domains separately.
  3. Lookup mechanism for content retrieval: Very experimental, used similarly to CDX lookups + Wayback display. Such lookups are searches for 1-100 url_field:url pairs OR’ed together, grouped on the url_field and sorted on temporal proximity to a given timestamp.
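
To make pattern 1 concrete, an interactive request could look roughly like the sketch below. Host, collection name and the facet fields beyond domain and url_norm are illustrative, not taken from the actual setup:

```python
import requests

# Sketch of a typical interactive request: keyword query, facets,
# grouping on url_norm and a facet drill-down as filter query.
params = {
    "q": "klimatopmøde",
    "facet": "true",
    "facet.field": ["domain", "crawl_year", "content_type_norm"],
    "group": "true",
    "group.field": "url_norm",
    "fq": "domain:politiken.dk",
    "rows": 20,
    "wt": "json",
}
resp = requests.get("http://solr-host:8983/solr/netarchive/select", params=params).json()
print(resp["grouped"]["url_norm"]["matches"])
```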

Due to various reasons, we do not have separate logs for the different scenarios. To give an approximation of interactive performance, a simple test was performed: extract all terms matching more than 0.01% of the documents, use those terms to create fake multi-term queries (1-4 terms) and perform searches for the queries in a single thread.
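
One way to reproduce that kind of test (a sketch; the field name, handler and host are assumptions, not the actual benchmark code) is to pull frequent terms from Solr's terms component and combine them at random:

```python
import random
import requests

# ~0.01% of 16 billion documents is 1.6 million; only terms above that
# document count are used as building blocks for the fake queries.
resp = requests.get("http://solr-host:8983/solr/netarchive/terms", params={
    "terms.fl": "content_text",
    "terms.limit": 10000,
    "terms.mincount": 1_600_000,
    "json.nl": "map",
    "wt": "json",
}).json()

frequent = list(resp["terms"]["content_text"].keys())
queries = [" ".join(random.sample(frequent, random.randint(1, 4)))
           for _ in range(1000)]
```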

[Image: compare_no-facets_sparse-facets_20161130]

Non-faceted (blue) and faceted (pink) search in the full net archive, bucketed by hit count

The response times for interactive use lie within our stated goal of keeping median response times under 2 seconds. It is not considered a problem that queries with 100M+ hits take a few more seconds.

The strangely low median for non-faceted search in the 10⁸ bucket is due to query size (the number of terms in the query) impacting search speed. The very fast single-term queries dominate this bucket, as very few multi-term queries gave enough hits to land in it. The curious can take a look at the measurements, where the raw test result data are also present.

Moral: Those are artificial queries and should only be seen as a crude approximation of interactive use. More complex queries, such as grouping on billions of hits or faceting on the links field, can take minutes. The worst case discovered so far is 30 minutes.

Secret sauce

  1. Each shard is fully optimized and the corpus is extended by adding new shards, which are built separately. The memory savings from this are large: no need for the extra memory used when updating indexes (which requires 2 searchers to be open temporarily at the same time), and no need for large segment→ordinal maps for high-cardinality faceting.
  2. Sparse faceting means lower latency, lower memory footprint and less GC. To verify this, the performance test above was re-taken with vanilla Solr faceting.
[Image: compare_sparse-facets_solr-facets_20161130]

Vanilla Solr faceting (blue) and sparse faceting (pink) search in the full net archive, bucketed by hit count

Lessons learned so far

  1. Memory. We started out with 256GB of RAM per machine. This worked fine until all 25 Solr JVMs on a machine had expanded up to their 8GB Xmx, leaving ~50GB, or 0.25% of the index size, free for disk cache. At that point performance tanked, which should not have come as a surprise, as we tested this scenario nearly 2 years ago. Alas, quite foolishly, we had relied on the Solr JVMs not expanding all the way up to 8GB. Upping the memory per machine to 384GB, leaving 200GB or 1% of the index size free for disk cache, ensured that interactive performance was satisfactory.
    An alternative would have been to lower the heap for each Solr. The absolute minimum heap for our setup is around 5GB per Solr, but that setup is extremely vulnerable to concurrent requests or memory heavy queries. To free enough RAM for satisfactory disk caching, we would have needed to lower the heaps to 6GB, ruling out faceting on some of the heavier fields and in general having to be careful about the queries issued. Everything works well with 8GB, with the only Out Of Memory incidents having been due to experiment-addicted developers (aka me) issuing stupid requests.
  2. Indexing power: Practically all the indexing work is done in the Tika analysis phase. It took about 40 CPU-core years to build the current version 2 of the index; in wall-clock time it took about 1 year. Fortunately the setup scales practically linearly, so next time we’ll try to allocate 12 power houses for 1 month instead of 1 machine for 12 months.
  3. Automation: The current setup is somewhat hand-held. Each shard is constructed by running a command, waiting a bit less than a week, manually copying the constructed shard to the search cloud and restarting the cloud (yes, restart). In reality it is not that cumbersome, but a lot of time was wasted when the indexer had finished with no one around to start the next batch. Besides, the strictly separated index/search setup means that the content currently being indexed into an upcoming shard cannot be searched.

Looking forward

  1. Keep the good stuff: We are really happy about the searchers being non-mutable and on top of fully optimized shards.
  2. Increase indexing power: This is a no-brainer and “only” a question of temporary access to more hardware.
  3. Don’t diss the cloud: Copying raw shards around and restarting the cloud is messy. Making each shard a separate collection would allow us to use the Collections API for moving them around and an alias to access it all as a single collection (see the sketch after this list).
  4. Connect the indexers to the searchers: As the indexers only handle a single shard at a time, they are fairly easy to scale so that they can also function as searchers for the shards being build. The connection is trivial to create if #3 is implemented.
  5. Upgrade the components: Solr 6 will give us JSON faceting, streaming, SQL-ish support, graph traversal and more. These aggregations would benefit both interactive use and batch jobs. We could do this by upgrading the shards with Lucene’s upgrade tool, but we would rather perform a full re-index, as we also need better data in the index. A story for another time.
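
For point 3, tying per-shard collections together is a single Collections API call. A minimal sketch with illustrative host, alias and collection names:

```python
import requests

# Expose all per-shard collections as one logical search collection.
requests.get("http://solr-host:8983/solr/admin/collections", params={
    "action": "CREATEALIAS",
    "name": "netarchive",
    "collections": "netarchive_shard1,netarchive_shard2,netarchive_shard3",
})
```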

 
