Author Archives: Toke Eskildsen

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.

juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives: The New York Public Library has a very visually pleasing presentation of public domain digitizations, but with a … Continue reading

Posted in Uncategorized | Leave a comment

70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, … Continue reading

Posted in Hacking, Low-level, Performance, Solr, Statsbiblioteket, Uncategorized | 6 Comments

CDX musings

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry. There is currently some debate on CDX, used by the Wayback Engine, Open Wayback and other web archive oriented tools, such as … Continue reading

Posted in Uncategorized | Leave a comment

Faster grouping, take 1

A failed attempt at speeding up grouping in Solr, with an idea for the next attempt. Grouping at a Statsbiblioteket project: We have 100M+ articles from 10M+ pages belonging to 700K editions of 170 newspapers in a single Solr shard. It … Continue reading

Posted in Uncategorized | Leave a comment

The ones that got away

Two and a half ideas for improving Lucene/Solr performance that did not work out. Track the result set bits: At the heart of Lucene (and consequently also Solr and ElasticSearch), there is a great amount of doc ID set handling. … Continue reading

Posted in eskildsen, Hacking, Low-level, Lucene, open source, Performance, Solr | Leave a comment

Speeding up core search

Issue a query, get back the top-X results. It does not get more basic with Solr. So it is a great win if we can improve on that, right? Truth be told, the answer is still “maybe”, but read on for some thoughts, … Continue reading

Posted in eskildsen, Hacking, Low-level, Lucene, open source, Performance, Solr, Uncategorized | 1 Comment

Light validation of Solr configuration

This week we were once again visited by the Edismax field alias bug in Solr: Searches with boosts, such as foo^2.5, stopped working. The problem arises when an alias with one or more non-existing fields is defined in solrconfig.xml and … Continue reading

Posted in eskildsen, Solr | Leave a comment
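To make the alias problem from the teaser above concrete, here is a minimal sketch of one possible check. It is not the validator from the full post. It assumes that Edismax aliases are written as `f.<alias>.qf` parameters in solrconfig.xml, that their values are whitespace-separated field names with optional boosts, and that the schema declares fields with `field` and `dynamicField` elements. The function names are made up for illustration.

```python
import fnmatch
import xml.etree.ElementTree as ET


def alias_fields(solrconfig_xml):
    """Map each edismax alias (f.<alias>.qf) to the field names it expands to."""
    aliases = {}
    for elem in ET.fromstring(solrconfig_xml).iter("str"):
        name = elem.get("name", "")
        if len(name) > 5 and name.startswith("f.") and name.endswith(".qf"):
            alias = name[2:-3]
            # Assumption: entries are whitespace-separated and may carry a
            # boost such as author^2.5; only the field name is kept.
            fields = [f.split("^")[0] for f in (elem.text or "").split()]
            aliases.setdefault(alias, []).extend(fields)
    return aliases


def schema_fields(schema_xml):
    """Return (explicit field names, dynamicField glob patterns) from a schema."""
    root = ET.fromstring(schema_xml)
    exact = {e.get("name") for e in root.iter("field")}
    patterns = [e.get("name") for e in root.iter("dynamicField") if e.get("name")]
    return exact, patterns


def undefined_alias_fields(solrconfig_xml, schema_xml):
    """Return {alias: [fields that are neither in the schema nor aliases]}."""
    aliases = alias_fields(solrconfig_xml)
    exact, patterns = schema_fields(schema_xml)
    problems = {}
    for alias, fields in aliases.items():
        missing = [
            f for f in fields
            if f not in exact
            and f not in aliases  # an alias may point at another alias
            and not any(fnmatch.fnmatchcase(f, p) for p in patterns)
        ]
        if missing:
            problems[alias] = missing
    return problems
```

Calling `undefined_alias_fields` with the text of solrconfig.xml and schema.xml returns a dict of aliases that point at unknown fields. An empty dict means this particular check found nothing to complain about. It says nothing about other configuration problems.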