First timers – code4lib day 1

First timers, please raise your hands! And the room was full of raised hands.

The time is 9.00, the conference has started and everybody’s alert and ready to participate. The atmosphere is great. We’ll add notes throughout the day, so stay tuned. Program at

Edit 9.01 – some slide notes

  • No suits
  • No sanctimony
  • No pretension
  • Humor
  • Self-deprecation
  • Having fun
  • The presenter owns a thong
  • code4lib participants like cake

Edit 9.28, 9.49, 10.00 – keynote by Stefano Mazzocchi

About the historical transition of information media with focus on cons and marginal cost for reproduction:

Speech → Cave → Clay → Fiber → Printed fiber → Electronic publishing

Problems now & future:

  • Degrading consumption experience (resolution, batteries, poor user interface, poor network access)
  • Disrupted business models
  • Disrupted institutions

~0 marginal costs are here to stay (making copying illegal doesn’t change anything). Electronic books à la iPod will come. It’s uneconomical to manage physical copies of non-unique books. Libraries are becoming museums of unique books. No more shelves.

  • What happens to serendipitous discovery? Browsing vs. search.
  • What happens with near infinite storage space? Why do we filter anymore?
  • When indices are created from full text, do you still need metadata? Does it need to be created by humans?

Information is fragmenting: books → pages, journals → articles, web pages, networks of relational assertions. Can the library mindset continue to work with the new granularity? A web of data is coming.

Demo: Freebase Parallax was presented as a way to answer linked questions (“Who was the Russian actor that played in a movie with Al Pacino?”) by navigating sets of data through faceting and search.

How do we get people to add knowledge to our systems? Use games like Image Tag. Freebase made Typewriter, where users extract information from text snippets. Users like to work with it and get obsessive about extracting information (and getting higher scores?). Huge success in a short amount of time. Acre allows users to create information extraction games.

Edit 10.20 – Why libraries should embrace Linked Data by Anders Söderbäck (LIBRIS)

MARC21 since 2001, Linked Data since 2008. “A library catalog must be designed by considering it in the context of the web”. Data first, use vocabularies if possible and make up the rest. Work in progress, hoping for network effect etc. It’s kinda boring, but clearly very useful and we currently suck at this.

Toke hacking away at his Mac-clone

Edit 10.50 – Like a can opener for your data silo: simple access through AtomPub and Jangle

Inside the library world: SRU not widespread, OAI-PMH too simple, DLF ILS-DI API best of breed. Net effect is fragmented. Outside the library world: Atom. AtomPub uses this with REST. Broad client support, broad company support (Google, Microsoft, IBM).

Jangle: Library-related resources over extended Atom (backwards compatible with plain Atom). An in-between system. Jangle vocabulary with entities, actors, collections etc.
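To make the "broad client support" point concrete, here is a minimal sketch of reading a plain Atom feed with nothing but the Python standard library. The feed content, the example.org URLs and the `list_entries` helper are made up for illustration; this is not real Jangle output and it ignores the Jangle extensions entirely.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed, invented for this example (not real Jangle output).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Resources</title>
  <entry>
    <id>http://example.org/resources/1</id>
    <title>First record</title>
    <link rel="edit" href="http://example.org/resources/1"/>
  </entry>
  <entry>
    <id>http://example.org/resources/2</id>
    <title>Second record</title>
    <link rel="edit" href="http://example.org/resources/2"/>
  </entry>
</feed>"""


def list_entries(feed_xml):
    """Return (title, edit URL) pairs for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", namespaces=ATOM_NS)
        # In AtomPub the rel="edit" link is where a client would PUT or DELETE.
        link = entry.find("atom:link[@rel='edit']", ATOM_NS)
        href = link.get("href") if link is not None else None
        entries.append((title, href))
    return entries


entries = list_entries(SAMPLE_FEED)
```

Any Atom-aware client could do the same, which is presumably the whole attraction compared to a library-only protocol.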

Presenter talks way too fast for this note taker to write everything down, sorry. We need to get this guy in a corner and talk details.

Edit 11.00 – LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene (Glen Newton)

Glen Newton at code4lib 2009

Large full-text collection. Heavy text mining, relation discovery. Lucene backend, several 100GB daily builds. They experimented with SOLR, DBSight, Lucen4DB, Hibernate Search, Compass. Overly complicated, not fast enough etc.

User knowledge requirement: SQL, their own tables etc. Nothing else. Basically SQL on Lucene (as the name implies). Focus on parallel processing, fast indexing, low heap overhead. Filters can be set up for input and output. Batch-oriented at the core. Can append, but … no, it’s batch.
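The "you only need to know SQL" idea can be sketched roughly like this: one query goes in, every returned column becomes a searchable field. This is only an illustration of the batch principle in Python, using sqlite3 and a dictionary as a stand-in for the index. It is not LuSql's code, it does not touch Lucene, and the table, the columns and the `build_index` helper are invented for the example.

```python
import sqlite3
from collections import defaultdict


def build_index(conn, query, id_column):
    """Run one SQL query and index every returned column as a field."""
    index = defaultdict(set)  # (field, term) -> set of document ids
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        doc_id = record[id_column]
        for field, value in record.items():
            if field == id_column or value is None:
                continue
            # Naive whitespace tokenizer; a real indexer would use an analyzer.
            for term in str(value).lower().split():
                index[(field, term)].add(doc_id)
    return index


# Hypothetical table with two rows, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?)",
    [
        (1, "Lucene basics", "Indexing full text"),
        (2, "SQL basics", "Querying tables with full joins"),
    ],
)
index = build_index(conn, "SELECT id, title, body FROM article", "id")
```

The whole index is rebuilt from the query result in one pass, which matches the batch orientation described in the talk.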

They do a lot of work that seems applicable to Summa (and the other way around for that matter). Mikkel asks about incremental updates in the evil way: “Have you experimented with incremental updates, say new data every 30 seconds?”.

Edit 11.20 – RESTafarian-ism at the NLA

Terence Ingram at code4lib 2009

  • Simple
  • Language neutral
  • Not SOAP

“It’s too much effort”-driven development. Talk, talk, talk. SOA considered harmful? Cascading errors are hard to handle and debug. Over-engineering, extreme MVC. They built an extremely complicated fine-grained SOA setup that they are now tearing down and simplifying.

They did get some useful things done: Copyright status resolving to machine format and identity service.

Questions: Versioning of API? Answer: No real control, just upgrade and keep eyes open. Worked okay.

Moral: Don’t think too much, just do it!

Edit 11.40 – The Dashboard Initiative (Birkin James Diana)

Birkin James Diana at code4lib 2009

How to communicate data flow to managers and users. Car dashboard as inspiration: Current status only, no details. A set of simple widgets is provided – learn them once, as they are re-used for everything. The widgets provide clean visualisation but that only solves half the problem: Managers still aggregate data (using Excel spreadsheets and similar awful methods).

Managers get an administration interface, where they can customize which widgets to display and how. This might solve the aggregate/re-use problem. The widgets are clickable, leading to detailed information, raw data etc.

The whole system seems simple (maybe a bit too simple?) and it is something we’ve talked about making for ourselves for some time. The system has a fair interface but could maybe use some more work. Made for management (and that’s ok).

Edit 12.00 – Lunch (yay!)

Cranberry mayonnaise!? WTF!?

Edit 13.00 – Open Up Your Repository With a SWORD! (Ed Summers & Mike Giarlo)

Mike Giarlo & Ed Summers at code4lib 2009

Simple Webservice Offering Repository Deposit. Cue sound of lightsabres by iPhone-app.

Normally: OAI-PMH / OAI-ORE. Fine for getting things out, but how to get things in? SWORD answers that by acting as an in-between layer in front of repositories. Atom is used for simple GET, PUT, DELETE with a SWORD-defined vocabulary. Accepted formats, collections etc. are defined by service documents from the repository.
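A service document is essentially an AtomPub service document, so finding out where a client may deposit and in which formats could look something like the sketch below. The sample document, the example.org URLs and the `deposit_targets` helper are made up for illustration, and the SWORD-specific extension elements are left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "app": "http://www.w3.org/2007/app",    # AtomPub namespace
    "atom": "http://www.w3.org/2005/Atom",  # Atom namespace
}

# Hypothetical service document, invented for this example.
SAMPLE_SERVICE_DOCUMENT = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Example repository</atom:title>
    <collection href="http://example.org/deposit/theses">
      <atom:title>Theses</atom:title>
      <accept>application/zip</accept>
    </collection>
    <collection href="http://example.org/deposit/images">
      <atom:title>Images</atom:title>
      <accept>image/jpeg</accept>
      <accept>image/tiff</accept>
    </collection>
  </workspace>
</service>"""


def deposit_targets(service_xml):
    """Map each collection URL to the list of media types it accepts."""
    root = ET.fromstring(service_xml)
    targets = {}
    for collection in root.iterfind("app:workspace/app:collection", NS):
        accepts = [accept.text for accept in collection.findall("app:accept", NS)]
        targets[collection.get("href")] = accepts
    return targets


targets = deposit_targets(SAMPLE_SERVICE_DOCUMENT)
```

A depositing client would pick a collection URL from this mapping and send its package there.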

This looks extremely relevant for our DOMS-people as it is supported by Fedora, DSpace et al.

Edit 13.20 – how I failed to present on using DVCS for archival metadata (Mark A. Matienzo)

Mark A. Matienzo

Quote: I failed, epically.

DVCS in general was great. Mercurial was chosen as test case as it was simple and fast. git was too hard to grasp. bzr was slightly more complicated than Mercurial, monotone used its own protocol and had a rat as icon. Workflow was easy. Diffing and patching was the hard part.

Diff is line-based, which collides with metadata that is hierarchical. XML-diff was too complicated to grasp. Canonical XML seemed interesting as it normalises the order of attributes (and other serialization details), helping with diff. ssddiff, xdiff, logilab xmldiff, ladiff, xydiff, xmlunit, deltaxml, microsoft xmldiff, xml treediff, sun diffmk are choices. Overwhelming in numbers and complexity.
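A small sketch of why canonicalization helps a line-based diff: two records that differ only in attribute order and quoting plus one real change. It uses `xml.etree.ElementTree.canonicalize` (Python 3.8 or newer) and `difflib` from the standard library, none of the tools listed above. The records and the `changed_lines` helper are invented for the example.

```python
import difflib
from xml.etree.ElementTree import canonicalize

# Same record twice: attribute order and quoting differ, and the date changed.
old = """<record id="1" lang="en">
  <title>Letters</title>
  <date>1901</date>
</record>"""
new = """<record lang='en' id='1'>
  <title>Letters</title>
  <date>1902</date>
</record>"""


def changed_lines(a, b):
    """Return only the added/removed lines of a unified diff, without headers."""
    diff = difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm="")
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


raw_changes = changed_lines(old, new)
canonical_changes = changed_lines(canonicalize(old), canonicalize(new))
```

On the raw text the diff also flags the `<record>` line, although nothing meaningful changed there. After canonicalization only the date lines remain. It still does nothing for reordered or re-indented elements, which is where the dedicated XML diff tools come in.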

Patches are not standardized. Formats include xupdate, deltaxml, logilab edit script (not XML) and other script formats. None of the implementations are good, it’s all lock-in. This is fine for hackers, but a problem for non-hackers. There’s no interoperability and the formats are hard to understand. Visualization is hard.

Mikkel asked why Mark didn’t just use the defaults. The answer was that Mark wanted to investigate, not just use the products.

Edit 13.40 – LibX 2.0 – Godmar Back

Godmar Back

A browser-plugin for libraries. Customizable by libraries, with a system that allows for creation and sharing of setups. The toolbar is popular and works, but the future presents SOA, mash-ups, online tutorials, subject guides, visualization, social OPACs etc.

The LibX-people want to solve this by making a shared layer on top of non-controlled web pages, where users can merge library applications into pages. The layer can either add to or replace elements in web pages. It uses Jangle & JavaScript and puts implementations into formalized Modules. Modules are trusted with full access to the machine and LibX-libraries. Tuple Space is used for distributed communication.

This looks a lot like Greasemonkey in functionality, but is very flexible and with Tuple Space as a significant extra.

LibX 2.0: Browser independent code, solidified with documentation and unit-tests. A hierarchy of users is expected with developers, adapters and users of modules. Community-oriented tools are being created over the next 2 years.

Question: Is it okay to change other people’s web pages? Answer: There is a white list, but the question is inherently hard and open.

Question: What about performance on large pages? Answer: The architecture works with timeouts and other care is being taken to ensure performance.

Edit 14.05 – djatoka for dummies

Kevin S. Clarke and John Fereira

Two guys doing a presentation on how they learned and used aDORe djatoka. djatoka is an Open Source JPEG 2000 image server. URL-based image dissemination platform (handles to- and from-conversion between formats other than JPEG 2000).

  • Local file caching with Java API
  • Tile caching
  • OpenURL (id, region, format, rotate, level)
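To give an idea of what "URL-based" means here, the sketch below builds an OpenURL request for an image region. The parameter keys (`rft_id`, `svc_id`, `svc.region` and so on) are written down from memory of the djatoka documentation and should be treated as assumptions to check against the real thing. The resolver address, the image identifier and the `djatoka_region_url` helper are made up.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def djatoka_region_url(resolver, image_id, region=None, fmt="image/jpeg",
                       rotate=0, level=None):
    """Build an OpenURL query asking the resolver for (part of) an image."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": image_id,
        # Service identifiers as recalled from the djatoka docs (assumption).
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": fmt,
        "svc.rotate": rotate,
    }
    if level is not None:
        params["svc.level"] = level
    if region is not None:
        # The region values are passed through in the order given.
        params["svc.region"] = ",".join(str(value) for value in region)
    return resolver + "?" + urlencode(params)


# Hypothetical resolver and image identifier.
url = djatoka_region_url(
    "http://example.org/adore-djatoka/resolver",
    "info:example/image/1",
    region=(0, 0, 256, 256),
    level=3,
)
query = parse_qs(urlsplit(url).query)
```

A tile-based viewer would generate lots of these URLs with different regions and levels, which is why the tile caching matters.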

Issues: Installation was simple. djatoka is Open Source, but Kakadu (handles compression) is not. Kakadu is provided as binaries. Images are recognized by file extensions (upper case). The viewer works best with AJAX, but must reside on the Tomcat.

The presenters showed some demos that indicate that this can be used in a manner like Google Maps et al. Limited experience from production environments is positive. This is probably better suited for Det Kongelige Bibliotek as they handle large images.

Edit 14.50 – Break-out session on SOLR

A lot of miscellaneous talk on SOLR and a bit on Lucene in general. Toke geeked out on term-statistics and Lucene search result merging with relevance preservation in a totally inappropriate manner. Sigh… Anyhoo, it turns out that SOLR’s upcoming faceting system is 100% memory-based (and allegedly fast), working on a principle quite similar to the one Summa’s been using for the last few years, with the exception that it is (presumably) faster at startup. Toke of course jumped the gun on his talk tomorrow and talked about memory usage, probably offending people with the way he asked about it. Sigh again… But people are nice, no retaliations.

There was also a bit of talk about performance in general with the chance to peddle Flash SSDs. Maybe a lightning talk would be in order?

Edit 14.50 – Break-out session on LibX 2.0

Mads and Jørn went to the break-out session on LibX 2.0. However it more or less turned into a Q&A with Mads doing a lot of the Q’ing.

All in all LibX looks to be Greasemonkey (and possibly a bit of Ubiquity) done right (or at least better).

LibX 2.0 is currently Firefox-only, but will be ported to Internet Explorer as well – there are vague plans about porting it to Safari and Chrome (both being WebKit-based), however neither of those has a usable extension API.

The various client-side modules in LibX communicate using Tuple Space. A module will by default consume any message it receives, and can optionally put it back for other modules to use as well.
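The consume-by-default behaviour is easy to picture with a toy tuple space. The sketch below is plain Python and has nothing to do with the actual LibX code (which is JavaScript). The `TupleSpace` class, the two module functions and the message are invented for the example.

```python
class TupleSpace:
    """Toy tuple space: a shared bag of tuples that modules write to and take from."""

    def __init__(self):
        self._tuples = []

    def write(self, tup):
        self._tuples.append(tup)

    def take(self, template):
        """Remove and return the first tuple matching the template (None = wildcard)."""
        for position, tup in enumerate(self._tuples):
            if self._matches(template, tup):
                return self._tuples.pop(position)
        return None

    @staticmethod
    def _matches(template, tup):
        return len(template) == len(tup) and all(
            wanted is None or wanted == actual
            for wanted, actual in zip(template, tup)
        )


def consuming_module(space):
    # Default behaviour: the message is gone once this module has taken it.
    return space.take(("isbn", None))


def sharing_module(space):
    # Optional behaviour: take the message, then put it back for other modules.
    message = space.take(("isbn", None))
    if message is not None:
        space.write(message)
    return message


space = TupleSpace()
space.write(("isbn", "example-isbn"))
first = sharing_module(space)     # sees the message and puts it back
second = consuming_module(space)  # sees the same message and consumes it
third = consuming_module(space)   # nothing left to take
```

So the order in which modules get to a message matters unless every module is polite enough to put it back.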

Edit 16.00 – Lightning talks

A barrage of interesting and/or wacked-out demos, slides, cartoons, talking and inspiration. Totally untranscribable, highly recommended.

About Toke Eskildsen

IT-Developer at with a penchant for hacking Lucene/Solr.

One Response to First timers – code4lib day 1

  1. Hans says:

    Day 2 ??????

    Did Toke survived?
    Did you all survived?
    Are you coming back?

