Archive for February, 2009

Lightning Talks

February 26, 2009

On Wednesday, Toke delivered a lightning talk about the marvels of solid state drives. Unfortunately, Toke needed two tries, as his images (graphs) were not displayed on the conference Mac the first time. People were clearly impressed by the speed of these things, and the next presenter actually mentioned that he will look into SSDs for his project (HathiTrust).

Toke doing his lightning talk

Mikkel has just delivered his lightning talk on Summa and he enticed the crowd with his storytelling approach and cool graphics on the slides.

Mikkel doing his lightning talk

It’s been great, just super. Hans, we’re sending at least two guys next year, you just need to find the money!

And now… New York!

Code4Lib day 3

February 26, 2009

Ian Davis, the Talis CTO, gave the keynote this Thursday morning. The topic was information sharing in the past and in the future. Ian claimed that the success of the web was not just about free (free as in free beer) information sharing, but also very much based on the ability to link documents.

I must admit that it felt a bit like he was preaching to the choir. The main point was that we should open up data and enable a rich semantic web of linked data. Hard to disagree with :-)

One point that I feel is worth stressing was a conjecture he put up on his slides: Conjecture: Data outlasts code. This leads to the following: Corollary: Open data is more important than open source. I am not sure that I necessarily agree, but it is food for thought.

Next up was Edward M. Corrado, the head of library technology at Binghamton University. The topic of his talk was the “Open Platform Strategy”. At the core of it, I think his main point was that vendors are starting to open up, and even though it is not completely open source we should embrace their initiative (OCLC and Ex Libris are both doing this). It seems to me a very pragmatic approach (and I love that!), but also a road that could lead to vendor lock-in.

Seen in the light of Sebastian’s (Index Data) thunder talk for open standards yesterday I guess Edward and Sebastian could have a heated debate… :-)

“A modern open webservice-based GIS infrastructure” by Adam Soroka & Bess Sadler

GIS systems need special data repositories because GIS data is weird:

  • The data is huge (terabyte scale)
  • It lives in odd formats
  • It requires special tools for use
  • It deserves special descriptions

The Open Geospatial Consortium produces standards (not tools) for GIS data. There exist several standards for GIS data:

  • ISO TC 211
  • GML (Geography Markup Language)
  • ISO 19115 (UML based) / ISO 19139 (standard for serializing 19115 to XML)

Service standards (retrieve data):

  • Web Map Service – the query is simple key/value pairs, and it can return a variety of formats (PDF, PNG, JPEG etc.)
  • Web Feature Service – returns GML (Geography Markup Language)
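
The "simple key/value" query of a Web Map Service can be sketched roughly like this. This is only an illustration, assuming a WMS 1.1.1 server: the base URL and the layer name are made-up placeholders and were not part of the talk.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_getmap_url(base_url, layer, bbox, width=512, height=512,
                     image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL from plain key/value parameters."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,                 # hypothetical layer name
        "STYLES": "",                    # empty means the server default style
        "SRS": "EPSG:4326",              # plain latitude/longitude coordinates
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,          # the server decides which formats exist
    }
    return base_url + "?" + urlencode(params)


# Placeholder server and layer, for illustration only.
wms_url = build_getmap_url("http://example.org/geoserver/wms",
                           "demo:roads", (-180, -90, 180, 90))
wms_params = parse_qs(urlsplit(wms_url).query)
print(wms_url)
```

Changing FORMAT is all it takes to ask for a different output type, as long as the server advertises it in its capabilities document.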

Tools needed to get a GIS running:

  • Database
  • GeoServer
  • GeoNetwork

“Visualizing Media Archives: A Case Study” by Chris Beer and Courtney Michael.

Media archives are different. Important to present media data in context. Linked data is used to present data while keeping them in context.

Media archives are visual, and traditional library interfaces are almost all text-based. So what they have made is a system that displays images and their relations graphically in a graph – the graph is interactive, allowing the user to browse through the data and change their focus.

(Notes by Mikkel, Mads & Jørn)

code4lib day 2

February 26, 2009

(Notes by Mads and Jørn, shoveled in by Toke, let’s hope we find the time for proper edits later – we’re running very fast here, but it’s great)

Keynote: Sebastian Hammer

Journals and books will disappear faster than we think.

Maybe we don’t need to follow market forces – sometimes they fail. We need to do more of the boring stuff – i.e. less code, more standards.

The libraries need to be more proactive.

What (local) libraries do well:

  • Preservers of cultural heritage
  • Conveyers of authoritative information
  • Supporters of learning and research
  • Pillars of democracy

Libraries need to be the more open alternative to the commercial players – ask questions, put up a fight.

It needs to be a lot easier to put systems together – standards are needed for collaboration.

Systems and organisations need to surrender data freely.

The Open Library Environment
Lifting the ILS up to be a more integrated part of existing systems – i.e. if there is already an acquisition system, then talk to that, and integrate with the existing single sign-on solution. There is no code yet from the Open Library Environment, but many plans for architectural components.

OLE is defining the core service-oriented architecture, but they are very interested in feedback, and for that they have put up a survey:

http://oleproject.org/2009/02/24/ole-project-related-applications-servey/

A running system is 2 to 3 years away.

Blacklight as a unified discovery platform
“Yet another next-generation catalog”. Very much about serendipity. A lot of the “aha” moments come when you are browsing – and that is being taken away from the users by just giving them the exact electronic versions, and there is currently nothing there to replace this serendipity.

Solr as a backend, so it has relevancy ranking, faceting, Unicode support, etc. All of which are great for the user experience. Blacklight also has permanent URLs, RSS feeds, and more. Allows RESTful access to their data to make it easier for other people to do mashups.

No single interface can fit all – i.e. chemistry students have different needs than students from the music department. So they have made specific search interfaces available to help answer the common questions posed by the different groups, for example searching by musical instruments. They create new fields when indexing to facilitate searching; for instance, based on the year the music was composed, they create groups of genres.

Blacklight also contains all the data from their Fedora installation using gsearch so they get live updates to their Solr index.

Items from collections have different behavior, i.e. a scanned picture is displayed differently than a scanned book.

Cooperates with VuFind to index MARC in Solr (SolrMarc), and also does a lot with marc4j. Standardizing on Jangle.

A new platform for open data – biblios.net web services – Galen Charlton (LibLime)

  • libraries agree with principles
  • not all have the staff
  • sell services

open data – the final frontier

  • libs provide licences but not open dt th cm

open library project

open data commons licence

interface libs for UI’s

biblios.net

  • free browser-based cataloging service
  • pazpar2
  • webservices – to interact with biblios.net – push/pull records
  • push records to create standardized access
  • support for SRU/OpenSearch search
  • multiple formats (XSL transformations) MARC/DC/MODS
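
For readers who have not seen SRU, a search is just an HTTP GET with a handful of standard parameters. The sketch below assumes SRU 1.1. The endpoint URL and the record schema name are placeholders, since the notes do not say what biblios.net actually exposes.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, max_records=10, record_schema="marcxml"):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,              # CQL, e.g. dc.title = "summa"
        "maximumRecords": max_records,
        "recordSchema": record_schema,   # assumption: schema names are server-defined
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
sru_url = build_sru_url("http://example.org/sru", 'dc.title = "summa"')
sru_params = parse_qs(urlsplit(sru_url).query)
print(sru_url)
```

The response would be an XML document wrapping the matching records in the requested schema, which is where the XSL transformations between MARC, DC and MODS come in.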

Extended biblios – the open source web based metadata editor – Chris Catalfo

  • implementing plugins to extend biblios.
  • loads any plugin defined in biblio’s config file
  • example – create editor plugin
  • example – adding Extjs/documents using CouchDB as backend
  • example – listening to biblios.app events

What we talk about when we talk about FRBR – Jodi Schneider

We talk about different things when talking about FRBR

If I refer to a book, there are different places to point to but no canonical version. We would like one place. Same with the library catalogue – can I get it? FRBR is about relationships.

Weak idea of FRBR:

  • group manifestation
  • work – manifestation relation
  • work state grouping – xISBN/LibraryThing service (group manifestation service)
  • status of FRBR work set grouping

There is more to it

  • work set grouping does not say anything about contents

Less weak FRBR:

  • open library –> example of work identifier (instead of just grouping)

Strong FRBR

  • we must have items and expressions related back to work and manifestation

example: LC FRBR display tool

  • beyond just works and manifestations

example: VTLS catalogue

  • collections by works/manifestations (all of author)
  • figure/group 1 & 2 & 3 entities (the complete FRBR notion)
  • connect and remix, link up with other systems: representation of FRBR in RDF (IFLA, Davis & Newman), LIBRIS linked data (FRBR related tag), id.loc.gov

We are going for a (re)usable bibliographic universe!

What can we do?

  • demand strong FRBR
  • build linked data (freebase)
  • create the algorithms – share under open license

Complete faceting
Toke gave a good and very technical (read: super geeky) talk on Summa’s faceting system. Of course some people didn’t understand a word he said (that’s not necessarily a bad thing), but those who did get it were very excited. Toke was a very ‘in demand’ person after his talk. We are looking forward to seeing which opportunities this will bring.

The Rising Son – making the most of Solr

performance

  • java is not slow, measure.

memory

  • omitNorms, omitTf

query parsing

  • “it depends”

Data import

  • …DIH, Solr Cell, CSV, LuSql…..
  • SOLR ruby mapper

request handler

Solr as IR toolkit

  • frontend to Lucene

LocalSolr

  • geo searching, submit lat/long queries

Faceting

  • Solr 1.4 performance – a dramatic advance in performance. Multi-select facets.
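
Multi-select faceting in Solr 1.4 works by tagging a filter query and excluding that tag when counting the facet, so the other values of an already selected facet stay visible. A rough sketch of building such a request follows. The Solr URL and the field names "format" and "language" are assumptions for illustration, not something shown in the talk.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

FACET_FIELDS = ("format", "language")  # hypothetical index fields


def build_facet_url(base_url, user_query, selected=None):
    """Build a Solr select URL with multi-select facets.

    selected maps a facet field to the single value the user clicked.
    """
    selected = selected or {}
    params = [("q", user_query), ("rows", "10"), ("facet", "true")]
    for field in FACET_FIELDS:
        if field in selected:
            # Tag the filter, then exclude the tag when counting this facet,
            # so counts for the unselected values are still returned.
            params.append(("fq", '{!tag=%s}%s:"%s"' % (field, field, selected[field])))
            params.append(("facet.field", "{!ex=%s}%s" % (field, field)))
        else:
            params.append(("facet.field", field))
    return base_url + "?" + urlencode(params)


facet_url = build_facet_url("http://localhost:8983/solr/select",
                            "darwin", {"format": "book"})
facet_params = parse_qs(urlsplit(facet_url).query)
print(facet_url)
```

Without the tag and exclusion, selecting "book" would reduce every other format count to zero, which makes it impossible to pick a second value in the same facet.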

User interface

  • “The interface is the app”. Abandon the bottom-up approach (data to app). Think app first and bring it down (app to data).

SolrJS library – standard AJAX components

Lucid articles – tutorials – podcasts – blog

FreeCite – An open source free-text citation parser
Background:
Example – Brown searchable database – no citation links.
The first step is parsing the data into the relevant fields. FreeCite handles this.
FreeCite has a webservice API.
What does FC do?
Response: OpenURL – ContextObject and JSON. “This is rocket surgery” –> the data is not clean (it would be trivial if the data was well-formed). E.g. letters in the volume, or title and journal name right after each other.

freecite.library.brown.edu

Great facets, like your relevance, but can I have links to Amazon and Google Book Search – squeezing more out of the OPAC

Goldrush discovery UI, a lot of examples. “But we want more!”. Lots of individual hacking... no similarity, and that gives low reuse. We need component-like extensions. The answer is JUICE. Examples of content pulled in via iframes, text-to-speech (read-aloud), and fade-down integrated on the web site (the Greasemonkey way).

JUICE (supported by jQuery) –> metadefs –> panels –> extension
- a few JS lines injected into the page

Why JUICE?

  • easy to copy/paste
  • easy to create new by modifying
  • shares OPAC specific knowledge
  • very little product specific dependencies
  • open source

juice-project.code.google.com

Freebasing for fun and enhancement – Sean Hannan

REST, api.freebase.com. Returns JSON.
Example: Academy Award winners. Freebase schemas define the fields. Acre – templating. Subquerying – open a new JSON block and query again.

First timers – code4lib day 1

February 24, 2009

First timers, please raise your hands! And the room was full of raised hands.

The time is 9.00, the conference has started and everybody’s alert and ready to participate. The atmosphere is great. We’ll add notes throughout the day, so stay tuned. Program at http://www.code4lib.org/conference/2009/schedule

Edit 9.01 – some slide notes

  • No suits
  • No sanctimony
  • No pretension
  • Humor
  • Self-deprecation
  • Having fun
  • The presenter owns a thong
  • code4lib participants like cake

Edit 9.28, 9.49, 10.00 – keynote by Stefano Mazzocchi

About the historical transition of information media with focus on cons and marginal cost for reproduction:

Speech → Cave → Clay → Fiber → Printed fiber → Electronic publishing

Problems now & future:

  • Degrading consumption experience (resolution, batteries, poor user interface, poor network access)
  • Disrupted business models
  • Disrupted institutions

~0 marginal costs are here to stay (making copying illegal doesn’t change anything). Electronic books à la iPod will come. It’s uneconomical to manage physical copies of non-unique books. Libraries are becoming museums of unique books. No more shelves.

  • What happens to serendipitous discovery? Browsing vs. search.
  • What happens with near infinite storage space? Why do we filter anymore?
  • When indices are created from full text, do you still need meta data? Does it need to be created by humans?

Information is fragmenting: Books → pages, journals → articles, web pages, networks of relational assertions. Can the library mindset continue to work with the new granularity? A web of data is coming.

Demo: Freebase Parallax was presented as a way to answer linked questions (“Who was the Russian actor that played in a movie with Al Pacino?”), by navigating sets of data through faceting and search.

How do we get people to add knowledge to our systems? Use games like Image Tag. Freebase made Typewriter, where users extract information from text snippets. Users like to work with it and get obsessive about extracting information (and getting higher scores?). Huge success in a short amount of time. Acre allows users to create information extraction games.

Edit 10.20 – Why libraries should embrace Linked Data by Anders Söderbäck (LIBRIS)

MARC21 since 2001, Linked Data since 2008. “A library catalog must be designed by considering it in the context of the web”. Data first, use vocabularies if possible and make up the rest. Work in progress, hoping for network effect etc. It’s kinda boring, but clearly very useful and we currently suck at this.

Toke hacking away at his Mac-clone

Edit 10.50 – Like a can opener for your data silo: simple access through AtomPub and Jangle

Inside the library world: SRU not widespread, OAI-PMH too simple, DLF ILS-DI API best of breed. Net effect is fragmented. Outside the library world: Atom. AtomPub uses this with REST. Broad client support, broad company support (Google, Microsoft, IBM).

Jangle: Library-related resources over extended Atom (backwards compatible to plain Atom). An in-between system. Jangle vocabulary with entities, actors, collections etc.

Presenter talks way too fast for this note taker to write everything down, sorry. We need to get this guy in a corner and talk details.

Edit 11.00 - LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene (Glen Newton)

Glen Newton at code4lib 2009

Large full text collection. Heavy text mining, relation discovery. Lucene backend, several 100GB daily builds. They experimented with SOLR, DBSight, Lucen4DB, Hibernate Search, Compass. Overly complicated, not fast enough etc.

User knowledge requirement: SQL, their own tables etc. Nothing else. Basically SQL on Lucene (as the name implies). Focus on parallel processing, fast indexing, low heap overhead. Filter setup input and output. Batch-oriented at the core. Can append, but … no, it’s batch.

They do a lot of work that seems applicable to Summa (and the other way around for that matter). Mikkel asks about incremental updates in the evil way: “Have you experimented with incremental updates, say new data every 30 seconds?”.

Edit 11.20 – RESTafarian-ism at the NLA

Terence Ingram at code4lib 2009

Why REST?

  • Simple
  • Language neutral
  • Not SOAP

“It’s too much effort”-driven development. Talk, talk, talk. SOA considered harmful? Cascading errors are hard to handle and debug. Over-engineering, extreme MVC. They built an extremely complicated fine-grained SOA setup that they are now tearing down and simplifying.

They did get some useful things done: Copyright status resolving to machine format and identity service.

Questions: Versioning of API? Answer: No real control, just upgrade and keep eyes open. Worked okay.

Moral: Don’t think too much, just do it!

Edit 11.40 – The Dashboard Initiative (Birkin James Diana)

Birkin James Diana at code4lib 2009

How to communicate data flow to managers and users. Car dashboard as inspiration: Current status only, no details. A set of simple widgets is provided – learn them as they are re-used for everything. The widgets provide clean visualisation but that only solves half the problem: Managers still aggregate data (using Excel spreadsheets and similar awful methods).

Managers get an administration interface, where they can customize what and how to display widgets. This might solve the aggregate/re-use problem. The widgets are clickable leading to detailed information, raw data etc.

The whole system seems simple (maybe a bit too simple?) and it is something we’ve talked about making for ourselves for some time. The system has a fair interface but could maybe use some more work. Made for management (and that’s ok).

Edit 12.00 – Lunch (yay!)

Cranberry mayonnaise!? WTF!?

Edit 13.00 – Open Up Your Repository With a SWORD! (Ed Summers & Mike Giarlo)

Mike Giarlo & Ed Summers at code4lib 2009

Simple Webservice Offering Repository Deposit. Cue sound of lightsabres by iPhone-app.

Normally: OAI-PMH / OAI-ORE. Fine for getting things out, but how to get things in? SWORD answers that by acting as an in-between layer with repositories. Atom is used for simple GET, PUT, DELETE with SWORD-defined vocabulary and definitions of accepted formats, collections etc. defined by service documents from the repository.

This looks extremely relevant for our DOMS-people as it is supported by Fedora, DSpace et al.

Edit 13.20 – how I failed to present on using DVCS for archival metadata (Mark A. Matienzo)

Mark A. Matienzo

Quote: I failed, epically.

DVCS in general was great. Mercurial was chosen as test case as it was simple and fast. git was too hard to grasp. bzr was slightly more complicated than Mercurial, monotone used its own protocol and had a rat as icon. Workflow was easy. Diffing and patching was the hard part.

Diff is line-based, which collides with metadata that is hierarchical. XML-diff was too complicated to grasp. Canonical XML seemed interesting as it normalises the order of elements and attributes, helping with diff. ssddiff, xdiff, logilab xmldiff, ladiff, xydiff, xmlunit, deltaxml, microsoft xmldiff, xml treediff, sun diffmk are choices. Overwhelming in numbers and complexity.
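
A small sketch of why canonicalisation helps a line-based diff. It uses the C14N support in Python's standard library (Python 3.8 or newer), not any of the tools from the talk, and the two sample records are made up. Note that this implementation sorts attributes and normalises quoting; it leaves the order of child elements alone.

```python
import difflib
from xml.etree.ElementTree import canonicalize

# Two serialisations of the same (made-up) metadata record.
record_a = '<record id="42" lang="en"><title>Summa</title></record>'
record_b = "<record lang='en'   id='42'><title>Summa</title></record>"

# A plain text diff sees a change even though the data is identical.
raw_diff = list(difflib.unified_diff([record_a], [record_b], lineterm=""))

# After canonicalisation the two serialisations are byte-for-byte equal.
canon_a = canonicalize(record_a)
canon_b = canonicalize(record_b)
canon_diff = list(difflib.unified_diff([canon_a], [canon_b], lineterm=""))

print("raw diff lines:", len(raw_diff))
print("canonical diff lines:", len(canon_diff))
```

This only removes noise from serialisation differences. Real structural changes, such as a moved element, still show up as confusing line diffs, which is the harder problem described above.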

Patches are not standardized. Names are xupdate, deltaxml, logilab edit script (not XML), other script formats. No implementations are good, it’s all lock-in. This is fine for hackers, but a problem for non-hackers. There’s no interoperability, and they are hard to understand. Visualization is hard.

Mikkel asked why Mark didn’t just use the defaults. The answer was that Mark wanted to investigate, not just use the products.

Edit 13.40 – LibX 2.0 – Godmar Back

Godmar Back

A browser-plugin for libraries. Customizable by libraries, with a system that allows for creation and sharing of setups. The toolbar is popular and works, but the future presents SOA, mash-ups, online tutorials, subject guides, visualization, social OPACS etc.

The LibX-people want to solve this by making a shared layer on top of non-controlled web pages, where users can merge library applications into pages. The layer can either be additions or replacements of elements in web pages. It uses Jangle & JavaScript and puts implementations into formalized Modules. Modules are trusted with full access to the machine and LibX-libraries. Tuple Space is used for distributed communication.

This looks a lot like Greasemonkey in functionality, but is very flexible and with Tuple Space as a significant extra.

LibX 2.0: Browser independent code, solidified with documentation and unit-tests. A hierarchy of users is expected with developers, adapters and users of modules. Community-oriented tools are being created over the next 2 years.

Question: Is it okay to change other people’s web pages? Answer: There is a white list, but the question is inherently hard and open.

Question: What about performance on large pages? Answer: The architecture works with timeouts and other care is being taken to ensure performance.

Edit 14.05 – djatoka for dummies

Kevin S. Clarke and John Fereira

Two guys doing a presentation on how they learned and used aDORe djatoka. djatoka is an Open Source JPEG 2000 image server. URL-based image dissemination platform (handles to- and from-conversion between formats other than JPEG 2000).

  • Local file caching with Java API
  • Tile caching
  • OpenURL (id, region, format, rotate, level)
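
To give an idea of what the OpenURL interface looks like, here is a sketch that builds a region request. The parameter names follow the djatoka documentation as best I recall it, so treat them as assumptions to check against your installation. The resolver URL and the image identifier are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_region_url(resolver_url, image_id, region=None, level=None,
                     rotate=0, image_format="image/jpeg"):
    """Build an OpenURL asking djatoka for (a region of) an image."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": image_id,                            # id of the JPEG 2000 image
        "svc_id": "info:lanl-repo/svc/getRegion",      # assumed service id
        "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
        "svc.format": image_format,                    # output format
        "svc.rotate": rotate,
    }
    if level is not None:
        params["svc.level"] = level                    # resolution level
    if region is not None:
        params["svc.region"] = region                  # passed through as given
    return resolver_url + "?" + urlencode(params)


# Placeholder resolver and image id, for illustration only.
region_url = build_region_url("http://example.org/adore-djatoka/resolver",
                              "http://example.org/images/map.jp2",
                              region="0,0,256,256", level=3)
region_params = parse_qs(urlsplit(region_url).query)
print(region_url)
```

A tiled viewer in the style of Google Maps would issue many such requests, one per tile, which is why the tile caching matters.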

Issues: Installation was simple, djatoka is Open Source, but Kakadu (handles compression) is not. Kakadu is provided as binaries. Images are recognized by file extensions (upper case). The viewer works best with AJAX, but must reside on the Tomcat.

The presenters showed some demos that indicate that this can be used in a manner like Google Maps et al. Limited experience from production environments is positive. This is probably better suited for Det Kongelige Bibliotek as they handle large images.

Edit 14.50 – Break-out session on SOLR

A lot of miscellaneous talk on SOLR and a bit on Lucene in general. Toke geeked out on term statistics and Lucene search result merging with relevance preservation in a totally inappropriate manner. Sigh… Anyhoo, it turns out that SOLR’s upcoming faceting system is 100% memory-based (and allegedly fast), working on a principle quite similar to the one Summa has been using for the last few years, with the exception that it is (presumably) faster at startup. Toke of course jumped the gun on his talk tomorrow and talked about memory usage, probably offending people with the way he asked about it. Sigh again… But people are nice, no retaliations.

There was also a bit of talk about performance in general with the chance to peddle Flash-SSD’s. Maybe a lightning talk would be in order?

Edit 14.50 – Break-out session on LibX 2.0
Mads and Jørn went to the break-out session on LibX 2.0. However it more or less turned into a Q&A with Mads doing a lot of the Q’ing.

All in all LibX looks to be Greasemonkey (and possibly a bit of Ubiquity) done right (or at least better).

LibX 2.0 is currently Firefox-only, but will be ported to Internet Explorer as well – there are vague plans about porting it to Safari and Chrome (both being Webkit-based), however none of those have a usable extension API.

The various client side modules in LibX communicate using Tuple Space. A module will by default consume any message it receives, and can optionally put it back for other modules to use as well.
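
The consume-by-default behaviour can be illustrated with a toy tuple space. This is not LibX code (LibX modules are JavaScript); the class and the message format are invented purely to show the take/put-back idea.

```python
class TupleSpace:
    """A toy, single-process tuple space: a shared bag of messages."""

    def __init__(self):
        self._tuples = []

    def put(self, message):
        """Add a message for any interested module to pick up."""
        self._tuples.append(message)

    def take(self, matches):
        """Remove and return the first message accepted by matches, else None."""
        for index, message in enumerate(self._tuples):
            if matches(message):
                return self._tuples.pop(index)
        return None


def is_isbn(message):
    return message[0] == "isbn"


space = TupleSpace()
space.put(("isbn", "0262033844"))

first = space.take(is_isbn)    # module A consumes the message
second = space.take(is_isbn)   # module B finds nothing: it was consumed
space.put(first)               # module A decides to put it back
third = space.take(is_isbn)    # now module B can use it as well
print(first, second, third)
```

The upside of this design is that modules never need to know about each other, only about the shape of the messages they care about.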

Edit 16.00 – Lightning talks

A barrage of interesting and/or wacked-out demos, slides, cartoons, talking and inspiration. Totally untranscribable, highly recommended.

Code4Lib travel tale day 0

February 23, 2009

The trip to the US went quite well. Of course Mikkel decided to give us a little scare by turning up at the last minute while keeping his mobiles switched off. Naturally Mads was pulled aside by the security people in Billund and body searched. Even his shoes were x-rayed. After Mads had cleared security we were off to Amsterdam. Here Toke had a small slip-up with his filled water bottles as we didn’t count on going through security again. This time security decided to pull us aside and interview us as a group. “Are you friends or colleagues?”. “How long have you known each other?”. We answered politely and Toke discarded his water and once again we were airborne.

After a couple of movies and some games we landed in Detroit. Again security literally barked up our tree as they demanded that all passengers “form a single line against the wall”. The purpose of this was luckily enough not to shoot us but to let a dog sniff us and our bags.

After a short layover with yoghurt and coffee we were off to Providence. Happy to tell we arrived safely and in one piece.

Pics later….

Into the night – code4lib day -1

February 22, 2009

It was a dark and stormy night when three travellers set out on a journey to the land of code for libraries. A dog barked when the wind picked up the remainder of leaves long fallen. Moving in haste by magical metal steed, the companions arrived at the aerial portal. The scrolls revealed that the next chance for a shift in planes was hours away, just as the expected arrival of the fabled Mikkel, he who would prefer to be designated Rebel. The warmth of a humble and conveniently placed inn won over blankets and stars. The day ended and sleep came quickly.

Steely steed awaiting heroic riders

Mat on, mat off

February 17, 2009

A mat materialized in front of the coffee machine and with such disruption of the local tranquility, the office exploded in frantic chatter and predictions of future coffee-drop-driven civilizations in the fabric. None were pleased!

Yes, we know that we are a bunch of ungrateful whiners. Luckily for world peace and office ditto, the offender was promptly removed when our mat oriented disdain was discovered. Thanks, kind and understanding cafeteria people!

documentation for mat removal

The altar at 14:10 and 14:19

code4lib 2009 is next week. Spirits are high and presentation slides are being created. No worries though, none of us are fans of the “Cram 2000 words into each slide”-strategy. In fact, the thought of no slides at all was considered, but quickly discarded – it’s pretty hard to explain multi-level redirecting pointer structures with hand waving. … Come to think of it, it’s also pretty hard to explain by presentation slides. But it’s code4lib. The audience will be bright enough to make sense out of our ramblings.

Usability testing our Summa front-end

February 4, 2009

We are in the midst of conducting a usability test of Search, which is our front-end to Summa. The test is a straightforward usability test where 8 users are invited to try out the system, and along the way they are given different tasks to ensure that they get around most corners of the site. While the users try it out (one at a time) and solve tasks, they are observed and we log as much data as we can, both by taking notes during each test and by looking through session-based log data afterwards. As we are the makers of the system, we have hired an external company to prepare and conduct the test to ensure a high degree of impartiality in the results.

It is always useful to “lay it on the line” and let others have a look at your work and tell you what you can do better and what works. We are looking forward to receiving the final report.

Road trip to code4lib

February 3, 2009

We’re going to code4lib 2009 at the end of February! “We” means Mikkel, Mads, Jørn and Toke. None of us has been to code4lib before, but we’re very excited about it. Judging from the website, there will be a bunch of geeks hell bent on making interesting stuff and having fun. Making interesting stuff and having fun is what we do all day at work, but this will be new interesting stuff and new fun.

We’ll be presenting the faceting system used by Summa, more specifically how we’re scaling to a large number of documents and tags. It seems like SOLR is the default choice for a lot of people at the con and since SOLR has its own faceting system, we’ll have to be prepared for some critical questions afterwards. We all love a technical debate, especially when we’re forced to rethink our own strategy, so it’ll be great.

If the computer gods and the organisers of the conference are game, we will also do a lightning talk about Michael’s highly experimental experiment called The SummaSlider.

One little caveat: none of us has been to the USA before. We’re kind of scared about customs and the prospect of sitting close to three smelly nerds for 2 times 9 hours on the planes. On the other hand, some of us are quite looking forward to the adventure of Mads walking around in his Make them listen! T-Shirt, Jørn selling the rest of us to some gangsters in Central Park, Mikkel getting arrested for being a communist (aka Open Source evangelist) and Toke being fined for jumping naked into snow drifts from the second floor of the hotel.