Archive for the ‘Uncategorized’ Category

Quick and dirty test of the YUI Compressor

July 2, 2009

As part of our quest to optimize the speed of our search front end I recently tried out Yahoo's JS and CSS minifier – YUI Compressor.

At first glance the nice things about the YUI Compressor are that it is Java based (we are a Java friendly team), open source and fairly easy to work with. The YUI Compressor handles both JavaScript and CSS, but in this post I have chosen to focus on the JS part.

The test integration into my IDE (IntelliJ IDEA) and the project was quite easy because somebody has taken the time to write YUIAnt. I just downloaded the YUI Compressor version 2.4.2 and the YUIAnt.jar, added them to the project and modified my build scripts to run the compressor when the website is deployed to the web server. The beauty of this is that you don’t have to look at the minified JavaScript when editing, and if you for some reason want to debug the code at run time you can easily set up a debug option in your build script and bypass the compressor for on-the-fly debugging. If you aren’t into all this build script stuff or have a simple project, there are lots of online YUI Compressor sites out there where you can paste your JS or CSS code and get a compressed version in return.

Version 2.4.2 of the YUI Compressor nearly worked without problems. For some reason – I didn’t bother to investigate further – the YUI Compressor had some issues with unterminated strings in the jscalendar-1.0 library. I just excluded the directory and went on with my small non-scientific test using Firebug as my test environment.

The first screen shot shows the size and load times for our js files. Business as usual – the YUI Compressor is disabled.

nocompress2Scaled

The next screen shot shows the size and load times for the same js files now with the compression enabled.

compress2Scaled

The file sizes have been reduced and the overall load time has shrunk by approximately half a second. When the files are very small the load times are very sensitive to queuing effects, but the file size is in most cases reduced. In the case of bigger JS files the improvement in speed as well as size is clear. I have tried to compensate for caching effects in both cases (compressed/not compressed). It seems that there is about a 20-25% reduction in file size and approximately the same reduction in load time for the JS. These numbers are without using the obfuscation option (reduction of variable names to the shortest possible length and other tricks), simply because I don’t think we will be comfortable with this knowing that it might cause errors.

As I am new to this I am interested to hear about any major drawbacks compressing/minifying may have.

This is of course a small step and not something which alone makes the difference between a slow and a fast site but I am hoping that attention to a number of different optimization issues will make a big difference in the long run.

Thoughts on optimizing our search web site

July 1, 2009

wwws

The code for our search front end has over time grown to a considerable size and we have started to suspect that the web site’s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end – especially when the underlying search engine Summa has proven to be blazingly fast.

There are a lot of things we could do better such as:

1. Optimizing the javascript code by trawling through the lot and removing redundancy as well as rewriting some of the methods to be more efficient.

2. A thorough cleanup of the css. There is a lot we can do here as we have loads of redundancy, classes which are not in use anymore and declarations which could be handled way cooler. Another thing I noticed is we like divs – loads of divs.

3. Taking a critical look at our numerous DOM transformations. Some of them are downright unnecessary.

4. General optimizing of the server side code. In fact this part isn’t all that bad but a general clean up once in a while doesn’t hurt anybody.

Because my summer holiday is coming up soon I have chosen to start with some lightweight stuff. I have tried out the newest version of the YUI Compressor – a tool to compress/minify JavaScript and CSS. As we don’t use minifying at the moment we should be able to benefit from it performance wise. In order not to clutter up this post I will describe my experience with it in a separate post soonish.

ThreadLocal StringBuilders for Fast Text Processing

March 12, 2009

It is a common task for many server-like applications to process a lot of text-like objects in some streaming manner or other. Many Java programmers tend to think that it is a very good idea to do it like this:

public String processRecord(Record rec) {
    StringBuilder builder = new StringBuilder(1024);
    // Build a string from rec
    return builder.toString();
}

If this pattern looks completely sensible to you then please read this blog post carefully :-)

Debunking A Myth: Object Allocations Are Not Free

Allocating lots of Java objects is a bad idea; simple as that. I recently optimized our object allocation in the Summa indexer using the technique I am about to describe and it gave a 2-3 times increase of our overall throughput! Java, specifically the JVM, is not a magic beast that can dispose and allocate memory for free. The modern garbage collectors are very cool and advanced, but they are only so good. Even though Java is a garbage collected language you still have to think about your memory allocations!

In the example above the StringBuilder is especially bad because it typically allocates a rather big char array. So we should try to avoid that.

Resetting a StringBuilder

Contrary to the impression that the Javadocs for StringBuilder will give most people (in the classical over-generalization manner “most people” will mean “me”), you can reset a StringBuilder by doing

builder.setLength(0);

This will not allocate a new char array underneath; I know because I checked the Java 6 source code.
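If you would rather see it than take my word for it, a small sketch along these lines should do. It fills a builder, resets it with setLength(0) and reports the capacity that is left. The class and method names are made up for this illustration, and the retained capacity is what the JDK source does rather than something the Javadocs promise.

```java
// Sketch: setLength(0) empties a StringBuilder but keeps its backing array.
// BuilderResetDemo and capacityAfterReset are hypothetical names.
final class BuilderResetDemo {

    /** Appends charsToAppend characters, resets the builder, returns its capacity. */
    static int capacityAfterReset(StringBuilder builder, int charsToAppend) {
        for (int i = 0; i < charsToAppend; i++) {
            builder.append('x'); // forces the internal char array to grow
        }
        builder.setLength(0); // length drops to 0, the char array is kept
        return builder.capacity();
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder(16);
        int capacity = capacityAfterReset(builder, 5000);
        System.out.println("length=" + builder.length() + ", capacity=" + capacity);
    }
}
```

The printed length is back at zero while the capacity stays at whatever the builder grew to, so the next record can reuse the same array.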

Keeping Only One StringBuilder Around: Thread Locals

Thread locals are an often underused feature, both in Java and in many other languages. With Java generics they are actually a breeze to use. First I had better clarify what “thread local” means. A variable that is thread local will only exist on the thread for which it was created. This makes it easier to handle concurrency because, well, you don’t have to :-) Each thread will have its own copy of the variable around.

Now might be a good time to check out the Javadocs for the ThreadLocal class. To create a thread local string builder declare a variable like this:

private ThreadLocal<StringBuilder> threadLocalBuilder =
        new ThreadLocal<StringBuilder>() {
    @Override
    protected StringBuilder initialValue() {
        return new StringBuilder();
    }

    @Override
    public StringBuilder get() {
        StringBuilder b = super.get();
        b.setLength(0); // clear/reset the buffer
        return b;
    }
};

Beware: The above thread local string builder will reset its character buffer each time you grab a reference to it. I usually find myself wanting this behavior, but if you don’t want this you should comment out the setLength(0) line.

To use the thread local builder in a method simply do:

public String processRecord(Record rec) {
    StringBuilder builder = threadLocalBuilder.get();
    // Build a string from rec
    return builder.toString();
}

Caveat Emptor

So what’s the catch? You will be keeping one string builder around for each new thread that ever enters processRecord(). This could potentially end up as lots of string builders if your application is designed like this. Also, if you ever build a very large string the string builder will keep its internal character buffer at that size even though you reset it. It will be up to you, dear reader, to determine if that will be a problem for you. Note however that thread local variables are deallocated when the thread owning them dies.

Of course one could also add some more intelligent resetting logic in the ThreadLocal.get() method above, like allocating a new string builder if b.capacity() becomes too big.
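A minimal sketch of that idea could look like the following. The 64 KB limit and the names BoundedThreadLocalBuilder, LOCAL and MAX_RETAINED_CAPACITY are assumptions picked for the example, not something taken from the Summa code; tune the limit to the record sizes you actually see.

```java
// Sketch: a thread local builder that throws away oversized buffers.
final class BoundedThreadLocalBuilder {

    // Hypothetical limit: buffers that grew beyond this are not reused.
    static final int MAX_RETAINED_CAPACITY = 64 * 1024;

    static final ThreadLocal<StringBuilder> LOCAL =
            new ThreadLocal<StringBuilder>() {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder();
        }

        @Override
        public StringBuilder get() {
            StringBuilder b = super.get();
            if (b.capacity() > MAX_RETAINED_CAPACITY) {
                // The buffer grew too big: replace it so the large
                // char array can be garbage collected.
                b = new StringBuilder();
                set(b);
            } else {
                b.setLength(0); // normal case: reuse the existing buffer
            }
            return b;
        }
    };

    public static void main(String[] args) {
        StringBuilder builder = LOCAL.get();
        builder.append("hello");
        System.out.println(builder);
    }
}
```

The trade-off is one extra allocation after each unusually large record in exchange for not pinning a huge char array to every thread forever.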

More Optimization: Don’t Allocate the Final String

The observant reader will notice that I also allocate a new String when I do builder.toString() at the end of processRecord(). This can also be avoided, but I have to change the method signature to return a Reader instead of a String:

public Reader processRecord(Record rec);

Unfortunately Java does not come with a Reader implementation that wraps a StringBuilder (or more generally a CharSequence). You can find such a CharSequenceReader in the Summa source code under the LGPL (I would have linked directly to the code in our SVN repo, but it requires a login (which you can create yourself, but it is a pain)).
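To give an idea of what such a class involves, here is a minimal sketch of a Reader over a CharSequence. It is not the Summa CharSequenceReader; the class name SimpleCharSequenceReader and the reuse() method are assumptions made up for this illustration.

```java
import java.io.Reader;

// Sketch: a Reader that reads straight from a CharSequence (for example a
// StringBuilder) without copying it into a String first.
final class SimpleCharSequenceReader extends Reader {

    private CharSequence source;
    private int position = 0;

    SimpleCharSequenceReader(CharSequence source) {
        this.source = source;
    }

    /** Hypothetical helper: point the same reader at a new sequence. */
    SimpleCharSequenceReader reuse(CharSequence newSource) {
        this.source = newSource;
        this.position = 0;
        return this;
    }

    @Override
    public int read(char[] buffer, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        int remaining = source.length() - position;
        if (remaining <= 0) {
            return -1; // end of the sequence
        }
        int count = Math.min(length, remaining);
        for (int i = 0; i < count; i++) {
            buffer[offset + i] = source.charAt(position++);
        }
        return count;
    }

    @Override
    public void close() {
        // Nothing to release; the sequence is owned by the caller.
    }
}
```

One thing to keep in mind with this approach: the reader looks at the live builder, so the caller has to finish reading before the same thread resets the builder for the next record.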

So the exercise for the eager reader is to wrap that CharSequenceReader in a thread local to also avoid allocating that one again. Note that you also need to reset the CharSequenceReader by calling reset() on it.

Here’s to a faster future!

Lightning Talks

February 26, 2009

On Wednesday Toke delivered a lightning talk about the marvels of solid state drives. Unfortunately Toke had to use two tries as his images (graphs) were not displayed on the conference Mac the first time. People were clearly impressed by the speed of these things and the next presenter actually mentioned that he will look into SSD for his project (HathiTrust).

Toke doing his lightning talk


Mikkel has just delivered his lightning talk on Summa and he enticed the crowd with his storytelling approach and cool graphics on the slides.

Mikkel doing his lightning talk


It’s been great, just super. Hans, we’re sending at least two guys next year, you just need to find the money!

And now… New York!

Code4Lib day 3

February 26, 2009

Ian Davis, the Talis CTO, gave the keynote this Thursday morning. The topic was information sharing in the past and in the future. Ian claimed that the success of the web was not just about free (free as in free beer) information sharing, but also very much based on the ability to link documents.

I must admit that it felt a bit like he was preaching to the choir. The main point was that we should open up data and enable a rich semantic web of linked data. Hard to disagree with :-)

One point that I feel is worth stressing was when he put up a conjecture on his slides: Conjecture: Data outlasts code. This leads to the following: Corollary: Open data is more important than open source. I am not sure that I necessarily agree, but it is food for thought.

Next on was Edward M. Corrado, the head of library technology at Birmingham University. The topic of his talk was the “Open Platform Strategy”. At the core of it I think his main point was that vendors are starting to open up, and even though it is not completely open source we should embrace their initiative (OCLC and Ex Libris are both doing this). It seems to me a very pragmatic approach (and I love that!), but also a road that could lead to vendor lock-in.

Seen in the light of Sebastian’s (Index Data) thunder talk for open standards yesterday I guess Edward and Sebastian could have a heated debate… :-)

“A modern open webservice-based GIS infrastructure” by Adam Soroka & Bess Sadler

GIS systems need special data repositories because GIS data is weird:

  • The data is huge (terabyte scale)
  • It lives in odd formats
  • It requires special tools for use
  • It deserves special descriptions

The Open Geospatial Consortium produces standards (not tools) for GIS data. There exist several standards for GIS data:

  • ISO TC 211
  • GML (geographic markup language)
  • ISO 19115 (UML based) / ISO 19139 (standard for serializing 19115 to XML)

Service standards (retrieve data):

  • Web map service – Query is simple key/value and it can return a variety of formats (pdf, png, jpg etc.)
  • Web feature service – returns GML (geographic markup language)

Tools needed to get a GIS running:

  • Database
  • Geoserver
  • Geonetwork

“Visualizing Media Archives: A Case Study” by Chris Beer and Courtney Michael.

Media archives are different. It is important to present media data in context. Linked data is used to present data while keeping it in context.

Media archives are visual, and traditional library interfaces are almost all text based. So what they have made is a system to graphically display images and their relations in a graph – the graph is interactive, allowing the user to browse through the data and change their focus.

(Notes by Mikkel, Mads & Jørn)

code4lib day 2

February 26, 2009

(Notes by Mads and Jørn, shovelled in by Toke, let’s hope we find the time for proper edits later – we’re running very fast here, but it’s great)

Keynote: Sebastian Hammer

Journals and books will disappear faster than we think.

Maybe we don’t need to follow market forces – sometimes they fail. We need to do more of the boring stuff – i.e. less code, more standards.

The libraries need to be more proactive.

What (local) libraries do well:

  • Preservers of cultural heritage
  • Conveyers of authoritative information
  • Supporters of learning and research
  • Pillars of democracy

Libraries need to be the more open alternative to the commercial players – ask questions, put up a fight.

It needs to be a lot easier to put systems together – standards are needed for collaboration.

Systems and organisations need to surrender data freely.

The Open Library Environment
Lifting the ILS up to be a more integrated part of existing systems – i.e. if there is already an acquisition system, then talk to that, integrate with the existing single sign-on solution. There is no code yet from the Open Library Environment, but many plans for architectural components.

OLE is defining the core Service-Oriented architecture – but they are very interested in feedback, and for that they have put up a survey

http://oleproject.org/2009/02/24/ole-project-related-applications-servey/

A running system is 2 to 3 years away.

Blacklight as a unified discovery platform
“Yet another next-generation catalog”. Very much about serendipity. A lot of the “aha” moments come when you are browsing – and that is being taken away from the users by just giving them the exact electronic versions, and there is currently nothing there to replace this serendipity.

Solr as a backend, so it has relevancy ranking, faceting, Unicode support, etc. All of which are great for the user experience. Blacklight also has permanent URLs, RSS feeds, and more. Allows RESTful access to their data to make it easier for other people to do mashups.

No single interface can fit all – i.e. chemistry students have different needs than students from the music department. So they have made specific search interfaces available to help answer the common questions posed by the different groups, for example searching by musical instruments. Create new fields when indexing to facilitate searching, for instance based on the year the music was composed they create groups of genres.

Blacklight also contains all the data from their Fedora installation using gsearch so they get live updates to their Solr index.

Items from collections have different behavior, i.e. a scanned picture is displayed differently than a scanned book.

Cooperates with Vufind to index marc in Solr (solrmarc), and also do a lot with marc4j. Standardizing on jangle.

A new platform for open data – biblios.net web services – Galen Charlton (LibLime)

  • libraries agree with principles
  • not all have the staff
  • sell services

open data – the final frontier

  • libs provide licences but not open dt th cm

open library project

open data commons licence

interface libs for UI’s

biblios.net

  • free browserbased cataloging service
  • pazpar2
  • webservices – to interact with biblios.net – push/pull records
  • push records to create standardized access
  • support for SRU/OpenSearch search
  • multiple formats (XSL transformations) MARC/dc/mods

Extended biblios – the open source web based metadata editor – Chris Catalfo

  • implementing plugins to extend biblios.
  • loads any plugin defined in biblio’s config file
  • example – create editor plugin
  • example – adding Extjs/documents using CouchDB as backend
  • example – listening to biblios.app events

What we talk about when we talk about FRBR – Jodi Schneider

We talk about different things when talking about FRBR

If I refer to a book – different places but no canonical version. We would like one place. Same with the library catalogue – can I get it? FRBR is about relationships.

Weak idea of FRBR:

  • group manifestation
  • work – manifestation relation
  • work state grouping – xISBN/LibraryThing service (group manifestation service)
  • status of FRBR work set grouping

There is more to it

  • work set grouping does not say anything about contents

Less weak FRBR:

  • open library –> example of work identifier (instead of just grouping)

Strong FRBR

  • we must have items and expressions related back to work and manifestation

example: LC FRBR display tool

  • beyond just works and manifestations

example: VTLS catalogue

  • collections by works/manifestations (all of author)
  • figure /group 1&2&3 entities (the complete FRBR notion)
  • connect and remix, link up with other systems representation of FRBR in RDF (IFLA, Davis&Newman), LIBRIS linked data (FRBR related tag), id.loc.gov

We are going for a (re)usable bibliographic universe!

What can we do?

  • demand strong FRBR
  • build linked data (freebase)
  • create the algorithms – share under open license

Complete faceting
Toke gave a good and very technical (read super geeky) talk on Summa’s faceting system. Of course some people didn’t understand a word he said (that’s not necessarily a bad thing) but those who did get it were very excited. Toke was a very ‘in demand’ person after his talk. We are looking forward to seeing which opportunities this will bring.

The Rising Son – making the most of Solr

performance

  • java is not slow, measure.

memory

  • omitNorms, omitTf

query parsing

  • “it depends”

Data import

  • …DIH, Solr Cell, CSV, LuSql…..
  • SOLR ruby mapper

request handler

Solr as IR toolkit

  • frontend to Lucene

LocalSolr

  • geo searching, submit lat/long queries

Faceting

  • Solr 1.4 performance – dramatic advance in performance. Multi select facets.

User interface

  • “the interface is the app”. Abandoned the bottom up (data to app). Think app first and bring it down (app to data).

SolrJS library – standard AJAX components

Lucid articles – tutorials – podcasts – blog

FreeCite – An open source free-text citation parser
Background:
example – Brown searchable database – no citation links
first step is parsing the data into the relevant fields. FreeCite handles this.
FreeCite has a webservice API.
What does FC do?
response: OpenURL – ContextObject and JSON. “This is rocket surgery” –> the data is not clean (would be trivial if the data was well-formed). E.g. letters in volume, title and journal name after each other.

freecite.library.brown.edu

Great facets, like your relevance, but can I have links to Amazon and Google Book Search  – squeezing more out of the OPAC

Goldrush discovery UI, a lot of examples. “But we want more!”. Lots of individual hacking, no similarity, and that gives low reuse. We need component-like extensions. The answer is JUICE. Examples of content pulled in via iframes, read-aloud (text-to-speech), fadedown integrated on the web site (the Greasemonkey way).

JUICE (supported by jQuery) –> metadefs –>panels –>extension
- a few JS lines ingested to the page

Why JUICE?

  • easy to copy/paste
  • easy to create new by modifying
  • shares OPAC specific knowledge
  • very little product specific dependencies
  • open source

juice-project.code.google.com

Freebasing for fun and enhancement – Sean Hannan

REST, api.freebase.com. Returns JSON.
Example: Academy Award winners: Freebase schemas define the fields. Acre – templating. Subquerying – open a new JSON block and query again.

Usability testing our Summa front-end

February 4, 2009

We are in the midst of conducting a usability test of Search, which is our front-end to Summa. The test is a straightforward usability test where 8 users are invited to try out the system, and along the way they are given different tasks to ensure that they get around most corners of the site. While the users try it out (one at a time) and solve tasks they are observed, and we log as much data as we can, both by taking notes during each test and by looking through session based log data afterwards. As we are the makers of the system we have hired an external company to prepare and conduct the test to ensure the results are as impartial as possible.

It is always useful to “lay it on the line” and let others have a look at your work and tell you what you can do better and what works. We are looking forward to receiving the final report.

All the cake you can carry

January 29, 2009

Today is the annual eat-all-the-cake-you-can-eat-or-carry in the cantina.
The concept is straightforward: 62 cakes. 20 kroner. One plate.

Obviously we have a strong tradition of commemorating this day. This year was no exception. (Actually, this year the concept celebrates its 10th anniversary).

Here are some visual impressions from the feast by way of Toke.

thumb_00007
Alert and agile young developers going for the cake. Note how Thomas has taken the lead.

thumb_00014
62 cakes and 1 billion people.

thumb_00017
Cantina boss wondering whether Jeppe has violated the rules by putting an entire cake on his plate…It later turned out that he hadn’t.

thumb_00042
And the winner in the “most obscene number of cakes which I know I can’t eat anyway but hey it’s free” category: Thomas

thumb_00048
Drowsiness.

Who manhandled the whipper?

January 27, 2009

Yesterday the coffee machine started displaying the very self-explaining error message “Whipper XX overload”.

Self-explaining error message


Nobody really knows what this means, but since the machine still seems capable of pouring mediocre coffee there is no panic yet. So the bottom line is that we have still got coffee, but somebody has manhandled the whipper to an extent that caused overload. We will let the service guy/girl sort out the rest.

Learning how to communicate and cooperate in teams

January 22, 2009

Yesterday we all went to get a brush-up on how to communicate and make better teams. Actually, it was not only us, but the whole IT department, totaling 42 bodies. The man to guide us was Ejnar, a former school teacher turned consultant. He also volunteers as a handball coach and was able to shout louder than anyone I’ve ever met. He was in fact a very nice and down-to-earth fella.

The day involved identifying what good communication and teamwork is all about and eventually trying to work out areas where we could improve – both as a whole department and as individual groups. Ejnar employed an Open Space approach. It allows one to extract common themes from seemingly disparate, chaotic data. It worked nicely.

The highlights of the day were Ejnar spontaneously bear-hugging Mads – who is notoriously shy of physical human contact – and the discovery that the chicken-disguised-in-tomato-sauce served for lunch was partly raw.

Afterwards, we all went to eat and drink at Bryggeriet. We had a nice roast and several good beers. Especially the Christmas Brew (yes, still available) and Dark Ale were great – the latter with nice notes of liquorice.

The day in photos (thanks to Toke)
Unfortunately, we have no photos of Mads being hugged…

bi_20090121_1152
A little team building before lunch

bi_20090121_1312
Lunch…

bi_20090121_1440

bi_20090121_1456
Doing the Open Space job.

bi_20090121_1449_2
The consultant’s suitcase(s).

bi_20090121_606
Me talking. Please note the subtle Acer logo from the projector.

bi_20090121_1954
Free beer and dinner. Nice.