Hurry! The train is coming!

Ye Olde Chu Chu Train aka Stable Summa proved to be not so stable after all. Or did it? Well, that depends on the eyes behind the big finger of blame…

A high-profile source for the Search system at Statsbiblioteket was only partially indexed some days ago. Okay, funny stuff happens at times. Maybe it was faulty source data, maybe the bad moon, maybe network problems? Let’s give it another go, the crafty developers agreed. Lo and behold, after the second go… the system still tanked for the same source (and for another source, but nobody noticed, because that wasn’t as high-profile and could comfortably hide in a corner).

Twice in a row is bad. A rollback to a working index was performed. Again. The accusing finger was pointed at some suspicious-looking machines in the distributed indexing network; they were excluded and the third round was started. In reality the machines were just scapegoats and, fearing waterboarding by the users, the not-so-cocky developers began to tear index segments and log files apart in search of The Real Explanation.

It turns out that Lucene is known for corrupting indexes under certain circumstances. Aha! So Lucene was responsible! …Err, no. The error was in Java HotSpot versions 1.6.0_04 – 1.6.0_10-b25 and was triggered by Lucene when working with large (~20GB) indexes. While the individual machines in the distributed network never reached that size, they surely generated more than that in total. If the HotSpot error is triggered mainly by chance and not by index size, which seems very likely, that would explain the problems.

The batch generation of the new index was underway. It was still in the ingest phase, but quickly approaching the distributed index phase and thus another crash. The frantic developers checked Java versions, excluded some machines and up- or downgraded others to ensure that the network was clean. Happy fun time, but it was done in time and the tracks are now in order. The train will arrive at its destination tomorrow evening, so here’s to a safe journey.
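For the curious, checking a JVM against the affected range could look roughly like the sketch below. This is not the code from Summa: the class name `HotSpotVersionCheck` and the method `isAffected` are made up for illustration, and the parsing assumes version strings shaped like `1.6.0_05` or `1.6.0_10-b25`. Treating an update 10 without a build suffix as affected is also an assumption, chosen to err on the side of caution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not the actual Summa check).
final class HotSpotVersionCheck {
    // Assumed format: "1.6.0_<update>" optionally followed by "-b<build>".
    private static final Pattern JAVA6 =
            Pattern.compile("^1\\.6\\.0_(\\d{1,3})(?:-b(\\d{1,4}))?.*$");

    /** Returns true if the version lies in 1.6.0_04 – 1.6.0_10-b25. */
    static boolean isAffected(String version) {
        if (version == null) {
            return false; // nothing to judge
        }
        Matcher m = JAVA6.matcher(version);
        if (!m.matches()) {
            return false; // not a Java 6 update release
        }
        int update = Integer.parseInt(m.group(1));
        if (update < 4 || update > 10) {
            return false;
        }
        if (update < 10) {
            return true; // updates 4 to 9 are inside the range
        }
        // Update 10: affected up to build 25. Without build information
        // we cannot tell, so we assume the worst (an assumption of this sketch).
        String build = m.group(2);
        return build == null || Integer.parseInt(build) <= 25;
    }

    public static void main(String[] args) {
        // "java.runtime.version" usually includes the build suffix on
        // Sun/Oracle JVMs; it may be absent on other JVMs.
        String version = System.getProperty("java.runtime.version");
        if (isAffected(version)) {
            System.err.println("Warning: Java " + version
                    + " is in the range known to corrupt large Lucene indexes");
        }
    }
}
```

Run on each machine in the indexing network, a check like this would flag the JVMs that need to be up- or downgraded before the next batch is started.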


All men on deck! Fix the tracks! (picture by Thomas Milne)


All this would of course not have happened if the production system had been running the new Summa instead of the old one (yes, we’re working on the switch, but we keep getting sidetracked by firefights like this one). You see, the new Summa train is capable of flying and hitting 88 mph!

*ahem*

Time to go home.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.
This entry was posted in Hacking, Statsbiblioteket, Summa.

One Response to Hurry! The train is coming!

  1. Toke Eskildsen says:

    Should anyone wonder: yes, purging the buggy Java versions did work. We’ve added code to Summa that checks for these versions upon startup, to help Summa users avoid that particular trap.
