Fire fire everywhere

Searching at Statsbiblioteket has been slow for the last couple of days, and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently, the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy), hell-bent on killing bugs.

A hastily cobbled-together log propagation thingamabob claimed that the backend answered snappily enough, network inspection showed a very high number of requests to storage (a database that contains the records backing the search), and timeouts & session pool overflows were all over the place. The DidYouMean service was appointed scapegoat and killed with extreme prejudice.

Things got exponentially worse! Uptime was about two minutes, with performance in the last minute quickly falling off to unusable. Phones started ringing, emails ticked in and an angry mob with pitchforks laid siege to the office. Inspection revealed that killing DidYouMean meant that the service layer unsuccessfully tried to get the result for 20 seconds (yes, that timeout was far too high) before giving up, quickly filling the Apache session pools. DidYouMean was resurrected, services started up again and all was well. Or at least back to where it was before the wrong service was unfairly executed.
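As an aside on that 20-second timeout: the general cure is to give calls to an optional helper like DidYouMean a short deadline and a graceful fallback, so a dead dependency fails fast instead of holding a worker thread (and with it the session pool). The sketch below is purely illustrative and assumes a plain HTTP call; the class name, URL and timeout values are made up, not the actual service-layer code.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    // Hypothetical illustration only: an optional suggestion service gets a
    // short deadline and an empty-result fallback, so an outage costs
    // milliseconds per request instead of 20 seconds of a blocked thread.
    public class FailFastSuggestions {
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(500))   // give up quickly on a dead service
                .build();

        public static String didYouMean(String query) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/didyoumean?q=" + query)) // placeholder URL
                    .timeout(Duration.ofMillis(800))      // suggestions are optional; fail fast
                    .build();
            try {
                HttpResponse<String> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                return response.body();
            } catch (Exception e) {
                return ""; // degrade gracefully: no suggestion beats a blocked search page
            }
        }
    }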

[Image: Mads stands up and codes. A sure sign of high alert.]

While waiting for the next hammer to drop, code was reviewed (again), pool sizes were tweaked and logs were watched intensely. At 12:09:47 and 952 milliseconds, the impact riveter started again and storage staggered. But lo and behold: the maintenance guy had changed the log levels to DEBUG (for a limited time). An hour and 20,000 observed requests for the exact same ID later, the magical incantation i++; was inserted in a while loop. Testing, deployment, re-deployment, a Tomcat restart and another Tomcat restart followed quickly.

It turned out that certain rare records triggered the endless loop. The progressively worse performance stemmed from more and more of these edge cases piling up, each looping forever. As the overwhelmed storage was on the same server as the searcher, the shared request pool was flooded with storage requests, only occasionally allowing search requests.
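For readers wondering how a single missing i++; can do that much damage, here is a hypothetical sketch of the bug class (the names and the request cap are invented for illustration, not taken from the actual Summa code): the loop only advances its index when a record parses cleanly, so one rare malformed record makes it request the same ID from storage over and over.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch of the bug class: the index only advances on the
    // happy path, so a single malformed record keeps the loop fetching the
    // same ID from storage forever. A request cap is added so the demo stops.
    public class EndlessLoopSketch {
        public static void main(String[] args) {
            List<String> ids = Arrays.asList("rec_1", "rec_2", "rec_broken", "rec_4");
            int i = 0;
            int requests = 0;
            while (i < ids.size() && requests < 10) {
                String rec = fetchFromStorage(ids.get(i));
                requests++;
                if (isWellFormed(rec)) {
                    System.out.println("Processed " + rec);
                    i++; // only incremented when the record is well-formed...
                }
                // ...so "rec_broken" is fetched again and again. The fix is to
                // advance the index for malformed records too and skip them.
            }
            System.out.println("Storage requests issued: " + requests);
        }

        private static String fetchFromStorage(String id) {
            return id; // stand-in for the real storage lookup
        }

        private static boolean isWellFormed(String rec) {
            return !rec.contains("broken");
        }
    }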

With the roaring fire quelled, business returned to normal. By pure coincidence, the assignment for the next few days is vastly improved propagation, logging and visualisation of timing information throughout the services.

One Response to “Fire fire everywhere”

  1. Toke Eskildsen Says:

    Thank you for your suggestion, but the color for the header is a (very subtle) reference to the primary color used for an older version of Summa. We kind of like such half-hidden hints.
