Searching at Statsbiblioteket has been slow for the last couple of days, and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy), hell-bent on killing bugs.
A hastily cobbled-together log propagation thingamabob claimed that the backend answered snappily enough, network inspection showed a very high number of requests to storage (a database that contains the records backing the search), and timeouts & session pool overflows were all over the place. The DidYouMean service was appointed scapegoat and killed with extreme prejudice.
Things got exponentially worse! Uptime was about 2 minutes, with performance in the last minute quickly falling off to unusable. Phones started ringing, emails ticked in, and an angry mob with pitchforks laid siege to the office. Inspection revealed that with DidYouMean gone, the service layer tried unsuccessfully to get its result for 20 seconds (yes, that timeout was far too high) before giving up, quickly filling the Apache session pools. DidYouMean was resurrected, services started up again and all was well. Or at least back to where it was before the wrong service was unfairly executed.
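None of the actual service-layer code appears in this post, so the following is only a sketch of the general fail-fast pattern that would have avoided the pool exhaustion; FailFast and callWithTimeout are invented names, and the real code was servlet-based rather than a bare ExecutorService. The point is that an optional enrichment service like DidYouMean should degrade to "no suggestions" after a short wait instead of holding a worker thread for 20 seconds:

```java
import java.util.concurrent.*;

public class FailFast {
    /**
     * Simulates calling an optional enrichment service (e.g. DidYouMean)
     * with a hard deadline. A generous timeout ties up the calling thread
     * and fills the session pool when the service is down; a short one
     * degrades gracefully by returning null.
     */
    static String callWithTimeout(Callable<String> service, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(service).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return null; // degrade: search still works, suggestions are simply absent
        } finally {
            pool.shutdownNow(); // interrupt the straggler instead of waiting it out
        }
    }
}
```

With a 20-second deadline and a dead service, every incoming search holds its thread for the full 20 seconds; a few hundred milliseconds would have let the front end survive the loss of a non-essential component.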
Waiting for the next hammer to drop, code was reviewed (again), pool sizes were tweaked and logs were watched intensely. At 12:09:47 and 952 milliseconds, the impact riveter started again and storage staggered. But lo and behold: the maintenance guy had changed log levels to DEBUG (for a limited amount of time). An hour and 20,000 observed requests for the exact same ID later, the magical incantation i++; was inserted in a while loop. Testing, deployment, re-deployment, Tomcat restart and another Tomcat restart followed quickly.
It turned out that certain rare records triggered the endless loop. The progressively worse performance stemmed from more and more of these edge cases piling up, each looping forever. As the overwhelmed storage was on the same server as the searcher, the shared request pool was flooded with storage requests, only occasionally allowing search requests.
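The offending loop itself is not shown in the post, so this is a hypothetical reconstruction of the shape of the bug (RecordScanner and indexOf are invented names): a while loop whose counter is only advanced on some paths, so a record that never hits the advancing branch spins forever, hammering storage with the same request:

```java
public class RecordScanner {
    /**
     * Returns the index of target in ids, or -1 if absent.
     * Before the fix, the loop below lacked the i++ line: the first
     * non-matching record made the condition i < ids.length true forever,
     * re-examining (and in the real system, re-fetching) the same record.
     */
    static int indexOf(int[] ids, int target) {
        int i = 0;
        while (i < ids.length) {
            if (ids[i] == target) {
                return i;
            }
            i++; // the one-line fix: advance past non-matching records
        }
        return -1;
    }
}
```

This also explains why performance decayed gradually rather than failing outright: each rare edge-case record added one more thread stuck in the loop, and the shared request pool drained a little further.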
With the roaring fire quelled, business returned to normal. By pure coincidence, the assignment for the coming days is vastly improved propagation, logging and visualisation of timing information throughout the services.