The dull glow of the monitors bathed our furrowed expressions in pale blue light as we pored over the HAProxy logs. It was closing in on midnight, and we had already spent several hours attempting to track down the source of our woes.
It didn’t make any sense—we had done everything right. The new infrastructure had been configured and set up well in advance; all of the latest code had already been deployed and tested internally; the dry run of the production data cut-over had, after a few minor bumps, gone completely smoothly; and the DNS switch went off without a hitch. By all rights, this deployment—the release of over half a year’s worth of planning, development, testing, sweat, blood, and tears—should have been a shining success.
So why were some of us still staring at a blank screen where the login prompt should be?
Theory 1: PEBKAC
Of course, the first place any well-seasoned engineer’s mind goes when assessing a newly reported problem is user error. Works on my machine! Problem Exists Between Keyboard And Chair. In this case, we were able to dismiss this possibility relatively quickly, since we had reproduced the problem on more than one machine. Unfortunately, we could no longer reproduce it, thanks to the other place every well-seasoned engineer’s mind goes first when troubleshooting. At a certain point it becomes second nature: clearing browser cache and cookies is not so much one of many possible diagnostic steps as it is an instinctual reflex.
A problem that is known to exist but cannot be reproduced is the bane of software engineers under any circumstances, but when one rears its ugly head in the middle of the night on the eve of a major product release it can be especially vexing. We were left with no recourse aside from staring vacantly at error logs and performing Hail Mary web searches in the hopes of gleaning some as yet undiscovered insight.
Eventually there was a minor breakthrough. The good news was that we managed to locate a web browser in the office that was still reproducing the issue. The bad news was that the browser was Internet Explorer 11 running in a virtual machine that randomly rebooted itself because the installed copy of Windows had never been registered.
Following a combination of studying the error response payload (once we managed to figure out Internet Explorer’s archaic and unfamiliar developer tools—no small feat), learning from the logs that the errors were being thrown before even hitting our application, and a few lucky DuckDuckGo searches, we formulated a new theory.
Theory 2: HAProxy Header Length Limitations
The errors being generated had an HTTP status code of 400, which indicates that the rejection occurred due to a client error, specifically a “Bad Request.” Our application logs showed nothing, but the errors were being recorded in our HAProxy logs. In our architecture, incoming requests first arrive at HAProxy, which routes each request internally to different services based on the URI. Putting these two clues together eventually led us to the discovery that HAProxy will return a 400 response in the event that the length of any header in the request exceeds a certain size limit.
Further struggles through surprise VM reboots and Internet Explorer’s frustratingly laid-out debugger revealed that it was sending a Cookie header in excess of 64 KB. The dim room brightened slightly from the newly un-suspended developer laptops, though it could just as well have been from the light bulbs going off over our heads. A few minutes with a more respectable browser and a cookie-editing extension meant we could now reproduce the issue at will. Hallelujah!
Putting aside for the moment the question of why we had browsers sending cookies that take up more memory than a rewrite of the Wolfenstein 3D engine, we quickly identified the HAProxy configuration to increase the header size limit and pushed a change to our testing environment. We also got clever, knowing that HAProxy redirects requests through Nginx servers before hitting our application, and updated an equivalent Nginx configuration.
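For the curious, the knobs involved look something like the following sketch; the option names are HAProxy’s and Nginx’s real ones, but the 128 KB values are illustrative, chosen only to clear the 64 KB cookies, and our actual configuration differed:

```
# haproxy.cfg (sketch): a request whose headers don’t fit in the buffer
# is rejected with a 400, so the buffer must exceed the largest cookie.
global
    tune.bufsize 131072      # per-request buffer, raised well above 64 KB
    tune.maxrewrite 16384    # headroom reserved for header rewriting
```

```
# nginx.conf (sketch): the equivalent limit at the Nginx layer.
http {
    large_client_header_buffers 4 128k;   # allow individual headers up to 128k
}
```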
In all my experience as a developer I have never seen a group of software engineers so gleeful to be crowded around a virtualized copy of Internet Explorer, nearly shaking with anticipation as the refresh button was clicked.
Only to be greeted by a 400 error response.
More frantic investigation followed, including looking into the source and cause of the Goliath cookies. The problem appeared to stem from various third-party components used in our app: a misconfigured analytics service was storing multiple copies of each user’s JSON session data in cookies, and an updated customer service component appeared to be cookifying the user’s entire site browsing history. We fixed the configuration of these components where we could, but for existing customers the damage had already been done; for some of them, the cookies generated by the old product had slowly built up over time to the point where even the small additional overhead added by the new product was enough to push them over the limit. It’s quite possible that users had already been experiencing the issue on our old product, infrequently enough to be brushed off as an edge case and solved by instructions to clear cache and cookies.
It was around 3AM when the decision was made that, since the problem was potentially limited to only a small portion of our user base, and could be mitigated by instructing those users to clear their cookies, we would all go home for a few hours of sleep and reconvene in the morning.
The product launch could now be considered more or less successful. Sort of. Except for that one thing. Yay.
Theory 3: HAProxy and Amazon S3 Header Length Limitations
One of the places that HAProxy redirects our requests to is a static website served out of Amazon’s S3 service; our entire front-end application is served in this manner. Inspecting the response payloads of the new errors the next morning revealed that they were similar to, but different from, the errors we had seen before the HAProxy change. One of the clues in the new response payload was the header Server: AmazonS3. More light bulbs. If HAProxy barfs on ridiculously large headers, it’s not entirely out of the realm of possibility that S3 has a similar limitation. A little digging around in the Amazon documentation and various forums seemed to confirm this.
The problem presented to us now was that we were dealing with a hosted service controlled by Amazon. We couldn’t simply go update a configuration in S3 to allow longer request headers as we could with HAProxy. After mulling over our options, we decided to update HAProxy with a new rule that would strip the Cookie header from all requests forwarded to S3. We were only serving static files from S3 anyway, so the request cookies were unimportant.
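In HAProxy the rule itself is a one-liner; a sketch, with hypothetical frontend, backend, and path names:

```
# haproxy.cfg (sketch; fe_main, be_s3_static, and /app are hypothetical)
frontend fe_main
    # the static front-end comes from S3 and has no use for cookies
    acl to_s3 path_beg /app
    http-request del-header Cookie if to_s3
    use_backend be_s3_static if to_s3
```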
Once again we deployed the fix to our test environment and crowded around our randomly rebooting Internet Explorer VM. The tension in the air was palpable as the refresh button was clicked.
Success! In place of the dreaded 400 error page we beheld our glorious login prompt. In better spirits now, despite many of us running on only two or three hours of sleep, we gleefully filled out and submitted the login form.
And nothing happened. We once again performed the nightmarish summoning ritual to invoke the Internet Explorer developer tools, and clicked the network inspection tab.
400 Request Header Or Cookie Too Large
Theory 4: HAProxy, Amazon S3, and Nginx (Again) Header Length Limitations
Back under Theory 2, we got clever and preemptively adjusted our Nginx configurations to handle larger cookies. What we had failed to take into consideration was a second layer of Nginx redirects further up the request chain running a NAXSI firewall. The configuration of these Nginx servers was independent of the ones we had already adjusted at the API layer. So while we may have been patting ourselves on the back for our cleverness, it turns out we weren’t quite clever enough.
We updated all of our Nginx configurations this time, and pushed them out to the test environment.
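The change was the same setting as before, now applied at every hop that buffers the request; a sketch with an illustrative value:

```
# nginx.conf on the NAXSI tier (sketch): this layer buffers request
# headers too, so it needs the same allowance as the API-layer servers.
http {
    large_client_header_buffers 4 128k;
}
```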
By this point, a full business day and the weekend had already passed since we released the product. We had been mitigating the cookie issue all along by letting customers know to clear their cache and cookies to fix any issues with the new site, and the related support calls had already slowed to essentially nothing.
It was also by this point that we stopped being able to reproduce the problem at all in the production environment, even on our raggedy Internet Explorer virtual machine.
We had triumphantly solved the final piece of the puzzle, just in time for the last of the damn cookies to expire anyway.