While setting up ContinuityApp's production environment, our first priority was making sure that no HTTP request would fail during maintenance and deploys.
ContinuityApp's core is a RESTful API layer: the whole service is available via API, and all our components (queue workers, log collectors, frontend web application, custom DNS servers, etc.) use an extended version of these APIs too (in short: nothing but the API processes connects to the databases).
So, if we didn't want to stop a very complex infrastructure every time we needed to roll out a new (even minor) release, we had to find a way to just "pause" HTTP requests to the API, holding them until the maintenance window was over.
The first time I read about something like this was "broxy", in a post on Braintree's "Braintrust" blog describing how they moved their data center without downtime. Very cool indeed, but I couldn't find "broxy" on GitHub or by googling around. I started a little nightly hackathon to put something together, but I had to give up because it was drifting too far from ContinuityApp's core (keep the focus!).
Some time later (a few days ago, actually) I stumbled upon "intermission" from 37signals: same concept, pausing web requests using OpenResty, nginx-x-rid-header and some Lua magic. I managed to compile the required software on one of our Ubuntu Server 12.04 web boxes, but I soon discovered that putting it under Puppet was going to be a pain (again, I probably should have spent more time on it, but I didn't want to get too distracted from ContinuityApp's core).
Bonus: both "broxy" and "intermission" need Redis as an additional external dependency in order to enqueue "pending" requests (don't get me wrong: I'm a big fan of Redis, but I didn't like the idea of having one more thing to monitor and maintain).
Then, suddenly, the light at the end of the tunnel: how could HAProxy, the mother of all reverse proxies, not support this kind of feature? Well, the current stable version (1.4.22 as of this writing) doesn't, but it has since the 1.5-dev17 release. All we need to do is change the frontend's "maxconn" setting at runtime. To pause requests, set it to 0:
echo "set maxconn frontend apicluster 0" | socat stdio /tmp/haproxysock
and restore the previous value to unpause:
echo "set maxconn frontend apicluster 4096" | socat stdio /tmp/haproxysock
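Unpausing means knowing the "old value" in the first place. Here is a minimal Python sketch (assuming Python 3.8+) that talks to the same admin socket without socat and reads a frontend's current limit from the "slim" (session limit) column of HAProxy's "show stat" CSV. The helper names `haproxy_command` and `frontend_maxconn` are hypothetical; the socket path and the "apicluster" frontend are the ones used above.

```python
import csv
import io
import socket


def haproxy_command(socket_path: str, command: str) -> str:
    """Send one command to the HAProxy admin socket and return the raw reply.

    Roughly equivalent to: echo "<command>" | socat stdio <socket_path>
    In non-interactive mode HAProxy answers a single command and then closes
    the connection, so we read until EOF.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode() + b"\n")
        reply = b""
        while chunk := sock.recv(4096):
            reply += chunk
    return reply.decode()


def frontend_maxconn(socket_path: str, frontend: str) -> int:
    """Return the frontend's current session limit ("slim" in "show stat")."""
    reply = haproxy_command(socket_path, "show stat")
    # The CSV header starts with "# "; strip it so DictReader sees plain names.
    rows = csv.DictReader(io.StringIO(reply.lstrip("# ")))
    for row in rows:
        if row.get("pxname") == frontend and row.get("svname") == "FRONTEND":
            return int(row["slim"])
    raise LookupError(f"frontend {frontend!r} not found in 'show stat' output")
```

Calling `frontend_maxconn("/tmp/haproxysock", "apicluster")` before pausing gives the value to pass back when unpausing, instead of hard-coding 4096.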
There we go! No external dependency like Redis, a very robust piece of software like HAProxy (although we are using a dev version), and an easy way to pause our HTTP requests that can easily be wrapped into a Capistrano task!
The next step was to make sure that all the HTTP clients we use to connect to the APIs had a reasonably high timeout, so they can gracefully wait out a short maintenance window without failing.
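Whatever the wrapper (we use Capistrano), the important part is that the frontend is always unpaused, even when a deploy step fails. A minimal Python sketch of that pattern, assuming Python 3.8+ and the same admin socket; `haproxy_command` and `paused_frontend` are hypothetical names, not part of HAProxy:

```python
import socket
from contextlib import contextmanager


def haproxy_command(socket_path: str, command: str) -> str:
    """Send one admin command (like echo "..." | socat stdio <socket_path>)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode() + b"\n")
        reply = b""
        while chunk := sock.recv(4096):
            reply += chunk
    return reply.decode()


@contextmanager
def paused_frontend(socket_path: str, frontend: str, restore_maxconn: int):
    """Hold new requests on `frontend` for the duration of the with-block."""
    haproxy_command(socket_path, f"set maxconn frontend {frontend} 0")
    try:
        yield
    finally:
        # Always unpause, even if the maintenance step raised an exception.
        haproxy_command(
            socket_path, f"set maxconn frontend {frontend} {restore_maxconn}"
        )
```

A deploy script would then run its steps inside `with paused_frontend("/tmp/haproxysock", "apicluster", 4096):`, and requests that arrive meanwhile stay on hold until the block exits.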
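On the client side this mostly comes down to one parameter. A minimal Python sketch using the standard library's `urllib`, where the 60-second budget is an assumption (pick something longer than your longest expected pause) and `api_get` is a hypothetical helper. Note that `urlopen`'s timeout applies to each blocking socket operation, which includes waiting for the first response bytes while HAProxy is holding the request.

```python
import urllib.request

# Assumed budget: comfortably longer than the longest maintenance pause.
API_TIMEOUT_SECONDS = 60


def api_get(url: str, timeout: float = API_TIMEOUT_SECONDS) -> str:
    """GET `url`, waiting up to `timeout` seconds per blocking socket operation."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode()
```

With a timeout that is too short, the request fails with a timeout error (an `OSError` subclass) during the pause instead of completing once the frontend is unpaused.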
To control your HAProxy instance at runtime, you need to enable the admin socket by adding a line like this to the global section of your HAProxy config file:
stats socket /tmp/haproxysock level admin
Happy maintenance to everybody!
We are in the final stage before releasing ContinuityApp to the public, after 2 years of developing, iterating, fixing and refactoring. If you want, you can reserve your discount by subscribing to our announcement newsletter, and invite all your friends to do the same!