Transparent deployment with DNS-balanced Apache servers
February 16, 2007
Casey Muller

Here's the quick way I set up zero-downtime deployment on our two production servers.

Our old setup

Most Ruby on Rails sites that use mongrel clusters will have entries along these lines in their Apache config files:

    # Check for maintenance file and redirect all requests
    #  ( this is for use with Capistrano's disable_web task )
    RewriteCond %{DOCUMENT_ROOT}/maintenance.html -f
    RewriteRule ^.*$ /maintenance.html [L]
    # ...
    # more exceptions and caching stuff, followed by
    # ...
    # All remaining requests get sent to the cluster
    RewriteRule ^/(.*)$ balancer://jamglue%{REQUEST_URI} [P,L]
    # ...
    # Configure the cluster proxy
    <Proxy balancer://jamglue>
        BalancerMember http://127.0.0.1:4000
        BalancerMember http://127.0.0.1:4001
        # ...
    </Proxy>

So the old deployment technique was to put a maintenance message in public/maintenance.html, restart that server's mongrels, then remove the message. That way, users who hit the server would see a message telling them about momentary downtime rather than a 503 error or an exception from partially rolled-out code.

We got tired of letting these momentary downtimes dictate our deployment style: there's no reason to wait until 2am to push out the day's changes, especially with two servers.

Removing the user-visible downtime

So here's my solution, which proxies every request over to the other server instead:

    # this goes up near the top of the rewrite rules
    RewriteCond %{DOCUMENT_ROOT}/maintenance_redirect -f
    RewriteRule ^/(.*)$ balancer://maintredir%{REQUEST_URI} [P,L]
    # ...
    # Configure the maintenance redirect proxy
    <Proxy balancer://maintredir>
        BalancerMember http://otherserver.jamglue.com:80
    </Proxy>

Now, instead of putting an error message at public/maintenance.html, we just touch public/maintenance_redirect, and the server being restarted seamlessly sends all its traffic over to the other server. If you've got more than two servers, just add as many BalancerMember lines as you'd like.
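
As a rough sketch of the more-than-two-servers case, the maintenance proxy might look like the following. The hostname thirdserver.jamglue.com is made up for illustration; substitute whatever your other machines are actually called.

    # Configure the maintenance redirect proxy with several targets
    <Proxy balancer://maintredir>
        BalancerMember http://otherserver.jamglue.com:80
        # Hypothetical third machine, not part of our actual setup
        BalancerMember http://thirdserver.jamglue.com:80
    </Proxy>

The balancer then spreads the redirected traffic across the listed members. Each server's list should leave out its own hostname.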

Future Improvements

There are two improvements on this I've been thinking about.

The first is obviously to catch loops in case both machines are redirecting at the same time (I haven't tried this). It's easy enough to add an extra query string parameter in maintredir's proxy rule, then check for it on all the servers so that you never redirect an already-redirected request. With more than two servers, the parameter could even carry the hostname, allowing selective redirection.
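
An untested sketch of that loop check, assuming a made-up marker parameter named maintredir (any name unlikely to collide with the app's own parameters would do):

    # Only redirect requests that have not already been redirected once
    RewriteCond %{DOCUMENT_ROOT}/maintenance_redirect -f
    RewriteCond %{QUERY_STRING} !(^|&)maintredir=1(&|$)
    # Tag the proxied request; QSA keeps the original query string after the marker
    RewriteRule ^/(.*)$ balancer://maintredir%{REQUEST_URI}?maintredir=1 [P,QSA,L]

A request that already carries the marker skips this rule and falls through to the remaining rewrite rules, so the worst case is a brief error from a restarting cluster rather than two servers bouncing the request back and forth.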

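This hypothetical query string parameter could be stripped out again with a RewriteCond on %{QUERY_STRING} and a rule that rebuilds the query string without it (the QSA flag only appends to a query string, so it can't do the removal on its own). That removes any chance of it being shown to the Rails app or the user.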
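
One way the stripping might look on the receiving server, again untested, and assuming the redirecting server put a made-up marker maintredir=1 at the front of the query string:

    # Near the top of the rewrite rules: catch already-redirected requests.
    # %1 captures whatever followed the marker in the query string.
    RewriteCond %{QUERY_STRING} ^maintredir=1&?(.*)$
    # Hand the request to the local cluster with the marker removed
    RewriteRule ^/(.*)$ balancer://jamglue%{REQUEST_URI}?%1 [P,L]

When nothing followed the marker, the substitution ends in a bare question mark, which mod_rewrite treats as a request to drop the query string entirely. Because this rule proxies straight to the mongrels, a redirected request skips any exception or caching rules further down.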

The second is adding another virtual host or cookie that overrides the redirect flag. That way you can roll your new code out to one server and be the only one to see it up and fully running on your production hardware. When you're happy, you flip the site over and upgrade the other server (which you could then also check).

I had a cookie doing this for testing purposes, but removed it for simplicity's sake. It's as simple as adding a line like:

    RewriteCond %{HTTP_COOKIE} !dontredirectme

to the RewriteCond chain, and giving yourself a site cookie with the same name.
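
Put together with the redirect rule from earlier, the chain would look roughly like this:

    # Redirect only when the flag file exists and the cookie is absent
    RewriteCond %{DOCUMENT_ROOT}/maintenance_redirect -f
    RewriteCond %{HTTP_COOKIE} !dontredirectme
    RewriteRule ^/(.*)$ balancer://maintredir%{REQUEST_URI} [P,L]

The pattern is an unanchored match against the whole Cookie header, so any cookie whose name or value contains dontredirectme will bypass the redirect.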

And of course, these could be combined: the automatically added query string that overrides the redirect could double as the testing method.

Doing this with only one server

All of this applies to a single machine as well, and is even simpler there. Just set up two different balancer clusters and only use each one if its redirect file isn't present. Separate code bases for each would be kind of messy, but certainly the safest.
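
A minimal sketch of the single-machine version, with made-up names throughout: the flag files disable_a and disable_b, the balancer names jamglue_a and jamglue_b, and ports 4010 and 4011 for the second set of mongrels are all assumptions.

    # Use the first cluster unless its flag file is present
    RewriteCond %{DOCUMENT_ROOT}/disable_a !-f
    RewriteRule ^/(.*)$ balancer://jamglue_a%{REQUEST_URI} [P,L]
    # Otherwise use the second cluster, unless it is flagged too
    RewriteCond %{DOCUMENT_ROOT}/disable_b !-f
    RewriteRule ^/(.*)$ balancer://jamglue_b%{REQUEST_URI} [P,L]

    # First set of mongrels
    <Proxy balancer://jamglue_a>
        BalancerMember http://127.0.0.1:4000
        BalancerMember http://127.0.0.1:4001
    </Proxy>
    # Second set of mongrels (hypothetical ports)
    <Proxy balancer://jamglue_b>
        BalancerMember http://127.0.0.1:4010
        BalancerMember http://127.0.0.1:4011
    </Proxy>

With neither flag file present, everything goes to the first cluster and the second sits idle as a standby. Touching public/disable_a moves traffic to the second cluster while the first one restarts.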

Downtime still happens

Because the two servers are sharing a DB, certain deployments may still require brief downtime. However, since we use multi-master replication between the two machines, it's certainly tempting to try to configure a temporary stop of replication from one DB for the duration of the deployment and then let the other catch up once everything looks good... it's probably not worth it though.
