Our deployment process at UrbanBound has matured considerably over the past year. In this blog post, I’d like to describe how we moved from prescribed deployment windows with downtime to a deploy-when-ready process that could be executed at any point in time.
The Early Days
About a year ago, UrbanBound was in the middle of transitioning from the “get it up and running quickly to validate the idea” stage to the “the idea works, let’s make sure we can continue to push it forward reliably” stage. Up until that point, we had accrued a significant amount of technical debt, and we didn’t have much in the way of a test suite. As a result, deploys were unpredictable. Sometimes new code would deploy cleanly, sometimes not. Sometimes we would introduce regressions in other areas of the application, sometimes not. Sometimes deploys would interfere with users currently using the app, sometimes not. Our deployment process was simply not reliable.
Stopping the Bleeding
The first order of business was to stop the bleeding. Before we could focus on improving the process, we needed to stop it from being a problem. We accomplished this with a few process changes.
First, we decided to limit the number of releases we did. We would deploy at the end of each two-week sprint, and whenever a critical bug fix needed to go out. That’s it. We made some changes to our branching strategy in git to support this workflow, which looked something like this:
All feature branches would be based off of an integration branch. When features were completed, reviewed, and had passed QA, they would be merged into this integration branch.
At the end of every two-week sprint, we would cut a new release branch off of the integration branch. Our QA team would spend the next few days regression testing the release branch to make sure everything looked good. From this point on, any changes made to the code being released (a fix for a bug QA found, for example) would be made on this release branch, and then cherry-picked over to the integration branch.
When QA was finished testing, they would merge the release branch into master, and deploy master to production.
Emergency hotfixes would be done on a branch off of master, and then merged into master and deployed when ready. This change would then have to be merged upstream into the integration branch, and possibly a release branch if one was in progress.
A very similar workflow to the one described above can be found at http://nvie.com/posts/a-successful-git-branching-model/
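The branch flow described above can be sketched as a handful of git commands. This is only an illustration; the branch names, file names, and sprint number are all made up:

```shell
# Sketch of the workflow above; branch and file names are hypothetical.
set -e
git init -q repo
git -C repo config user.email "dev@example.com"
git -C repo config user.name "Dev"

echo base > repo/app.txt
git -C repo add . && git -C repo commit -qm "initial commit"
git -C repo branch -M master
git -C repo branch integration

# Feature branches are cut from, and merged back into, integration
git -C repo checkout -qb feature/widget integration
echo widget > repo/widget.txt
git -C repo add . && git -C repo commit -qm "add widget"
git -C repo checkout -q integration
git -C repo merge -q --no-ff -m "merge feature/widget" feature/widget

# At sprint end, cut a release branch for QA to regression test
git -C repo checkout -qb release/sprint-12 integration

# A fix for a bug QA found lands on the release branch...
echo fix > repo/fix.txt
git -C repo add . && git -C repo commit -qm "fix QA bug"
# ...and is cherry-picked back over to integration
git -C repo checkout -q integration
git -C repo cherry-pick -x release/sprint-12 >/dev/null

# When QA signs off, the release branch is merged into master and deployed
git -C repo checkout -q master
git -C repo merge -q --no-ff -m "release sprint 12" release/sprint-12
```

An emergency hotfix would follow the same shape: branch off master, merge back to master, deploy, then merge the change upstream into integration (and any in-flight release branch).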
This process change helped us keep development moving forward while ensuring that we were releasing changes that would not break production. But it did introduce a significant amount of overhead. Managing all of the branches proved challenging. And it was not a great use of QA’s time to spend 2-3 days regression testing the release branch when they had already tested each of the features individually before they were merged into the integration branch.
Automated Acceptance Testing
Manually testing for regressions is a crappy business to be in. But, at the time, there was no other way for us to make sure that what we were shipping would work. We knew that we had to get in front of this. So, we worked to identify the critical paths through the application: the minimum amount of functionality that we would want covered by automated tests in order to feel comfortable cutting out the manual regression testing step of the deployment process.
Once we had identified the critical paths through the application, we started writing Capybara tests to cover those paths. This step took a fair amount of time, because we had to do it while continuing to test new features and performing regression testing for new releases every two weeks. We also had to flesh out how we wanted to do integration tests, as integration testing was not a large part of our testing strategy at that point in time.
Eventually, we had enough tests in place, and passing, that we felt comfortable ditching the manual regression testing effort. Now, after QA had passed a feature, all we needed to see was a successful build in our continuous integration environment to deem the code ready for deployment.
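The resulting gate is simple enough to sketch in a few lines of shell. Everything here is hypothetical (the test command, the deploy step); the point is just that a green build, rather than a manual regression pass, is what unlocks the deploy:

```shell
# Hypothetical deploy gate: run the acceptance suite, and deploy only
# when it exits cleanly. Substitute your own test command, e.g.
#   deploy_if_green bundle exec rspec spec/features
deploy_if_green() {
  if "$@"; then
    echo "build green, deploying"
    # git push heroku master   # the actual deploy step would go here
  else
    echo "build red, not deploying" >&2
    return 1
  fi
}
```

In practice a CI service plays this role rather than a local script; the sketch just makes the gate explicit.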
Zero Downtime Deploys
We deploy the UrbanBound application to Heroku. Personally, I love Heroku as a deployment platform. It is a great solution for applications that can work within the limitations of the platform. However, one annoyance with Heroku is that, by default, your application becomes totally unresponsive while it restarts after a deploy. The amount of time it is down depends on the application and how long it takes to boot. But this window was large enough for us that we felt deploying multiple times per day would be disruptive to our users.
Thankfully, Heroku offers a rolling reboot feature called preboot. Instead of stopping the web dynos and then starting the new ones, preboot changes the order so that it first starts the new web dynos, and makes sure they have started successfully and are receiving traffic before shutting down the old dynos. This means that the application stays responsive during the deploy.
However, preboot adds a fair amount of complexity to the deployment process. With preboot, the old version of the application will run side by side with the new version of the application, the new worker dynos, and the newly migrated database for at least a few minutes. If any of your changes are not backwards compatible with the older version of the application (a deleted or renamed column in the database, for example), the old version of the application will begin experiencing problems during the deploy. There are also a few potential gotchas with some of the add-ons.
In our case, the backwards compatibility issue can be worked around fairly easily. When we have changes that are not backwards compatible, we simply deploy these changes off hours with the preboot feature disabled. The challenge then becomes recognizing when this is necessary (when there are backwards incompatible changes going out). We place the responsibility for identifying this on the author of the change and the person who performs the code review. Both of these people should be familiar enough with the change to know if it will be backwards compatible with the version of the application currently running in production.
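The toggle itself is a one-liner with the Heroku CLI. The app name below is a placeholder, and this assumes the `features:disable`/`features:enable` commands of the current Heroku CLI (this is an ops fragment, not something you would run outside a real Heroku app):

```shell
# Before an off-hours deploy with backwards-incompatible changes,
# turn rolling restarts off (app name is a placeholder):
heroku features:disable preboot --app my-app

# ...deploy, then restore zero-downtime deploys:
heroku features:enable preboot --app my-app
```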
The End Result
With the automated acceptance testing and zero downtime deploys in place, we were finally ready to move to a true “deploy when ready” process. Today, we deploy several times a day, all without the application missing a beat. No more big integration efforts or massive releases. We keep the deploys small, because doing so makes it much easier to diagnose problems when they happen. This deployment process also allows us to be much more responsive to the needs of the business. In the past, it could be up to two weeks before a minor change made it into production. Today, we can deploy that change as soon as it is done, and that’s the way it should be.