This post is part of a series:
Introducing quality, Part 2: Continuous Integration and Deployment Automation
In the first part of this series we talked about all the changes we made in the Bingo Platform team to try and improve quality in our code base. In this post we will see what happens to the code after we commit it to GitHub and how we improved that process to give us more confidence in our deployments.
Continuous integration is all about integrating our code into a main branch continuously so that we can avoid merge issues when a lot of people are working on the same code base. That practice has already been in place at Tombola for a long time and has been working very nicely for all the teams involved. The next part of the DevOps (yeah, I re-used the overused term) mentality is fast feedback. Developers can get feedback on their local machines but we have a lot of people using, verifying and testing our code. The faster the code gets into their hands, the faster they can give us feedback and get the user story to completion. That means that we need a fast and robust deployment pipeline. As an extra perk, the same pipeline will deploy to live quickly and reliably and offer the flexibility to roll back easily when there is a problem.
Where we started from
As we briefly touched on in the previous post, we already had some automation in place to get our code from the repo to dev and stage environments and then to live (still scripted but with humans initiating the process and approving the Blue/Green swap). We used TeamCity (we’ll abbreviate to TC) to pull code from the repo, run unit tests, build NuGet packages, and push them to Octopus Deploy (OD from now on). Then OD would deploy them automatically to dev and stage environments. Finally some GhostInspector smoke tests would be run on the deployed site.
This process did what it says on the tin: every commit was automatically deployed to dev and stage environments. There were a few issues though:
- There is no overall visibility of the whole process (build, test, deploy). One would have to switch between TC and OD and different build logs to follow the deployment.
- The GhostInspector tests were run after we had deployed to dev, so there was a chance that we could have a broken dev site. That is not optimal as many teams use that environment at the same time.
- The GhostInspector test suite was also independent from the code (i.e. not source controlled) so they were harder to maintain and keep in sync with the code base.
- The OD process had a lot of custom script steps that had been added over time and were out of date. None were source controlled. That made it hard to update them confidently.
- The OD process also handled notifications and manual script running.
- The OD deployment scripts were largely manual ones and the process was a bit confusing for a newcomer.
- Sometimes failures in a non-important step would stop the process (OD can bypass these steps though).
- There were overall close to 50 OD steps that ran sequentially. The whole process took a bit less than 4 minutes per environment.
- The TC pull and build step took anywhere between 4 and 23 (!) minutes. That had to do with the fact that our TC agents are virtual machines that recycle regularly and we could be cleanly pulling a rather large repo on them multiple times a day!
When deploying to live we would just promote the OD release to our inactive live nodes (we have a nice Blue-Green setup for live deploys). Again OD would find the inactive node with a manual script, deploy to it, request manual intervention and then switch nodes. There was no easy way to roll back though (however rarely that was needed); an authorized individual would have to do it manually.
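To make the Blue-Green flow above a bit more concrete, here is a minimal sketch in Python. It is not our actual OD script: the `BlueGreen` class, the group names and the version strings are all made up for illustration. It only shows the idea of deploying to the inactive group, waiting for approval and then switching. The `rollback` method shows one common way of rolling back with this kind of setup (swapping again), which is an assumption on my part rather than something the old process offered.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreen:
    """Tracks which of two live node groups serves traffic (hypothetical model)."""
    versions: dict = field(default_factory=lambda: {"blue": "1.0", "green": "1.0"})
    active: str = "blue"

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str) -> str:
        """Deploy to the inactive group only; live traffic is untouched."""
        target = self.inactive
        self.versions[target] = version
        return target

    def swap(self, approved: bool) -> bool:
        """Switch traffic to the other group, but only after manual approval."""
        if not approved:
            return False
        self.active = self.inactive
        return True

    def rollback(self) -> None:
        """Swap back: the previous release is still sitting on the other group."""
        self.active = self.inactive


env = BlueGreen()
env.deploy("1.1")          # lands on green, blue keeps serving 1.0
env.swap(approved=True)    # green (1.1) is now live
env.rollback()             # blue (1.0) is live again
```

The point of the model is that the previous release is never overwritten by a deploy, so switching back does not need a new deployment.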
One more point worth noting is that the Ops team had to get involved in certain scenarios that hadn’t been automated and that meant the occasional back and forth that inserted unnecessary delays. Overall we had a functional automated process in place, it just needed some love put into it in order to make the most out of it and align it with how our team worked.
Some house cleaning
One of the main pain points that the team recognized was how hard it was to debug and alter the OD process. After some thought it made sense in our minds that OD is not the best tool for continuous integration (it can’t integrate with code repositories for a start). It does provide enormous flexibility in the way one can use it but what it does best is managing deployments. Especially when you have cases of multiple machines, environments and tenants. Then the tool really shines with the complexity it takes out of the picture. TeamCity on the other hand can do everything that we were doing in OD apart from the complex deployments themselves. To be clear, TC can also be configured to manually do everything OD does but the configuration overhead is a lot bigger. So the new plan was to make TC our “command center” and use OD only for what it does best: deploy. We had nice tools, it was time to use them to their full potential.
The main benefits that we were looking for all revolved around better visibility, faster deployments and more control of the deployment by the team itself.
The first thing that a team of more than one person needs is a consistent view of the state of the work. In the case of delivery we wanted to make it as easy as possible for someone to see the state of the delivery of some code or a feature. For that reason a dedicated TC project was created, with a few build configurations at first to do the building and automated testing, and sub-projects for the deployment to each environment. Everything concerning the deployment is right there in one view: Git checkouts, test results, OD deployments, deploy times for every step and overall status of the deployment. If one needs more info they just dig deeper in one build log. The build chains in TC also represent the whole deployment now.
One of the actual team metrics that we could improve with the deployment pipeline was the user story cycle time and the work in progress per developer. These values were a bit higher than we would like them to be. Checking a piece of functionality on the dev and stage environments was slow because of the overall time it took to deploy to them. Consequently the testing from QA and feedback from the stakeholders would be delayed too, leading the developers to pick up more work while waiting and just feeding the beast of delay.
When feedback did arrive, action would either wait for some other piece of work to be completed or the developer would have to do some pretty impressive context switching to respond to everything at once. Most of these issues can be avoided by making sure that when someone commits something to the repo, it will be in dev and stage ready for testing as soon as possible. Feedback can be immediate, developers can focus on one piece of functionality at a time and stories are completed at a much improved rate.
Many small improvements happened towards this goal, including:
- Use of multiple TC agents to execute things in parallel. Snapshot dependencies make sure all the right steps will run in the correct order.
- Blocking the deployment to any environment if any functional tests fail. This way the developer gets the feedback that they broke something a lot earlier and no other teams are affected if something dodgy has passed manual code review.
- Blocking the deployment to any further environments if any smoke tests fail. This is the earliest we can weed out configuration errors.
- Improved our checkout times to be under 30 seconds every time, even if the TC agents get recycled.
- Improved our build times by using a custom build script instead of TC’s default solution template.
- Running automated tests in parallel with the use of NUnit’s multithreading feature.
- Syncing resources in a more efficient manner to avoid unnecessary overwrites.
- Improved notifications to HipChat and email.
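The ordering and blocking behaviour from the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not how TC implements snapshot dependencies: the step names in `CHAIN` and the `run_chain` helper are hypothetical, and the steps that are ready at the same time are run one after the other here, where the real pipeline hands them to separate agents.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical build chain: each step maps to the steps it depends on.
CHAIN = {
    "build": set(),
    "unit_tests": {"build"},
    "functional_tests": {"build"},
    "deploy_dev": {"unit_tests", "functional_tests"},
    "smoke_dev": {"deploy_dev"},
    "deploy_stage": {"smoke_dev"},
}


def run_chain(chain, run_step):
    """Run steps in dependency order and skip anything downstream of a failure.

    `run_step(name)` returns True on success. Returns a dict of step -> status.
    """
    status = {}
    sorter = TopologicalSorter(chain)
    sorter.prepare()
    while sorter.is_active():
        # Steps returned together do not depend on each other, so they
        # could run in parallel; we run them sequentially for brevity.
        for step in sorter.get_ready():
            if any(status[dep] != "passed" for dep in chain[step]):
                status[step] = "skipped"
            else:
                status[step] = "passed" if run_step(step) else "failed"
            sorter.done(step)
    return status


# Simulate a broken functional test: nothing gets deployed anywhere.
result = run_chain(CHAIN, lambda step: step != "functional_tests")
```

In this run `deploy_dev`, `smoke_dev` and `deploy_stage` end up as skipped, which is the gate we wanted: a failing test never reaches an environment that other teams depend on.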
These were incremental improvements that we identified over time while using the pipeline and discussing our team process in sprint retrospectives. There have been many iterations of it since we first started using it and there are still improvements to be made. For example the automated tests run against a simulated web service that runs on IIS Express, which is not the exact same runtime environment that the deployed processes run on. Of course the overall deploy time can always improve. As with our production code, the deployment pipeline is to be taken care of and refactored as needed.
Control the deployment
One last thing that we identified as impeding our release speed was the unnecessary interaction with the Ops team. At the moment we have a centralized team that supports all other teams in the organisation and it is logical that they can get stretched pretty thin at times. We decided to take the implementation of the pipeline into our own hands entirely and let the Ops team just provide and maintain the TC and OD services and the AWS resources.
The main way we did that was to source control as much of the process as possible. Along with the obvious benefits that source controlling provides to any code base (history, accountability, easy approval, tracing of changes) it also allowed anyone in the team to immediately contribute to the deployment using the same processes that we already have in place for actual code. Another benefit of source controlling those scripts and placing them in the same repository as the rest of the solution is that any changes to the deployment that arise as part of some code or configuration changes can now be included in the same git pull request. The pipeline will behave differently according to the snapshot of the code that goes through it! There are of course many things that we can’t source control in TC and OD. Some other CI tools offer a lot more functionality through configuration but replacing TC is not an option at the moment.
Of course we did not source control every single action. In OD for example we opted for default OD steps for deploying the code so that parallel deployment, configuration and retention policy are handled for us by the people that make a living writing those scripts 🙂 Whenever we did have to write something custom though, we made sure that it would always be source controlled.
One more thing that we managed to take ownership of this way is the full deployment to production. The scripts that deploy to AWS ECS were refactored (for readability mostly) and rolling forward and back is now scripted and done by the team. Ops can focus on the thing they do best and not worry about individual teams’ processes. Also, integration with any external service (HipChat, JIRA) can now be experimented with and improved a lot faster.
Where we are now
Of course now we are deploying to live faster than we can finish any single feature. All of the teams at Tombola have been using feature toggles for a long time and now we really see their usefulness. We are at a place where we could push code to live many times a day and have functionality hidden until we are all certain that it is ready to be switched on and that can happen without any further deployments. Just toggling a switch on in the backend will activate it.
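For anyone who has not used feature toggles, the idea fits in a few lines of Python. This is a minimal sketch and not our implementation: the `FeatureToggles` class, the in-memory storage and the `new_lobby_banner` flag are all made up for the example, and a real toggle store would sit behind a back-office screen or a database.

```python
class FeatureToggles:
    """In-memory toggle store (hypothetical stand-in for a real backend)."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown features default to off, so unfinished code stays hidden.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def lobby_banner(toggles: FeatureToggles) -> str:
    """Hypothetical page fragment with the new behaviour behind a toggle."""
    if toggles.is_enabled("new_lobby_banner"):
        return "new banner"
    return "old banner"


toggles = FeatureToggles()
before = lobby_banner(toggles)          # new code is deployed but hidden
toggles.set("new_lobby_banner", True)
after = lobby_banner(toggles)           # switched on with no new deployment
```

Both code paths are live the whole time; only the stored flag changes, which is why switching a feature on needs no further deployment.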
Our confidence in the deployed code is high as far as the main Bingo website is concerned. Our functional tests cover all main functionality and any new functionality that is added. There is a QA member of the team who will stretch our implementation to its limits and help us plug any remaining holes. All of this can happen in fast iterations so that we waste no time waiting for things to get ready for reviewing. We can now boast a deployment time of under 20 minutes to dev and stage environments and we are aiming to reduce that even further. As long as all the necessary resources and services are there, we need no tickets thrown over the wall to the Ops team. The way that all of this work contributes to overall quality is that issues can now be addressed within a few tens of minutes, even if they are in production.
As a great side effect, the whole team has been involved in these operations tasks and has good experience of the whole infrastructure. We have become, dare I say, a more DevOps-enabled team.
We do have a few dependencies to clean up that have built up in the code over the years and make managing the artifacts of the CI pipeline a bit tricky. The code needs separation and independent deployment procedures. With the experience gathered from managing the deployment of the main website, this will be a lot easier in the future.
The possibilities are hard to quantify: even faster builds, builds from git branches before they are even merged, Dockerisation and running of tests in a Selenium Grid, better test coverage, adding infrastructure provisioning to the deployment, less reliance on OD and more direct deployments with AWS technologies, automated deployments to inactive nodes and at some point to production… It can only get better. Whatever path we take, we can now be confident in what we expect from our release pipeline and work towards it.