At tombola, we use Amazon’s ECS (Elastic Container Service) to schedule and orchestrate many of our services. While this has many benefits over traditional application hosting, it also brings new challenges, not least of which is monitoring.
Picture this scenario: you have an application running on a cluster of machines. Those machines are completely redundant, and so are the application processes themselves. Instances of your application process can be run on any machine in the cluster, and you could have multiple processes of the application running on the very same machine. We don’t care. What we DO care about is making sure everything is healthy, and when it isn’t, we want to know about it. Monitoring, right?
I’m not talking about application monitoring (logging, transaction performance, stack traces and so on). I’m referring to container and infrastructure monitoring.
We made a tool for that.
An AWS ECS monitoring tool, would you believe. It’s open source and developed on GitHub. It essentially aggregates service data from the AWS APIs, which lets us view the metrics, logs and events coming from the ECS scheduler and our containers as quickly as possible.
Why did we make a tool?
Currently, Amazon don’t offer anything like this. ECS itself is nothing more than a scheduler (and an agent running on EC2 instances) and leaves much of the tooling up to you. There are many commercial options which promise great insight into your Docker applications, but we weren’t ready for another piece of software which needs to run on our agents. And really, AWS has all the raw data we need; it’s just not in the best format to be used as a dashboard.
What is the ECS monitor good for?
The monitor was built with dashboarding in mind: we want to put this on the TV for the whole team to see. There were several things we found valuable to monitor:
As I mentioned earlier, one of the primary benefits is aggregation. Amazon currently don’t have a single place where you can view the events of all your services together, so the first feature we wanted was an aggregated service event stream.
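To make the idea of an aggregated event stream concrete, here is a minimal sketch in Python. It is not the monitor's actual code. It assumes each service is a dict shaped like an entry in the `services` list returned by the ECS DescribeServices API (a `serviceName`, plus `events` carrying `id`, `createdAt` and `message`); the function name and the `limit` parameter are made up for illustration.

```python
def aggregate_service_events(services, limit=50):
    """Merge per-service event lists into one newest-first stream."""
    merged = []
    seen_ids = set()
    for service in services:
        for event in service.get("events", []):
            # Skip events we have already collected (e.g. from an earlier poll).
            if event["id"] in seen_ids:
                continue
            seen_ids.add(event["id"])
            merged.append({
                "service": service["serviceName"],
                "createdAt": event["createdAt"],
                "message": event["message"],
            })
    # Newest events first, which is what you want at the top of a dashboard.
    merged.sort(key=lambda e: e["createdAt"], reverse=True)
    return merged[:limit]
```

Tagging each event with its service name is the important part: once the lists are merged, that is the only way to tell which service an event came from.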
The ECS scheduler always tries to keep the number of running tasks at the desired count. It will do this forever, regardless of whether or not the containers are unhealthy and being killed frequently; ECS will keep creating more in an attempt to reach and maintain that desired count. When this happens, the problem can be pretty tricky to track down. It’s all too easy to look in the AWS console and see a healthy service, when in reality it’s spamming tasks.
Task churn is a component in the monitor which detects this. The monitor listens for task start events on each service and counts them over a time buffer. If the count goes over a threshold, an alert is raised and the service is flagged as being sick, meaning it should be investigated.
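The churn check described above can be sketched roughly as follows. This is an illustration rather than the monitor's real implementation: the class name, the `window` and `threshold` parameters and the use of a sliding window are all assumptions, and it expects start events to arrive in time order.

```python
from collections import deque
from datetime import timedelta


class TaskChurnDetector:
    """Flags a service as sick when it starts too many tasks in a short time."""

    def __init__(self, window, threshold):
        self.window = window        # how far back to count starts (a timedelta)
        self.threshold = threshold  # starts allowed inside the window
        self._starts = {}           # service name -> deque of start times

    def record_start(self, service, started_at):
        """Record a task start; return True if the service is now churning."""
        starts = self._starts.setdefault(service, deque())
        starts.append(started_at)
        # Drop starts that have fallen out of the time window.
        cutoff = started_at - self.window
        while starts and starts[0] < cutoff:
            starts.popleft()
        return len(starts) > self.threshold


# Hypothetical settings: more than 3 starts in 5 minutes counts as churn.
detector = TaskChurnDetector(window=timedelta(minutes=5), threshold=3)
```

The right window and threshold depend on the service: something that legitimately scales up quickly needs more headroom than a service that normally starts a task once a day.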
Being able to visualise your tasks running on a cluster is useful, particularly if you have specific placement strategies and constraints set on your services. We’ve found it valuable to track the spread of tasks across a cluster. It is also useful when an instance becomes sick: you can quite easily detect which one is playing up and kill it. Like all the features in the ECS monitor, this view updates with the latest state as often as AWS allows. For example, when the scheduler registers a new task, that task appears on the cluster visualisation in a pending state, without you having to refresh or take any action.
If you are using CloudWatch to hold container logs, the ECS monitor lets you view and query log streams. It doesn’t offer anything the AWS console doesn’t, but it’s quicker to access, and 90% of the time you’re only going to be diving in there for debugging purposes.
Cluster and service metrics are another thing we want to keep an eye on, so we have put these in too: cluster metrics over time are shown on charts, alongside realtime counts (or as close to realtime as AWS allows, anyway). The monitor requests this data as frequently as the Amazon API allows.
When a new deployment is detected, it is raised as an alert and displayed in the monitor too.
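The data behind a view like this boils down to grouping tasks by the instance they run on. Here is a small sketch of that grouping, assuming task dicts shaped like the `tasks` entries returned by the ECS DescribeTasks API (`taskArn`, `containerInstanceArn`, `lastStatus`, `group`). The function name is hypothetical and this is not taken from the monitor's source.

```python
from collections import defaultdict


def tasks_by_instance(tasks):
    """Group tasks by the container instance they are placed on."""
    layout = defaultdict(list)
    for task in tasks:
        layout[task["containerInstanceArn"]].append({
            "task": task["taskArn"],
            "status": task["lastStatus"],  # e.g. PENDING, RUNNING, STOPPED
            "group": task.get("group", ""),
        })
    return dict(layout)
```

An instance with an unusually long list, or one whose tasks are mostly stuck in a pending state, is a good candidate for a closer look.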
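One simple way to spot a new deployment is to remember the deployment ids seen on the previous poll and compare them with the current ones. The sketch below assumes a service dict shaped like the DescribeServices response, where each entry in `deployments` has an `id`. The function name is made up, and the monitor itself may detect deployments differently.

```python
def new_deployments(previous_ids, service):
    """Return deployments on the service that were not present last time."""
    return [
        deployment
        for deployment in service.get("deployments", [])
        if deployment["id"] not in previous_ids
    ]
```

After each poll, the caller would raise an alert for anything returned and then store the current set of ids for the next comparison.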
You can check out the GitHub issues to see upcoming features, but if I were to list a few of the big hitters:
- Alerts – If we leave the monitor on overnight, it would be nice to come in in the morning and quickly see whether any alerts were raised overnight which might indicate a problem.
- Compact mode – When there are many services, the current layout gets out of hand. We want to offer an alternative ‘essentials only’ view that is less detailed and puts more emphasis on alerts. It would be much better suited to accounts with many services, all in aid of making this a good tool for dashboards.
- EC2 data – Data and metrics on the EC2 agents themselves, stats such as CPU credits remaining/usage; network in/out and more. We’ve found it’s generally not that important to be watching these metrics constantly, but it’s another great piece of data to diagnose the health of a cluster.
Open source and taking contributions
We find this very useful internally. Hopefully others will find a use for it and would like to contribute to its evolution. Issues and pull requests are always welcome!