For the past few months, my team have been working on a project to improve the way we host some of our smaller services and tasks. We have a growing portfolio of small services and tasks (dare I say micro-services…) that had been deployed in a less than efficient manner, both in terms of maintainability and cost. We investigated a few options for streamlining how we develop, deliver and maintain these services, and eventually settled on Docker and ECS.

Docker in its own right was extremely promising to us because of the analogy that Docker themselves use to describe the benefits of deploying applications in containers. A simplified version of that analogy is that containers standardize the process of getting your cargo from point A to point B; we do not need to know what is inside a container to be able to ship it. This idea (apparently) revolutionized the shipping industry and is making waves in the software industry for exactly the same reasons. While the analogy is good enough, it is not perfect, as you will see by reading on.

The first service we approached was our logging service, a very simple Node.js API that our clients use for logging and diagnostics. Dockerizing this API was extremely simple. First we created the Dockerfile as follows:

 

To build the image:

 


To run the app locally (or indeed in any environment):

 


This is great! Everything is contained, and we can configure everything we want in code without any heavy machine setup. So how do we get this into AWS? AWS have a platform for exactly this purpose: ECS.

ECS

ECS is the AWS container orchestration solution. It has tools and APIs for managing the deployment and running of containers. I won’t go into all of the details right now; if you want to read more you can do so here. What I will say, though, is that ECS has some of its own concepts and language around the management of containers, so I’ll explain a few of them.

Task Definition

In ECS a task definition is a description of the resources required for one or more tasks to run and how they are related to each other.
It can be thought of as roughly equivalent to a docker-compose file. In a task definition you describe links, volumes, containers, networking etc.

Task

A task is an instance of an application stack started from any given task definition. You can think of it as a stack created from a given docker-compose file.

Service

A service is the management wrapper for one or more of your tasks. It’s the service that links your tasks to the rest of the AWS infrastructure. The service describes any auto scaling policies, load balancing and placement strategies, and also manages restarting or replacing failed tasks. The deployment of an updated task from a new task definition is also managed at this level.
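To make the relationship between these three concepts a little more concrete, here is a toy model in plain Java. It is only a sketch: the classes are hypothetical and are not part of any AWS SDK, and the reconcile step replaces tasks all at once, whereas a real ECS deployment rolls them over gradually.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are not part of any AWS SDK.

/** Describes what to run; roughly the role of a docker-compose file. */
final class TaskDefinition {
    final String family;
    final int revision;
    final List<String> containerImages;

    TaskDefinition(String family, int revision, List<String> containerImages) {
        this.family = family;
        this.revision = revision;
        this.containerImages = List.copyOf(containerImages);
    }
}

/** One running instance of a task definition. */
final class Task {
    final TaskDefinition definition;
    boolean healthy = true;

    Task(TaskDefinition definition) {
        this.definition = definition;
    }
}

/** Keeps the desired number of healthy tasks running from one task definition. */
final class Service {
    private TaskDefinition definition;
    private final int desiredCount;
    private final List<Task> tasks = new ArrayList<>();

    Service(TaskDefinition definition, int desiredCount) {
        this.definition = definition;
        this.desiredCount = desiredCount;
        reconcile();
    }

    /** Drops failed or outdated tasks, then starts replacements until desiredCount is met. */
    void reconcile() {
        tasks.removeIf(t -> !t.healthy || t.definition != definition);
        while (tasks.size() < desiredCount) {
            tasks.add(new Task(definition));
        }
    }

    /** In this model a deployment is a switch to a new task definition followed by a reconcile. */
    void deploy(TaskDefinition newDefinition) {
        this.definition = newDefinition;
        reconcile();
    }

    List<Task> tasks() {
        return tasks;
    }
}
```

Marking a task as unhealthy and calling reconcile() brings the count back up to the desired number, which is the behaviour we lean on when a container dies.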

Back to our logging API…
Once we had our logging service up and running on ECS we were ready to switch over our client traffic. This is the point where we realised that we had a problem in terms of how we managed our incoming traffic.
Our problem was that the old service was an Elastic Beanstalk application (not a problem on its own) and public traffic was hitting the Beanstalk load balancer directly. The only control we had over this traffic was through DNS.

High level view of the current logging setup

Our only option at this point was to do a DNS switch to the new logging service deployed on ECS.

How we don’t want to switch traffic

This would be risky though. Because resolvers and clients cache DNS records, it could take up to 48 hours before we could be sure that all traffic was where we needed it to be. In addition, rolling back would be just as sluggish and unpredictable. So at this point, we realised that what we needed was an edge service of some description, something that would allow us to do the following:

How we would like to switch traffic

At this point, we could wait the 48 hours that might be necessary, but all of our traffic would still end up at the same service internally. What we’ve introduced is something that gives us full control over where our traffic ends up without having to rely on DNS. Once we think we’re ready to switch, we do the following:

Switching using the proxy

Now all of our traffic is immediately hitting our new service and, if there is a problem, we can immediately roll it back. Once we’re happy, we can remove the legacy service completely.
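The reason this switch is instant is that the proxy owns the mapping from a public path to an internal upstream, so changing the mapping changes where the very next request goes. A minimal sketch of that idea in plain Java follows. RouteTable and its methods are hypothetical names used for illustration, not classes from any proxy library, and the URLs in the usage note are placeholders.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: RouteTable and its methods are hypothetical names,
// not classes from any proxy library.
final class RouteTable {
    private final Map<String, URI> upstreams = new ConcurrentHashMap<>();

    /** Points a path prefix at an upstream and returns the previous target, or null. */
    URI route(String pathPrefix, URI upstream) {
        return upstreams.put(pathPrefix, upstream);
    }

    /** Resolves a request path to an upstream URI, picking the longest matching prefix. */
    URI resolve(String requestPath) {
        String best = null;
        for (String prefix : upstreams.keySet()) {
            if (requestPath.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No route for " + requestPath);
        }
        // The matched prefix is stripped and the remainder is appended to the upstream.
        String remainder = requestPath.substring(best.length());
        return URI.create(upstreams.get(best).toString() + remainder);
    }
}
```

Switching is then a single call such as routes.route("/logs", URI.create("http://ecs.example.internal")), and keeping the returned previous target around gives us the rollback for free.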

Removing the legacy service

This was something we decided we needed. We didn’t, however, want to couple it with the logging service. So what should we use?

Spring Cloud and Netflix Zuul

While attending a talk by Coburn Watson of Netflix at AWS re:Invent 2016 in Las Vegas, I decided to investigate Zuul as a potential candidate. I did look at other solutions such as NGINX and HAProxy, but Zuul looked much more lightweight and, being essentially a Java app, easier to extend and adapt. Also, Spring Cloud publish a Zuul library that makes it extremely easy to get up and running with a simple Zuul implementation that has the proxying functionality we needed (more info on Spring Cloud Netflix here).
What we ended up with was an app that was essentially a couple of lines of code and a couple of lines of config:

 

and the config:

 


This gave us exactly what we needed. We could deploy this as a service at the edge of our network and have it behave as a proxy for our logging service. Moving the traffic from the legacy logging API to the ECS version would take only as long as it takes to update Zuul’s config.

The only problem was that we would be blind to the level of traffic the proxy was handling. We needed visibility of the throughput and health of the logging service, particularly for the initial deployment.

Hystrix

Another library available from Spring Cloud and Netflix is Hystrix. Hystrix is essentially a command-wrapping library that provides circuit breaking and resiliency functionality for any command it wraps.
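If circuit breaking is new to you, the core idea fits in a few lines of plain Java. The sketch below is not Hystrix's implementation or API. The class name, the failure threshold and the open period are all assumptions made for illustration, and it leaves out everything else Hystrix does (thread isolation, timeouts, rolling metrics and so on).

```java
import java.util.function.Supplier;

// Minimal circuit breaker sketch. This is NOT Hystrix's implementation or API;
// the class name, threshold and timing parameters are assumptions for illustration.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the command, or returns the fallback while the circuit is open. */
    synchronized <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // short-circuit: do not call the failing dependency
        }
        try {
            T result = command.get();
            consecutiveFailures = 0; // a success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip (or re-trip) the circuit
            }
            return fallback.get();
        }
    }

    /** The circuit stays open until openMillis has elapsed; after that calls are attempted again. */
    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }
}
```

The useful property for an edge service is the short-circuit branch: once a downstream service is known to be failing, callers get the fallback straight away instead of queueing up behind a dependency that is already struggling.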
When using the config syntax above, Zuul wraps all proxied calls in Hystrix commands without us writing any code. With this configuration, Spring Cloud also creates a /hystrix.stream endpoint on Zuul, which is an SSE (server-sent events) stream of the current state of all proxied services in terms of performance and circuit health. Hystrix streams can be visualized in the Hystrix dashboard, a tool developed by Netflix and enabled in Spring by doing the following:

 

Once an app is started with this annotation (@EnableHystrixDashboard in Spring Cloud Netflix), you can navigate to /hystrix and be presented with the Hystrix dashboard.

Hystrix Dash

Here you can paste in the URL of your /hystrix.stream endpoint and watch Zuul processing requests for your services in real time.

Hystrix with a single host

Turbine

Obviously, we wanted the option to let the proxy scale out and in based on the level of traffic it processes. How do we keep track of all of the different /hystrix.stream endpoints? Again, Netflix have already thought of this in a library called Turbine, and Spring have made integrating it into a Spring app easy. Turbine is a Hystrix stream aggregator: you let Turbine know where your Hystrix streams are and it combines them into a single stream that gives you an aggregated view of all of your proxies.
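Conceptually, the aggregation is nothing more than keeping the latest numbers from each host and summing them per command. The plain Java sketch below illustrates that idea only. CommandSnapshot and ClusterAggregator are hypothetical names, not Turbine classes, and the two counters are a tiny, assumed subset of what a real Hystrix stream carries.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: CommandSnapshot and ClusterAggregator are hypothetical names,
// not Turbine classes, and the counters are a simplified stand-in for stream data.
final class CommandSnapshot {
    final String commandName;
    final long requestCount;
    final long errorCount;

    CommandSnapshot(String commandName, long requestCount, long errorCount) {
        this.commandName = commandName;
        this.requestCount = requestCount;
        this.errorCount = errorCount;
    }
}

final class ClusterAggregator {
    // host -> (command name -> latest snapshot reported by that host)
    private final Map<String, Map<String, CommandSnapshot>> latest = new TreeMap<>();

    /** Records the newest snapshot from a host, replacing its previous one for that command. */
    void accept(String host, CommandSnapshot snapshot) {
        latest.computeIfAbsent(host, h -> new TreeMap<>()).put(snapshot.commandName, snapshot);
    }

    /** Sums the latest per-host numbers into one cluster-wide snapshot per command. */
    Map<String, CommandSnapshot> aggregate() {
        Map<String, CommandSnapshot> totals = new TreeMap<>();
        for (Map<String, CommandSnapshot> perHost : latest.values()) {
            for (CommandSnapshot s : perHost.values()) {
                totals.merge(s.commandName, s, (a, b) -> new CommandSnapshot(
                        a.commandName, a.requestCount + b.requestCount, a.errorCount + b.errorCount));
            }
        }
        return totals;
    }
}
```

Because each host only ever contributes its most recent snapshot, proxies can come and go as the service scales without the totals being double counted.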

Hystrix monitoring multiple hosts via turbine

Turbine automatically discovers any Hystrix streams using Eureka. Eureka is a Netflix-developed service discovery service that integrates with all of these things very easily, particularly when using Spring Cloud.

Conclusion

What we ended up with was a powerful, flexible, extensible, elastic edge service that is easy to monitor in real time. It allows us to manipulate our internal traffic with precision using filters that we can run our traffic through based on any rule we wish.
It also gives us a single, consistent, logical entry point into our network. We can choose where to send traffic internally based on any rule we want, so we can re-route traffic without the client needing to be aware.
It is possible to use this edge service for far more than just proxying, and I recommend that you take a look at the Netflix Zuul repo on GitHub for some interesting use cases.