This year I was lucky enough to attend AWS re:Invent, hosted at the Venetian and the Mirage in Las Vegas. The scale of the event was astounding, with over 32,000 delegates descending on the city to learn about AWS and, hopefully, gain inspiration from the stories presented by some of the larger AWS customers.

From a developer’s perspective, AWS re:Invent delivered hand over fist. Here is a summary of some of the more notable talks.

Werner Vogels’ Keynote

I was looking forward to attending this keynote the most. For me, Werner Vogels is to large-scale distributed computing what Albert Einstein is to gravity. He still maintains his blog, which is impressive for a man in his position.

His whole keynote was built around transformation, and he wore a pretty decent Autobots shirt to back the message up.

Firstly, he talked about Transformation in Development, Testing & Operations. Companies can no longer afford to wait weeks and months between releases, and the evidence shows that those who release smaller, more frequent increments of functionality improve quality and reduce risk. Automation is key in these areas, and he went on to announce some new AWS features. A few notable ones for me were Amazon EC2 Systems Manager, AWS CodeBuild and AWS X-Ray.

Then he turned to Transformation in Data. He highlighted the evidence that whichever company leverages its data best gains a competitive advantage, and argued that gone are the days when only the biggest, richest companies had access to the resources required to make the best use of their data; that infrastructure is now available to all. He then went on to announce some new services in this space. A few notable ones for me were AWS Batch and AWS Glue.

Then he turned to Transformation in Compute. This was basically a whole host of announcements around AWS Lambda and container management. They included:

  • AWS Step Functions, which gives us the ability to compose AWS Lambda functions into state machines (see the sketch after this list)
  • Lambda@Edge, which gives us the ability to execute JavaScript within CloudFront edge nodes. This allows us to write basic request-processing functions that execute close to the client (from a latency perspective)
  • C# support in AWS Lambda, which gives us the ability to write Lambda functions using C# on .NET Core 1.0. It supports NuGet, so it will lower the barrier to entry for many .NET-only developers.
  • Blox, which is a new open-source scheduler for Amazon ECS.
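
I haven’t used Step Functions yet, but the idea is easy to sketch. The boto3 example below chains two hypothetical Lambda functions into a state machine using the Amazon States Language; the function ARNs, role and names are purely placeholders of mine, not anything from the keynote.

```python
# A rough sketch: chain two hypothetical Lambda functions into a state
# machine using the Amazon States Language and boto3.
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical Lambda ARNs and IAM role, purely for illustration.
definition = {
    "Comment": "Chain two Lambda functions into a simple workflow",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:validate-order",
            "Next": "ChargeCard",
        },
        "ChargeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:charge-card",
            "End": True,
        },
    },
}

response = sfn.create_state_machine(
    name="order-processing",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-execution-role",
)
print(response["stateMachineArn"])
```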

For me, the number of announcements was astounding and further cements our choice of AWS as our hosting platform. It is becoming ever clearer that the boundary between operations and development is increasingly blurred. Developers cannot afford to ignore AWS’ offerings, no matter their discipline. AWS is far from being just a web farm!

I hope to evaluate and blog further on some of these new technologies in the coming months so check back for updates.

From Resilience to Ubiquity – #NetflixEverywhere Global Architecture

Coburn Watson presented the Netflix story and described why they moved to the cloud and the approach they took. He admitted that Netflix were “terrible at building datacentres”, so they decided to make use of AWS. He went on to describe the “failure driven” ethos that Netflix have and how they “never fail the same way twice”. This really resonated with me. In my opinion this is more than just making sure the same mistake doesn’t happen twice. It’s about learning lessons from failures and really understanding why a failure happened from a technical and a human standpoint. It’s about using failures as a source of innovation and direction, again, both technically and personally.

Netflix take this ethos to the ultimate degree by actively disrupting their live environments with what has come to be known as their Simian Army. The well-known Chaos Monkey and Latency Monkey constantly roam their live environments, causing disruption to test that the Netflix architecture can recover from instance outages and/or serious latency issues. He also talked about Chaos Gorilla, which does the same job as Chaos Monkey but with entire AWS regions. The most interesting for me was Conformity Monkey, which scans the Netflix architecture for instances that don’t conform to Netflix best practices and shuts them down; an example might be an instance that isn’t behind an ELB.
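
Netflix haven’t shared that particular code as far as I know, but the “not behind an ELB” rule is simple enough to sketch myself with boto3 against classic load balancers; this is purely my own illustration of the idea, not their implementation.

```python
# A crude, illustrative conformity check (my own sketch, not Netflix's code):
# find running EC2 instances that are not registered with any classic ELB.
import boto3

ec2 = boto3.client("ec2")
elb = boto3.client("elb")

# Every instance id that is registered behind a classic load balancer.
behind_elb = set()
for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
    behind_elb.update(i["InstanceId"] for i in lb["Instances"])

# Walk all running instances and flag any that are not behind an ELB.
nonconforming = []
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]
for reservation in reservations:
    for instance in reservation["Instances"]:
        if instance["InstanceId"] not in behind_elb:
            nonconforming.append(instance["InstanceId"])

print("Non-conforming instances:", nonconforming)
# A real check would notify owners or terminate after a grace period, e.g.:
# ec2.terminate_instances(InstanceIds=nonconforming)
```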

He also talked about Zuul, which gives Netflix an amazing level of fine-grained control over directing traffic around their architecture. It’s basically an edge service, and he explained how no traffic hits an ELB without first passing through Zuul. It’s an interesting service to me, as we have some issues around routing traffic to new environments, largely because we rely on DNS for routing to an extent.

He also talked about some pretty impressive monitoring software, the most notable being Vizceral. This gives Netflix a visual representation of traffic flow between the nodes of their architecture, and he used it to demo failing over a whole region.

How Toyota Racing Development Makes Racing Decisions in Real Time with AWS

Jason Chambers and Philip Loh presented the challenges faced by Toyota Racing Development in making real-time race data available to the race engineers. This was a fascinating problem space: the race cars, weather systems and race track stream huge volumes of data, some of which needs to be made available to the pit crews in real time and some of which doesn’t, while all of it needs to be available to the race engineers pre- and post-race.

TRD made heavy use of Lambda, Firehose and S3 to stream data to the consumers that needed it. Again, resiliency was key here, as losing the data feed during a race would be catastrophic for the engineers and drivers and could ultimately cost them the race. This was the main reason for choosing Lambda and Firehose, as they are serverless technologies with almost 100% uptime.
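
To make the pattern concrete, here is a minimal sketch of that kind of pipeline. This is my own illustration rather than TRD’s code; the delivery stream name and the shape of the incoming event are assumptions.

```python
# A minimal sketch of the pattern (assumptions mine, not TRD's code): a Lambda
# handler that forwards incoming telemetry readings to a Kinesis Firehose
# delivery stream, which in turn lands the data in S3 for later analysis.
import json

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "race-telemetry"  # hypothetical stream name


def handler(event, context):
    # Assume the event carries a list of telemetry readings from car/track/weather.
    records = [
        {"Data": (json.dumps(reading) + "\n").encode("utf-8")}
        for reading in event.get("readings", [])
    ]
    if not records:
        return {"forwarded": 0, "failed": 0}

    # put_record_batch accepts up to 500 records per call.
    response = firehose.put_record_batch(
        DeliveryStreamName=DELIVERY_STREAM,
        Records=records,
    )
    return {"forwarded": len(records), "failed": response["FailedPutCount"]}
```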

Auto Scaling – the Fleet Management Solution for Planet Earth

Hook Hua talked about the unique approach to auto-scaling taken by the JPL Advanced Rapid Imaging and Analysis project. In particular, he spoke about using a reactive auto-scaling policy driven by real Earth events such as earthquakes, and about how AWS makes it easy for them to scale to 100,000 vCPUs when a natural event increases demand on their data-processing pipelines. JPL have vastly improved the availability of the analysed data, but not without encountering some issues at scale.
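
The reactive part might look something like the sketch below. The event feed, group name and sizing rule are all hypothetical, but it shows the general shape of scaling a fleet out in response to an external event.

```python
# A sketch of event-driven scale-out in this spirit: when an external event
# notification arrives (e.g. an earthquake alert), raise the desired capacity
# of the processing fleet. The group name and sizing rule are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")


def on_earthquake_event(magnitude):
    # Crude illustrative rule: scale the worker fleet with event magnitude.
    desired = min(1000, int(magnitude * 100))
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="aria-processing-workers",  # hypothetical name
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
```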

One of the issues JPL faced when auto-scaling to these levels was that they were effectively a market maker in the spot market: a single auto-scaling event could cause the cost of spot instances to more than double. They learned to mitigate this by using spot fleet to diversify the resources they requested.
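
Diversification with spot fleet roughly means spreading the requested capacity across several instance types and availability zones so that no single spot pool absorbs the whole scale-out. A hedged sketch follows; the AMI, role ARN, price and pool choices are placeholders, not JPL’s actual configuration.

```python
# A hedged sketch of spot diversification: spread the requested capacity
# across several instance types and availability zones so no single spot
# pool absorbs the whole scale-out. AMI, role ARN, price and pools are
# placeholders.
import boto3

ec2 = boto3.client("ec2")

launch_specs = [
    {
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": instance_type,
        "Placement": {"AvailabilityZone": az},
    }
    for instance_type in ("c4.4xlarge", "m4.4xlarge", "r4.4xlarge")
    for az in ("us-west-2a", "us-west-2b", "us-west-2c")
]

ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
        "AllocationStrategy": "diversified",  # spread across the pools above
        "TargetCapacity": 500,
        "SpotPrice": "1.00",
        "LaunchSpecifications": launch_specs,
    }
)
```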

Another issue JPL faced was something they called the “thundering herd”. When auto-scaling from tens of instances to thousands within a short time period, any services called while bootstrapping those instances often hit rate limits or were even effectively DDoS’d. The solution was to “jitter” the API calls: every API call made during instance bootstrap was preceded by a random wait.
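
The fix is as simple as it sounds; something along these lines (the helper and the bootstrap call it wraps are hypothetical, just to show the idea):

```python
# A simple sketch of jittered bootstrap calls: each instance sleeps for a
# random interval before calling shared services, so thousands of instances
# launched together don't all hit those services in the same instant.
import random
import time


def call_with_jitter(call, *args, max_jitter_seconds=30, **kwargs):
    """Sleep a random interval, then make the call."""
    time.sleep(random.uniform(0, max_jitter_seconds))
    return call(*args, **kwargs)

# e.g. during instance bootstrap (register_with_config_service is hypothetical):
# call_with_jitter(register_with_config_service, instance_id, max_jitter_seconds=60)
```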

Conclusion

The general message from re:Invent this year was security and resiliency at scale (and where better to get that message across than Las Vegas?). To achieve this, automation is key. We should be developing our “physical” architecture in the same way as we develop our software, and we must stop thinking of infrastructure as static and unchanging. AWS offers us almost infinite scale, and we should make sure we are always in the best possible position to use that scale when needed, because this is what will give us an advantage over our competitors.
