Amazon is probably the biggest cloud provider in the industry – they certainly have the most features and are adding more at an amazing rate.
Amongst the long list of services provided under the AWS (Amazon Web Services) banner are:
- Elastic Compute Cloud (EC2) – scalable virtual servers based on the Xen Hypervisor.
- Simple Storage Service (S3) – scalable cloud storage.
- Elastic Load Balancing (ELB) – high availability load balancing and traffic distribution.
- Elastic IP Addresses – re-assignable static IP addresses for EC2 instances.
- Elastic Block Store (EBS) – persistent storage volumes for EC2.
- Relational Database Service (RDS) – scalable MySQL-compatible database services.
- CloudFront – a Content Delivery Network (CDN) for serving content from S3.
- Simple Email Service (SES) – for sending bulk e-mail.
- Route 53 – high availability and scalable Domain Name System (DNS).
- CloudWatch – monitoring of resources such as EC2 instances.
Amazon provides these services in 5 different regions:
- US East (Northern Virginia)
- US West (Northern California)
- Europe (Ireland)
- Asia Pacific (Tokyo)
- Asia Pacific (Singapore)
Each region has its own pricing and feature availability.
Within each region, Amazon provides multiple “Availability Zones”. These different zones are completely isolated from each other – probably in separate data centers, as Amazon describes them as follows:
Q: How isolated are Availability Zones from one another?
Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.
However, unless you have been offline for the past few days, you will no doubt have heard about the extended outage Amazon has been having in their US East region. The outage started on Thursday, 21st April 2011, taking down some big-name sites such as Reddit, Quora, Foursquare and Heroku, and the problems are still ongoing now, nearly two days later – with Reddit and Quora still running in an impaired state.
I have to confess, my first reaction was surprise that such big names didn’t have more redundancy in place. However, once more information came to light, it became apparent that the outage was affecting multiple availability zones – something Amazon’s own description above implies shouldn’t happen.
You may well ask why such sites are not split across regions to give more isolation against such outages. The answer lies in how AWS implements zones and regions. Although isolated, the zones within a single region are close enough together that low-cost, low-latency links can be provided between them. Once you start trying to run services across regions, all inter-region communication goes over the public internet and is therefore comparatively slow, expensive and unreliable, so it becomes much more difficult and expensive to keep data reliably synchronised. This, coupled with Amazon’s claims above about the isolation between zones and their recommended best practices, has led to the common setup of splitting services over multiple availability zones within the same region. What makes this outage worse is that US East is the most popular region, being a convenient location for sites targeting both the US and Europe.
On the back of this, many people are giving both Amazon and cloud hosting a good bashing, in blog posts and on Twitter.
Where Amazon has let everyone down in this instance is that they let a problem (which in this case is largely centered around EBS) affect multiple availability zones, hurting everyone who either had not implemented redundancy or had followed Amazon’s own guidelines and assurances of isolation. Their communication has also been poor: had customers known it would take this long for service to be restored, they might have taken their own measures to get back online much sooner.
In reality though, both Amazon and cloud computing have less to do with this problem than the blame being directed at them suggests. At the end of the day, we work in an industry that is susceptible to failure. Whether you are hosting on bare metal or in the cloud, you will experience failure sooner or later, and the design of any infrastructure needs to take that into account. Failure will happen – it’s all about mitigating the risk through measures like backups and redundancy. There is a trade-off between the cost, time and complexity of implementing multiple levels of redundancy versus the risk of failure and downtime. On each project or infrastructure setup, you need to work out where on this sliding scale you sit.
In my opinion, cloud computing gives us an easy way out of such problems. It provides the ability to spin up new services and server instances within minutes, pay by the hour for them and destroy them when they are no longer required. Gone are the days of ordering servers or upgrades and waiting in a queue for a data center technician to deal with hardware, often with large setup costs and/or contract lock-in. In the cloud, instances can be resized, provisioned or destroyed in minutes, often without human intervention, as most cloud providers also offer an API so users can manage their services programmatically. Under load, instances can be upgraded or additional instances brought online; in quiet periods, instances can be downgraded or destroyed, yielding a significant cost saving. Another huge bonus is that instances can be spun up for development, testing or a one-off intensive task and thrown away afterwards.
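As a toy illustration of this elasticity, the scale-up/scale-down decision can be reduced to a small function. The thresholds and counts below are made-up values for illustration, not a real production policy; in practice the result would drive SDK calls against the provider’s API (for example, `run_instances` and `terminate_instances` in Amazon’s Python SDK).

```python
# A minimal sketch of an elastic scaling decision, assuming a simple
# CPU-threshold policy (all thresholds and limits are illustrative).
# The returned count would then be reconciled against the provider's
# API by launching or terminating instances.

def desired_instance_count(avg_cpu_percent, current_count,
                           scale_up_at=70, scale_down_at=20,
                           min_count=1, max_count=10):
    """Return how many instances we want, given average CPU utilisation."""
    if avg_cpu_percent > scale_up_at and current_count < max_count:
        return current_count + 1          # under load: add an instance
    if avg_cpu_percent < scale_down_at and current_count > min_count:
        return current_count - 1          # quiet period: remove one
    return current_count                  # within bounds: no change
```

For example, `desired_instance_count(85, 3)` returns 4 (scale up), while `desired_instance_count(10, 3)` returns 2 (scale down).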
Being able to spin new instances up in minutes is, however, far less effective if you have to spend hours installing and configuring each instance before it can perform its task. This is especially true if more time is then wasted chasing and debugging problems because something was set up differently or missed during the setup procedure. This is where configuration management tools and the ‘infrastructure as code’ principle come in. Tools such as Puppet and Chef let you describe your infrastructure and configuration in code and have machines or instances provisioned or updated automatically.
Sure, virtual machines and cloud computing have made things a little easier by allowing re-usable machine images: you can set up a certain type of system once and re-use the image for subsequent systems of the same type. This approach is greatly limiting, though – it is time-consuming to later update the image with small changes, hard to cope with small variations between systems, and almost impossible to keep track of which changes have been made to which instances.
Configuration Management tools like Puppet and Chef manage system configuration centrally and can:
- Be used to provision new machines automatically.
- Roll out a configuration change across a number of servers.
- Deal with small variations between systems or different types of systems (web, database, app, dns, mail, development etc).
- Ensure all systems are in a consistent state.
- Ensure consistency and repeatability.
- Easily allow the use of source code control (version control) systems to keep a history of changes.
- Easily allow the provisioning of development and staging environments which mimic production.
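The core idea behind these tools – declare the desired state and only change the system when reality differs from it – can be sketched in a few lines of Python. This is a toy stand-in for what a Puppet or Chef `file` resource does, not their actual implementation:

```python
import os

def ensure_file(path, content):
    """Converge a file to the desired content; return True if a change was made.

    This mirrors the idempotent behaviour of configuration management
    resources: running it twice with the same arguments changes nothing
    on the second run.
    """
    current = None
    if os.path.exists(path):
        with open(path) as f:
            current = f.read()
    if current == content:
        return False                      # already in the desired state
    with open(path, "w") as f:
        f.write(content)
    return True                           # state was corrected
```

Because the function describes an end state rather than a sequence of steps, it can be run repeatedly (and across many machines) without doing harm – the same property that makes Puppet and Chef safe to run on a schedule.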
As time permits, I’ll publish some follow-up posts which go into Puppet and Chef in more detail and look at how they can be used. I’ll also be publishing a review of James Turnbull’s new book, Pro Puppet, which is due to go to print at the end of the month.