I remember the look of pure joy on our developers’ faces when they first discovered AWS. “Jason, you don’t get it,” they said, “I can get my hands on a hundred new servers, put them in any AWS region in the world with the push of a button, and I only pay for what I use!” And so it began: the previously difficult and time-consuming task of getting access to servers and disk became a very simple one.
As we’ve moved from the early days of cloud to mainstream adoption, it’s clear that what started out simple has become quite a challenge to manage. Talk to most enterprise teams working in AWS for over a year and they’ll agree: The cloud can be extremely hard to operationalize at scale.
Why is Cloud Configuration so Complex?
What are the factors that contribute to this operational complexity? There are many reasons getting cloud infrastructure right is difficult, but one of the key pieces is proper cloud provider configuration. And as far as configuration is concerned, we see three key contributors to its complexity.
Complexity #1: Service Sprawl and Inter-Dependencies
- Hundreds of services, thousands of configurations, and interrelations you may not know about.
Complexity #2: More Cloud-Native and Dynamic Applications
- Designing for failure is a good thing, but it introduces even more configuration and code.
Complexity #3: Growing Cloud Teams (People Don’t Know What They Don’t Know)
- Cloud is still constantly changing, so experience is often gained through trial by fire.
This is the first in a three-part series discussing the challenges of managing cloud infrastructure growth, so today we’ll focus on Complexity #1.
Complexity #1: Service Sprawl and Inter-Dependencies
One of the greatest advantages of cloud is the immediate access to a broad array of services, and every year, Amazon introduces a dizzying array of new capabilities. Today, a quick glance reveals hundreds of services (and thousands of potential configuration settings) all managed with APIs, point-and-click, and/or various automation tools. Configuration is king, and in AWS, there is a lot of it!
It used to be that security and access control were handled by the five guys who had keys to the data center and the two people with credentials to the firewall. In the new world, fine-grained, per-service controls are required. Engineers have been given access to systems previously reserved for highly trained (or at least very experienced) individuals, systems that had only a few holes to maintain or plug. The mantra was “behind the firewall, we are safe,” but now there is no more firewall.
The promise of the cloud has always been the ability to move faster and offload the burden of data center management to the people who, let’s face it, are way better than you will ever be at managing a data center. The benefits far outweigh the negatives…unless, of course, you (or any one of the people on your team) make a mistake. In this new world, you must ensure that security groups, VPCs, network ACLs, IAM policies, S3 buckets, your Lambda magic, and any other great new things that you want to use in the cloud are always wired up just right. These services can be hard to configure correctly on their own, and harder still to keep that way.
For example, following best practices for S3 configuration requires that you set an alarm for any changes to any S3 bucket policy. (You’d like to know about any change of state for access to data.) In order to do this correctly, you have to check four different services (CloudTrail, CloudWatch, CloudWatch Logs, and SNS) to get the complete picture and verify your setup. And doing this once is not enough. How do you know that the person you laid off for budgetary reasons is not malicious? Are you sure that S3 policy hasn’t been modified to allow the world to access it? Is there a role somewhere in your thousands of roles that allows a malicious account access to your company secrets? Are all of those RDS databases encrypted? Are they publicly accessible?
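To make the four-service check above concrete, here is a minimal Python sketch. It assumes the values have already been collected from each service; the `AlarmChain` fields, the `find_gaps` helper, and the short list of event names are hypothetical and chosen for illustration, not an official AWS or CloudCoreo definition. Only `eventSource` and `eventName` are real CloudTrail record fields.

```python
from dataclasses import dataclass
from typing import List, Optional

# CloudTrail event names treated as "bucket access changed".
# Assumption: this short list is illustrative, not exhaustive.
POLICY_CHANGE_EVENTS = {"PutBucketPolicy", "DeleteBucketPolicy", "PutBucketAcl"}


def is_bucket_policy_change(event: dict) -> bool:
    """Return True if a CloudTrail record describes a bucket access change."""
    return (
        event.get("eventSource") == "s3.amazonaws.com"
        and event.get("eventName") in POLICY_CHANGE_EVENTS
    )


@dataclass
class AlarmChain:
    """Hypothetical summary of what each of the four services reports."""
    trail_log_group: Optional[str]   # CloudTrail: log group the trail delivers to
    filter_log_group: Optional[str]  # CloudWatch Logs: group the metric filter reads
    filter_metric: Optional[str]     # CloudWatch Logs: metric the filter publishes
    alarm_metric: Optional[str]      # CloudWatch: metric the alarm watches
    alarm_topic: Optional[str]       # CloudWatch: SNS topic the alarm notifies
    topic_subscribers: int           # SNS: subscriptions on that topic


def find_gaps(chain: AlarmChain) -> List[str]:
    """List every broken link between trail, filter, alarm, and topic."""
    gaps = []
    if not chain.trail_log_group:
        gaps.append("CloudTrail is not delivering to CloudWatch Logs")
    elif chain.filter_log_group != chain.trail_log_group:
        gaps.append("no metric filter on the trail's log group")
    if not chain.filter_metric or chain.alarm_metric != chain.filter_metric:
        gaps.append("no alarm on the filter's metric")
    if not chain.alarm_topic:
        gaps.append("alarm has no SNS action")
    elif chain.topic_subscribers == 0:
        gaps.append("SNS topic has no subscribers")
    return gaps
```

The point of the sketch is that an alarm only works if every link holds: a perfectly good metric filter is useless if the topic it eventually notifies has nobody listening.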
If you are part of a typical organization, you started off with a single service, perhaps two. S3 was your data store and you started launching EC2 servers to experiment. Most likely you or your team launched through a web console without really knowing what was going on. That was okay, because you knew you were going to throw the servers away later, but that probably didn’t happen and now those experimental servers are still running today. As your application added a broader array of cloud services (like DynamoDB, SQS, Lambda, and more) you soon discovered your application was not just a box anymore! It’s now a deeply nested group of services and infrastructure, wired up into a smarter system, and many of the resources that you created when you and your team were cloud-naive are a core part of that system. Were they all set up correctly? In the months since you launched them, have new vulnerabilities been discovered that you need to address?
With all of this new sophistication and complexity, a simple misconfiguration can have a ripple effect that can be extremely difficult to recover from. If someone accidentally deletes a load balancer or Route 53 entry…good luck trying to update all the related objects to bring your service back online quickly. It’s not easy to determine where those DNS entries need to be propagated. A resilient system understands the downstream effects of a change and will move quickly to resolve the issues. In a single-person team, this may be relatively straightforward, and perhaps even a two-person team can manage the task with some effort. Large teams, however, face the impossible task of everyone knowing everything that is going on with the rest of the group. Not only is it difficult, but it’s a waste of time and energy.
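One way to picture “understanding the downstream effects of a change” is as a walk over a dependency map. The sketch below is an illustration only: the `depends_on` structure, the resource names, and the `downstream` function are all hypothetical, and a real environment would have to build this map from its actual cloud inventory.

```python
from collections import deque
from typing import Dict, List


def downstream(depends_on: Dict[str, List[str]], changed: str) -> List[str]:
    """Return every resource affected by a change, nearest dependents first.

    depends_on maps a resource to the resources it directly relies on.
    """
    # Invert the map so we can ask "who relies on X?".
    dependents: Dict[str, List[str]] = {}
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(resource)

    affected: List[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:  # breadth-first walk through the dependents
        current = queue.popleft()
        for resource in sorted(dependents.get(current, [])):
            if resource not in seen:
                seen.add(resource)
                affected.append(resource)
                queue.append(resource)
    return affected


# Hypothetical inventory: a DNS record and a scaling group both point at
# the load balancer, and a latency alarm watches the scaling group.
inventory = {
    "dns:www": ["elb:web"],
    "asg:web": ["elb:web"],
    "alarm:latency": ["asg:web"],
}
print(downstream(inventory, "elb:web"))
# ['asg:web', 'dns:www', 'alarm:latency']
```

With a map like this, deleting the load balancer immediately yields the list of objects that need attention, instead of relying on whoever on the team happens to remember how things were wired.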
How to Combat Misconfigurations in the Cloud
Knowledge gaps are a huge reason for mistakes and omissions, and the first step in fighting misconfigurations is to follow the documented best practices outlined by AWS. While training and experience are key, cloud providers evolve their platforms so quickly that finding someone who understands all the services and the relationships between them is nearly impossible. With more configuration and more risk, the need for continuous checking and governance over the entire cloud environment becomes even more critical. Security and correct configuration cannot be established at a single point in time. Every minute between verifications is a minute where drift, mistakes, or malicious intent can affect your cloud environments.
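The idea of continuous checking can be reduced to a very small core: compare what the environment looks like now against what it is supposed to look like, and do it repeatedly. The Python sketch below shows only that comparison step. The flat key/value snapshot format, the setting names, and the `find_drift` function are assumptions made for illustration; they are not how any particular tool stores configuration.

```python
from typing import Any, Dict, Tuple

MISSING = "<absent>"  # placeholder for a setting present on only one side


def find_drift(
    baseline: Dict[str, Any], current: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (expected, actual)} for every setting that differs."""
    drift = {}
    for key in sorted(set(baseline) | set(current)):
        expected = baseline.get(key, MISSING)
        actual = current.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Hypothetical snapshots: SSH was opened to the world and a new RDP rule
# appeared since the baseline was recorded.
baseline = {"sg-web:22": "10.0.0.0/8", "bucket:logs:public": False}
current = {
    "sg-web:22": "0.0.0.0/0",
    "bucket:logs:public": False,
    "sg-web:3389": "0.0.0.0/0",
}
for setting, (expected, actual) in find_drift(baseline, current).items():
    print(f"{setting}: expected {expected}, found {actual}")
```

Run on a schedule, a comparison like this shrinks the window between a change being made and someone finding out about it, which is the gap the paragraph above is concerned with.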
CloudCoreo’s approach is to find and fix these risks continuously. This includes understanding dependent cloud objects and propagating changes through the entire system and lifecycle (all of which we do through code that lives in your team’s git repository).
To get started, we have created a Cloud Scan that reviews your AWS account and reports back any best practice violations we find, like world-readable S3 buckets, open ports, or issues related to IAM user access privileges. We’ve found that no matter how experienced the cloud team, there’s always some unpleasant surprise hiding in the infrastructure.
Next time we’ll dig into Complexity #2: More Cloud-Native and Dynamic Applications.