By Adrien Payong and Shaoni Mukherjee
Auto scaling is a cloud computing technique in which the amount of computational resources in a hosting environment is dynamically adjusted based on its current workload. This allows an application’s supporting infrastructure to grow automatically when demand increases and shrink when demand subsides. Adding and removing resources manually is susceptible to human error and often results in either too few resources to support current demand (leading to slow performance or outages) or too many resources in use (leading to wasted money and capacity). Auto scaling addresses this by automatically adding and removing resources as required to maintain performance while optimizing costs.
In this post, we’ll explain how auto scaling works and why it’s important. We’ll examine different types of auto scaling and scaling policies. You’ll also see how major cloud providers support auto scaling, how Kubernetes autoscaling works, common pitfalls, and how to avoid them.
Auto scaling operates by tracking your application’s performance or load metrics, then automatically adding or removing resources when specified conditions are met. At a high level, the platform monitors key metrics, triggers a scaling action when a threshold or schedule condition is met, adds or removes capacity, waits for the system to stabilize (a cooldown period), and then repeats the cycle.
Behind the scenes, the auto-scaling service for each provider will manage the details. In all cases, the pattern is the same: monitor -> trigger -> scale action -> stabilize -> repeat, all based upon your defined policies.
When discussing auto scaling, it’s important to understand the two fundamental ways a system can scale: horizontally (scaling out/in by adding or removing instances) and vertically (scaling up/down by giving a single instance more or less CPU and memory).
All major cloud providers offer auto scaling, keeping applications responsive and costs optimized by automatically scaling virtual machines or containers as required.
Amazon Web Services provides several auto-scaling services, including Amazon EC2 Auto Scaling for EC2 instances in Auto Scaling groups, Application Auto Scaling for services such as ECS, DynamoDB, and Lambda provisioned concurrency, and the unified AWS Auto Scaling service for managing scaling plans across resources.
Auto Scaling can be managed via the AWS Management Console, AWS CLI, SDKs, or infrastructure-as-code tools such as CloudFormation. With a CloudFormation template, you specify an AWS::AutoScaling::AutoScalingGroup resource along with essential properties such as MinSize, MaxSize, and the network subnets where your instances should be placed. You then associate the group with an AWS::AutoScaling::ScalingPolicy, which determines when and how scaling occurs. For example, you might set a policy to maintain average CPU utilization at about 50%, with a cooldown period of 5 minutes to prevent rapid back-to-back scaling actions. The following YAML snippet shows a sample configuration:
Resources:
  MyAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: '2'
      MaxSize: '20'
      DesiredCapacity: '2'
      VPCZoneIdentifier:
        - subnet-xxxxxxxxxxxxxxxxx # Specify your subnet ID(s) here
      LaunchTemplate:
        LaunchTemplateId: !Ref MyLaunchTemplate
        Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
  MyCPUScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref MyAutoScalingGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 50 # Maintain 50% CPU utilization
      Cooldown: '300' # 5-minute cooldown
  MyLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: ami-0c55b159cbfafe1f0 # Example Amazon Linux 2 AMI
        InstanceType: t2.micro
The Auto Scaling group will automatically adjust the number of EC2 instances to meet demand and maintain the target CPU utilization, scaling out or in as needed.
Microsoft Azure provides virtual machine scale sets (VMSS) for scaling virtual machines horizontally, along with an integrated Azure Autoscale service that works with other resources such as App Services, Cloud Services, and more. Azure emphasizes tight integration with other Azure components and strong hybrid cloud support (the same scaling strategy can be applied to both on-premises and cloud deployments). VM scale sets, for example, let you deploy a set of identical VMs that automatically scale out or in based on metrics or schedules.
Google Cloud Platform features managed instance groups (MIGs), which can autoscale virtual machine instances. Google Cloud is also well known for scaling containerized environments via Google Kubernetes Engine (GKE), which provides automatic scaling of both clusters and pods. Managed instance groups can autoscale based on a wide variety of signals (CPU utilization, HTTP load-balancing serving capacity, queue metrics, and more), and they also back GKE node pools so the underlying cluster can scale. The GCP autoscaler applies an initialization period (cool-down) during which it temporarily ignores metrics from newly created instances.
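As a rough sketch of what a managed instance group autoscaler can look like in YAML, the following uses Deployment Manager syntax; the resource name, zone, instance-group URL, and thresholds are illustrative placeholders rather than values from this article:

resources:
  - name: web-autoscaler
    type: compute.v1.autoscaler
    properties:
      zone: us-central1-a
      # URL of an existing managed instance group (placeholder)
      target: https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/instanceGroupManagers/web-mig
      autoscalingPolicy:
        minNumReplicas: 2
        maxNumReplicas: 10
        coolDownPeriodSec: 60      # initialization period applied to new instances
        cpuUtilization:
          utilizationTarget: 0.6   # target 60% average CPU utilization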
DigitalOcean has introduced auto-scaling features for its cloud services. Droplet autoscale pools automatically add or remove Droplet (VM) instances in a pool based on CPU or memory usage, providing managed horizontal scaling. For example, you can define a pool of web server Droplets that targets a steady CPU usage of 60%, and the pool will automatically scale out or in as needed. CPU-based auto scaling is also available in DigitalOcean App Platform, which automatically adds or removes application containers based on a CPU usage threshold.
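For App Platform, autoscaling is configured in the app spec. The excerpt below is a minimal sketch of what the relevant section might look like; the service name, instance size slug, and thresholds are illustrative assumptions, not values from this article:

services:
  - name: web                           # illustrative service name
    instance_size_slug: apps-d-1vcpu-1gb # autoscaling assumes a dedicated instance size (assumed slug)
    autoscaling:
      min_instance_count: 2
      max_instance_count: 5
      metrics:
        cpu:
          percent: 60                   # target average CPU utilization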
Kubernetes is a container orchestration platform that provides auto-scaling capabilities and is widely used in cloud environments (and sometimes on-premises as well). There are two planes of scaling within Kubernetes: the pod level, where the number (or size) of pods for a workload is adjusted, and the cluster level, where worker nodes are added or removed.
Horizontal Pod Autoscaler
Horizontal pod autoscaler (HPA) is a Kubernetes controller that automatically scales the number of pod replicas for a workload (Deployment, ReplicaSet, StatefulSet, etc.) depending on the observed metrics.
It allows you to scale your containerized application pods horizontally. Suppose you have a Deployment running 2 pods of a web application: an HPA can monitor their CPU usage, increase the replica count to, say, 5 pods when CPU usage is high, and scale back down to 2 when the application is idle.
By default, HPA scales based on CPU utilization (reported by the Kubernetes metrics server that collects the CPU/memory usage from nodes). It can also be configured for scaling based on memory usage (through the metrics server) and custom metrics (with the autoscaling/v2 API).
You define a HorizontalPodAutoscaler resource in YAML (or create one with kubectl autoscale). It specifies the target Deployment (or other workload controller), the minimum and maximum number of replicas to scale between, and the target metric threshold. Here’s an example of an HPA configuration targeting CPU usage:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  labels:
    app: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
With this configuration, Kubernetes keeps at least 2 and at most 10 pods running for the my-app Deployment. The number of pods scales up or down to hold the average CPU utilization across all pods at roughly 70%: if per-pod CPU consumption rises above the threshold, more pods are added (up to the maximum of 10 here).
If the observed CPU utilization drops below 70% and the pods are underutilized, the number of pods decreases (but never below the 2 set by minReplicas). Note that the HPA checks metrics at defined intervals, controlled by the horizontal-pod-autoscaler-sync-period flag, which determines how frequently the HPA re-evaluates metrics and updates its scaling decisions. It also applies a stabilization window to avoid rapidly scaling up and down.
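If you want to tune that behavior explicitly, the autoscaling/v2 API lets you set a scale-down stabilization window on the HPA itself. The snippet below is a minimal sketch of the same HPA rewritten for autoscaling/v2; the 5-minute window is an illustrative choice, not a recommendation from this article:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # same 70% CPU target as above
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down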
Cluster Autoscaler (CA)
Now suppose our Kubernetes cluster is out of capacity (no free CPU/memory on any node to schedule new pods). The HPA cannot add nodes on its own; this is where the Cluster Autoscaler comes in. The Cluster Autoscaler runs as a component of a Kubernetes cluster and interacts with the cloud provider (or other infrastructure) to add or remove worker nodes (VMs) based on pending pods. Here are some key points for managing the Kubernetes Cluster Autoscaler:
Best Practices: Each node group (pool) should contain homogeneous instance types, and you typically mark the node groups that are allowed to scale. The autoscaler respects PodDisruptionBudget rules to avoid terminating essential pods (see the sketch below), and it won’t scale a node down if doing so would violate a pod’s requirements (such as a pod that cannot be rescheduled elsewhere).
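For example, a PodDisruptionBudget like the following (a minimal sketch, assuming the my-app labels used earlier) tells the Cluster Autoscaler to keep at least one replica running whenever it drains a node:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1            # never evict below one running replica
  selector:
    matchLabels:
      app: my-app            # matches the pods of the my-app Deployment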
Many cloud-managed Kubernetes offerings run the Cluster Autoscaler for you once it is enabled (DigitalOcean Kubernetes (DOKS), for example, provides a built-in autoscaler for node pools).
Auto scaling can be a confusing subject when trying to differentiate between manual provisioning, policy-driven system scaling, and the cloud-native notion of elasticity. The table below breaks down the differences between manual scaling, auto scaling, and elastic scaling, showing how these three concepts differ in operation, triggers, and precision, along with some real-world examples. DevOps and cloud teams can use these key differentiators to decide which is most appropriate for their workloads.
Scaling Approach | Definition | Trigger Mechanism | Speed & Precision | Drawbacks / Risks | Cloud Examples |
---|---|---|---|---|---|
Manual Scaling | Human-driven adjustments via console or scripts, often involving static or reactive provisioning. | Triggered manually by operators (console, CLI, tickets) | Slow (minutes–hours), often imprecise and conservative | Labor-intensive, error-prone, and costly due to over-provisioning | Manually resizing EC2 or VM instances |
Auto Scaling | Automated policy-driven scaling based on rules, metrics, or schedules. | System-managed via thresholds, target tracking, or scheduled events | Fast (seconds–minutes), consistent, and efficient | Requires accurate configuration; misconfigured policies may lead to instability or excessive cost | AWS Auto Scaling Groups, Azure VM Scale Sets, GCP Managed Instance Groups, Kubernetes HPA/VPA |
Elastic Scaling | The broader concept of matching resources to demand in real time, auto-scaling often supports this behavior. | Implicit in the platform, typically built into serverless or PaaS services | Near-instant (sub-second to seconds), highly efficient | Limited manual control; scaling logic is abstracted away | AWS Lambda, Azure Functions, Google Cloud Functions, DigitalOcean App Platform |
Manual scaling has a slower response time because it depends on a human to trigger scaling activities. Auto scaling can be both proactive and reactive: system-defined scaling rules trigger expansion quickly and shrink capacity as quickly as the underlying infrastructure allows. Elastic scaling is the cloud-native nirvana: resources grow and shrink automatically and almost instantly, driven behind the scenes by autoscaling engines.
Auto scaling policies define the WHEN and HOW of your scaling activity. The right combination of policies keeps the system resilient during spikes, cost-optimized during quiet periods, and highly responsive to fluctuating workloads.
Policy Type | Trigger / Mechanism | Pros | Cons / Caveats | Supported By |
---|---|---|---|---|
Dynamic (Reactive) | Monitors real-time metrics (CPU, memory, latency, queue length). Threshold-based: e.g., “CPU > 70% for 5 min → +2 instances; CPU < 20% for 10 min → –1 instance.” Target tracking: maintain a target value using algorithmic adjustments. | Responds to sudden load spikes. Highly automated and granular | May lag during abrupt surges. Requires careful tuning of thresholds and cooldowns | AWS ASG, Azure Monitor Autoscale, GCP MIG, Kubernetes HPA |
Scheduled Scaling | Time-based actions (e.g., “Every weekday at 8 AM, add instances”). Acts like a cron job for scaling. | Ideal for predictable load cycles. Ensures capacity is ready ahead of time | Cannot react to unexpected events. Requires accurate forecasting of usage | AWS Scheduled Actions, Azure schedule rules, GCP scheduled autoscaling, DigitalOcean Pools |
Predictive Scaling | Uses ML on historical data to forecast demand and proactively scale (e.g., 15–60 min ahead). | Reduces scaling lag. Optimized for recurring patterns. | Requires ≥1–2 weeks of data. May mispredict irregular events | AWS Predictive ASG, Azure Predictive VMSS, GCP Predictive MIG |
Manual (Fixed) | Human-managed; auto scaling disabled or adjusted manually—common during maintenance or debugging. | Full control when needed. Useful during emergencies | No automatic responsiveness. Can lead to inefficient resource usage | Supported by all major clouds as a baseline |
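To make the scheduled policy concrete, here is a minimal CloudFormation sketch that would sit under the Resources section of the earlier template (it assumes the MyAutoScalingGroup resource defined there); the capacities and schedule are illustrative:

MyMorningScaleOut:
  Type: AWS::AutoScaling::ScheduledAction
  Properties:
    AutoScalingGroupName: !Ref MyAutoScalingGroup
    MinSize: 4                     # raise the capacity floor for business hours
    DesiredCapacity: 6
    Recurrence: '0 8 * * 1-5'      # every weekday at 08:00 UTC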
Auto scaling makes cloud management easier, but it can cause problems if not set up correctly. Here are some common issues to look out for:
Adding too many or too few resources
Issue: If your scaling policies are not defined correctly, you may end up overprovisioning (for example, adding too many servers), which wastes money, or underprovisioning, which causes performance issues.
Best Practice: Test and fine-tune your scaling thresholds, minimum and maximum sizes, and cooldowns so capacity tracks actual demand.
Sudden load can lead to delayed scaling
Issue: Auto scaling may respond slowly to a sudden increase in load, because metrics are collected at intervals and new servers take time to provision and boot.
Best Practice: Containers scale faster, since they can be configured to spin up in seconds. You can also schedule scaling ahead of events you know will bring high traffic.
Compatibility issues with legacy systems
Issue: Legacy applications may not be configured to scale horizontally or to interact with orchestration systems. Therefore, auto-scaling such systems may lead to instability and errors.
Best Practice: Test your workloads and their dependencies to determine whether they are cloud-native and stateless before implementing auto-scaling. Refactor legacy applications for scalability where possible, and use manual scaling for any components that cannot be modernized.
Auto scaling is the automated process of adjusting computing resources (VMs, containers, servers) in real time to match the actual workload. It’s used to maintain application performance and optimize costs.
Horizontal scaling involves adding/removing instances (scale out/in), improving fault tolerance and parallelism, while vertical scaling involves increasing/decreasing resources on a single instance (scale up/down).
Triggers are typically metrics such as CPU usage, memory usage, request rate, queue length, or custom application-specific metrics.
Auto scaling dynamically adjusts the resources behind your application based on actual demand. It helps you avoid under- or overprovisioning and ensures that your app can handle incoming traffic. Manual scaling is slow and prone to human error, while elastic scaling is the gold standard of resource management; auto scaling strikes the balance, allowing modern applications to operate efficiently at any scale. Follow best practices, avoid common pitfalls, and tailor auto scaling policies to your workload. When configured properly, auto scaling delivers reliable performance and optimized cost, letting you and your teams focus on innovation, not infrastructure.