Spot is the single largest lever on an EC2 bill — routinely 70-90% off On-Demand for the same hardware — and most teams either avoid it in production out of reclaim anxiety or run it so naively that one capacity event takes out a meaningful slice of the fleet. Neither is necessary. Spot reclaims are a function of how many distinct capacity pools you draw from and how you react to the two-minute notice. This is the playbook I use to put interruption-tolerant production workloads on Spot at scale: diversified mixed instances policies, the allocation strategy that actually minimizes interruptions, the On-Demand floor that keeps you safe, and the drain machinery that turns a reclaim into a non-event. It builds on the Auto Scaling fundamentals (launch templates, lifecycle hooks, instance refresh) covered in the warm pools article; here the focus is purchase options and resilience.
1. Spot mechanics: pools, the two-minute notice, and rebalance
A capacity pool is one combination of (instance type, Availability Zone) in a Region — m6i.large in us-east-1a is a different pool from m6i.large in us-east-1b, and from m6a.large in us-east-1a. Spot prices and availability are set per pool, and EC2 reclaims Spot instances in that pool when it needs the capacity back. This single fact drives the diversification strategy: if your whole fleet sits in one pool, one reclaim hits everything; spread across twenty pools, a reclaim trims a few percent.
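The pool arithmetic above can be sketched in a few lines of Python. This is only an illustration: the type and AZ lists are made up, and the exposure figure assumes capacity is spread evenly across pools, which no allocation strategy guarantees.

```python
from itertools import product

def capacity_pools(instance_types, azs):
    """Return every (instance type, AZ) pair; each pair is one Spot capacity pool."""
    return list(product(instance_types, azs))

def single_pool_exposure(pool_count):
    """Fraction of the fleet lost if one pool is reclaimed, assuming an even spread."""
    if pool_count <= 0:
        raise ValueError("need at least one pool")
    return 1 / pool_count

types = ["m6i.large", "m6a.large", "m5.large", "m5n.large"]  # illustrative
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]             # illustrative
pools = capacity_pools(types, azs)
print(len(pools))                                 # 12
print(f"{single_pool_exposure(len(pools)):.1%}")  # 8.3%
```

Adding one more family or one more AZ multiplies the pool count, which is why widening the list is usually cheaper than any other tuning.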
Two signals warn you before an instance dies, both delivered through instance metadata (IMDS) and EventBridge:
- Rebalance recommendation — an early, best-effort heads-up that this instance is at elevated risk of interruption. It can arrive minutes before the termination notice, giving you time to launch a replacement and drain proactively. It is advisory; not every recommendation is followed by an interruption.
- Spot interruption notice — the hard two-minute warning. You then have ~120 seconds before the instance is stopped or terminated. Last chance to drain.
You read the interruption notice from IMDS on the instance itself. With IMDSv2 (which you should be enforcing), that is a token-authenticated request:
# On the instance. Returns 200 + JSON only when a notice is pending; 404 otherwise.
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 30")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
# => {"action":"terminate","time":"2026-06-08T14:22:00Z"}
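A drain handler usually wants a deadline rather than raw JSON. The following is a minimal sketch, assuming the instance-action body has the shape shown above; the function name and the injected clock are illustrative, not part of any AWS SDK.

```python
import json
from datetime import datetime, timezone

def seconds_until_action(body, now=None):
    """Parse a spot/instance-action document and return (action, seconds left)."""
    doc = json.loads(body)
    # The time field is UTC, in the form 2026-06-08T14:22:00Z.
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return doc["action"], max(0.0, (deadline - now).total_seconds())

notice = '{"action":"terminate","time":"2026-06-08T14:22:00Z"}'
fake_now = datetime(2026, 6, 8, 14, 20, 30, tzinfo=timezone.utc)
print(seconds_until_action(notice, fake_now))  # ('terminate', 90.0)
```

Passing the clock in makes the function testable without waiting for a real notice; in production you would call it with the body returned by the IMDS request.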
Mental model: you do not “prevent” Spot interruptions. You make them statistically rare (diversification + capacity-optimized allocation) and operationally boring (rebalance + a fast, idempotent drain). Design for the reclaim and it stops being scary.
2. Designing a diversified mixed instances policy
The mixed instances policy lets one group pull from many instance types and blend On-Demand with Spot. Diversification is the whole game: more pools means lower interruption rate and faster scale-out, because capacity-optimized allocation has somewhere to go when a pool dries up.
Build the type list across three axes:
- Families: m6i, m6a, m5, m5n. AMD (a) and Intel (i) variants are near-identical for most workloads and double your pool count for free.
- Sizes: large, xlarge, 2xlarge of equivalent total capacity, using capacity weights so the group reasons in vCPU units rather than instance count.
- AZs: every AZ your subnets cover. The subnet list multiplies every type into a new pool per AZ.
resource "aws_autoscaling_group" "web" {
  name                      = "web"
  min_size                  = 6
  max_size                  = 120
  desired_capacity          = 6
  vpc_zone_identifier       = var.private_subnet_ids # spread across >= 3 AZs
  health_check_type         = "ELB"
  health_check_grace_period = 90
  capacity_rebalance        = true # see section 5
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "price-capacity-optimized"
      spot_max_price                           = "" # empty = cap at On-Demand price (correct default)
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
        version            = "$Latest"
      }
      # Weights equal vCPU counts, so the group thinks in vCPU units, not instance count.
      # Each override is multi-line: HCL single-line blocks accept only one argument.
      override {
        instance_type     = "m6i.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m6a.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m5.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m6i.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m6a.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m5n.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m6i.large"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m6a.large"
        weighted_capacity = "2"
      }
    }
  }
}
With weights, desired_capacity = 6 means six units, not six instances: the group can satisfy it with three larges or one xlarge plus a large, whichever pools are cheap and available. min_size and max_size are counted in the same units. Keep weights proportional to real capacity (a 2xlarge is 4x a large) or per-instance load behind the load balancer will skew.
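To check which instance mixes add up to a given number of units, a small Python sketch helps. It assumes the weights from the override list above (one unit per vCPU); the helper name is illustrative.

```python
WEIGHTS = {
    "m6i.2xlarge": 8, "m6a.2xlarge": 8, "m5.2xlarge": 8,
    "m6i.xlarge": 4, "m6a.xlarge": 4, "m5n.xlarge": 4,
    "m6i.large": 2, "m6a.large": 2,
}

def fleet_units(instances):
    """Sum weighted capacity for a mapping of instance type -> instance count."""
    return sum(WEIGHTS[itype] * count for itype, count in instances.items())

print(fleet_units({"m6i.large": 3}))                   # 6
print(fleet_units({"m6a.xlarge": 1, "m6i.large": 1}))  # 6
print(fleet_units({"m5.2xlarge": 1}))                  # 8, more than a target of 6
```

The last case is worth noticing: when the smallest available step is larger than the remaining gap, a weighted group can end up above its desired capacity.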
Rule of thumb: target at least 10 distinct pools (roughly four types across three AZs) before tuning anything else. Below that, capacity-optimized allocation has nothing to optimize and your interruption rate stays high.
A subtle trap: don’t mix wildly different memory-to-vCPU ratios in one group if your app is memory-bound. m, c, and r families are not interchangeable for a JVM with a fixed heap. Diversify within a workload’s resource profile — attribute-based selection (section 7) solves this declaratively.
3. Allocation strategies compared
The Spot allocation strategy decides which pools the group draws from when it launches — the most consequential single setting for interruption rate.
| Strategy | Optimizes for | Interruption rate | When to use |
|---|---|---|---|
| lowest-price | Cheapest pools only | Highest | Almost never for production. Short, fully fault-tolerant batch only |
| capacity-optimized | Deepest-capacity pools | Lowest | Stateful-ish or long-running Spot where reclaim is expensive |
| capacity-optimized-prioritized | Deepest capacity, honoring your priority order | Low | You have a strong type preference (e.g., a type that benchmarks best for your workload) but still want capacity-aware placement |
| price-capacity-optimized | Best balance of low price and deep capacity | Low | Default for almost everything. Cheap Spot without parking in pools about to be reclaimed |
price-capacity-optimized is the right default and what AWS recommends for the general case: strictly better than lowest-price because it weights spare capacity, and better than pure capacity-optimized for most workloads because it doesn’t ignore price to chase the single deepest pool.
Reach for capacity-optimized-prioritized only when priority genuinely matters: say one instance type benchmarks measurably better for your workload and you want the group to prefer it while still respecting real capacity. (Savings Plans and Reserved Instances are not a reason here: they discount On-Demand usage and never apply to Spot.) Your override order then becomes the priority list, honored on a best-effort basis:
instances_distribution {
  spot_allocation_strategy = "capacity-optimized-prioritized"
}
# override order = priority (first = most preferred), but capacity still gates the choice
override { instance_type = "m6i.xlarge" } # preferred (best measured performance for this workload)
override { instance_type = "m6a.xlarge" }
override { instance_type = "m5.xlarge" }
Note lowest-price accepts a spot_instance_pools count (how many cheapest pools to spread across); the capacity-aware strategies ignore it because they evaluate all pools by capacity signal. Don’t set it and expect it to do anything under price-capacity-optimized.
4. Splitting On-Demand base from Spot
Two fields carve the fleet into a guaranteed floor and a Spot-heavy remainder:
- on_demand_base_capacity: an absolute count (in capacity units if you use weights) of On-Demand instances the group always maintains, regardless of Spot availability. This is your blast-radius floor: the capacity that survives a total Spot drought.
- on_demand_percentage_above_base_capacity: of everything launched above the base, what percent is On-Demand vs Spot. 20 means 20% On-Demand / 80% Spot above the floor.
desired = 20 units, on_demand_base_capacity = 4, on_demand_percentage_above_base_capacity = 20
  base:         4 units -> On-Demand (always)
  above base:  16 units -> 20% OD = ~3 units OD, ~13 units Spot
  -----------------------------------------------------------------
  total:       ~7 units On-Demand, ~13 units Spot
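The same split as a small Python function, for plugging in your own numbers. This is a sketch of the arithmetic only: it returns unrounded values, and how the service rounds fractional units to whole instances is deliberately left out.

```python
def split_capacity(desired, base, pct_above_base):
    """Return (on_demand_units, spot_units) before any rounding."""
    above = max(0, desired - base)
    od_above = above * pct_above_base / 100
    on_demand = min(desired, base) + od_above
    return on_demand, desired - on_demand

od, spot = split_capacity(20, 4, 20)
print(f"{od:.1f} On-Demand, {spot:.1f} Spot")  # 7.2 On-Demand, 12.8 Spot
```

Setting pct_above_base to 0 reproduces the "base plus 100% Spot" pattern: split_capacity(20, 4, 0) leaves 4 units On-Demand and 16 on Spot.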
Size the base to the minimum capacity that must survive a worst-case Spot event — for a customer-facing tier, often “enough to serve degraded but non-zero traffic,” e.g. one AZ’s worth. The percentage above base is a dial between cost and steadiness: 100% Spot maximizes savings and is correct for stateless, horizontally scalable tiers with good drain; 20-30% On-Demand smooths the curve for tiers sensitive to a wave of simultaneous reclaims.
A sound pattern for a web fleet: small On-Demand base sized to one AZ, 100% Spot above it, price-capacity-optimized, a wide type list, and capacity_rebalance = true. The base guarantees you never hit zero; Spot does the bulk of the work at a fraction of the cost.
5. Handling interruptions gracefully
Three mechanisms compose into a clean drain. Use all three.
Capacity Rebalance (proactive replacement)
Setting capacity_rebalance = true tells the ASG to act on the rebalance recommendation — it proactively launches a replacement before the two-minute notice, so you are not racing a 120-second clock to find capacity. Pair it with a termination lifecycle hook so the old instance drains rather than vanishing the moment its replacement is healthy.
Lifecycle hook (the drain window)
An EC2_INSTANCE_TERMINATING hook puts the instance into Terminating:Wait and gives your automation a window to deregister and drain before the kill. The mechanics are covered in the warm pools article; the Spot-specific point is that this same hook fires for reclaims, so one drain path covers scale-in and interruption.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name web \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 120 \
  --default-result CONTINUE
For Spot, keep heartbeat-timeout at or under 120s — you do not actually get more than two minutes once the interruption fires, so a longer timeout buys nothing and risks the hook outliving the instance. default-result CONTINUE is correct: if drain logic wedges, let the instance die rather than pinning it.
The drain handler
On Kubernetes, the open-source AWS Node Termination Handler (NTH) packages this pattern: it watches IMDS, or an SQS queue fed by EventBridge, for rebalance recommendations and interruption notices, then cordons and drains the node. On a plain EC2 + ALB fleet you write the equivalent yourself, and the logic is straightforward: deregister from the target group, wait out the deregistration delay, then release the hook:
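#!/usr/bin/env bash
# Runs on the instance; triggered by the interruption/rebalance signal.
set -euo pipefail
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123"
# Resolve this instance's ID from IMDSv2 (set -u would abort on an unset INSTANCE_ID).
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
# 1. Stop new traffic. Connection draining honors deregistration_delay.
aws elbv2 deregister-targets --target-group-arn "$TG_ARN" \
  --targets "Id=$INSTANCE_ID"
# 2. Wait for in-flight requests to finish. The waiter's own retry budget is
#    longer than the notice, so cap it below two minutes with timeout(1).
timeout 100 aws elbv2 wait target-deregistered --target-group-arn "$TG_ARN" \
  --targets "Id=$INSTANCE_ID" || true
# 3. Release the ASG hook so termination proceeds without waiting out the timeout.
#    Tolerate failure: the hook may not be pending when this runs.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name web \
  --lifecycle-action-result CONTINUE \
  --instance-id "$INSTANCE_ID" || true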
Crucial constraint: the target group’s deregistration_delay.timeout_seconds must fit inside two minutes. The ALB default is 300s, which is longer than the entire Spot warning. Set it to 90s for Spot fleets so the drain actually completes before the instance is reclaimed:
resource "aws_lb_target_group" "web" {
  name                 = "web"
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 90 # MUST be < 120 for Spot
}
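The deregistration delay is not the only thing that has to fit inside the notice: the handler also needs time to see the signal and to release the hook. A rough budget check, as a Python sketch; the 5-second detection latency and 10-second release overhead are assumed placeholder values, so measure your own.

```python
SPOT_NOTICE_SECONDS = 120

def drain_fits(dereg_delay_s, detect_latency_s=5, release_overhead_s=10,
               notice_s=SPOT_NOTICE_SECONDS):
    """Return (fits, slack_seconds) for a drain that must finish inside the notice."""
    needed = detect_latency_s + dereg_delay_s + release_overhead_s
    return needed <= notice_s, notice_s - needed

print(drain_fits(90))   # (True, 15)
print(drain_fits(300))  # (False, -195)
```

With the 90-second delay there are about 15 seconds of slack under these assumptions; with the 300-second default the drain cannot finish before the instance is gone.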
6. Spot in containerized fleets
Containers make Spot dramatically safer: the scheduler reschedules a reclaimed task/pod onto surviving capacity in seconds, and you already have health checks and rolling deploys.
ECS capacity providers
For ECS on EC2, attach a capacity provider backed by a mixed-instances ASG and let ECS managed scaling drive it. Run two providers — Spot and On-Demand — and split via a strategy with a base (always On-Demand) and a weight ratio above it. This mirrors on_demand_base_capacity at the ECS layer:
aws ecs put-cluster-capacity-providers \
  --cluster prod \
  --capacity-providers cp-spot cp-ondemand \
  --default-capacity-provider-strategy \
    capacityProvider=cp-ondemand,base=2,weight=1 \
    capacityProvider=cp-spot,weight=4
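To see what a base plus a weight ratio means for a given task count, here is a small Python model of the strategy above. It follows the base-first, then-by-weight description; how remainders are assigned when the count does not divide evenly is an assumption of this sketch, not documented ECS behavior.

```python
def split_tasks(task_count, strategy):
    """Fill each provider's base first, then share the remaining tasks by weight.

    strategy is a list of dicts with keys provider, weight, and optional base.
    Remainders go to the heaviest providers first (an assumption of this sketch).
    """
    placed = {s["provider"]: 0 for s in strategy}
    remaining = task_count
    for s in strategy:
        take = min(s.get("base", 0), remaining)
        placed[s["provider"]] += take
        remaining -= take
    total_weight = sum(s["weight"] for s in strategy)
    if remaining and total_weight:
        shares = {s["provider"]: remaining * s["weight"] // total_weight
                  for s in strategy}
        leftover = remaining - sum(shares.values())
        for s in sorted(strategy, key=lambda s: -s["weight"])[:leftover]:
            shares[s["provider"]] += 1
        for provider, count in shares.items():
            placed[provider] += count
    return placed

strategy = [
    {"provider": "cp-ondemand", "base": 2, "weight": 1},
    {"provider": "cp-spot", "weight": 4},
]
print(split_tasks(12, strategy))  # {'cp-ondemand': 4, 'cp-spot': 8}
```

With 12 tasks, two land on On-Demand as the base and the other ten split 1:4, so roughly two thirds of the service runs on Spot.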
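Set managedTerminationProtection: ENABLED on the providers so a scale-in never terminates an instance that is still running tasks, and managedDraining: ENABLED so ECS drains tasks off an instance before the ASG terminates it. For Spot reclaims specifically, set ECS_ENABLE_SPOT_INSTANCE_DRAINING=true in the ECS agent config so the instance moves to DRAINING as soon as the interruption notice arrives. For Fargate, the equivalent is FARGATE_SPOT: same strategy syntax, no instances to manage, up to 70% off Fargate On-Demand, and the same two-minute SIGTERM-then-drain contract for your container.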
Karpenter consolidation and disruption controls
On EKS, Karpenter handles Spot natively and is best-in-class. You request spot (and optionally on-demand) in the NodePool requirements; Karpenter uses price-capacity-optimized internally and bin-packs aggressively. The controls that matter for stability are the disruption settings — how aggressively it consolidates and replaces nodes, which is where teams accidentally cause churn:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # nodeClassRef (required by the v1 API) is omitted here for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # spot preferred; on-demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%" # cap voluntary disruption to 10% of nodes at once
Karpenter also subscribes to interruption events via an SQS queue, cordons and drains the doomed node, then provisions a replacement: the same proactive-replacement idea as Capacity Rebalance, at the node level. The budgets block is the seatbelt: it caps how many nodes Karpenter voluntarily disrupts at once so consolidation never stampedes your workloads. Annotate pods that must not be moved by consolidation with karpenter.sh/do-not-disrupt: "true", and use Pod Disruption Budgets so voluntary drains respect minimum availability. Neither setting can hold off an actual Spot reclaim, so workloads that cannot tolerate sudden node loss belong on on-demand capacity.
7. Attribute-based instance selection (ABIS)
Hand-maintaining a list of fifteen instance types rots: a new generation ships (m7i) and your overrides are stale. Attribute-based instance selection flips it — you describe the requirements (vCPU range, memory range, exclusions) and EC2 expands them into every matching current and future type. New generations are picked up automatically, which future-proofs the policy and maximizes pool count.
mixed_instances_policy {
  instances_distribution {
    on_demand_base_capacity                  = 2
    on_demand_percentage_above_base_capacity = 0 # 100% Spot above base
    spot_allocation_strategy                 = "price-capacity-optimized"
  }
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.web.id
      version            = "$Latest"
    }
    override {
      instance_requirements {
        vcpu_count {
          min = 4
          max = 16
        }
        memory_mib {
          min = 8192
          max = 65536
        }
        # Pins the mem:vCPU ratio to the m-family's 4 GiB per vCPU.
        memory_gib_per_vcpu {
          min = 4
          max = 4
        }
        cpu_manufacturers     = ["intel", "amd"]
        burstable_performance = "excluded" # no t-family for steady prod load
        instance_generations  = ["current"]
        # accelerator_types, local_storage, network bandwidth, etc. all expressible
      }
    }
  }
}
The ratio bound does the real work here. Absolute vCPU and memory ranges alone still admit a c-family xlarge (4 vCPU, 8 GiB) or an r-family xlarge (4 vCPU, 32 GiB); memory_gib_per_vcpu stops the selector grabbing either when your app needs m-family balance, which is the section 2 trap solved declaratively. Preview exactly which types a requirement set resolves to before shipping it:
aws ec2 get-instance-types-from-instance-requirements \
  --architecture-types x86_64 \
  --virtualization-types hvm \
  --instance-requirements '{
    "VCpuCount":{"Min":4,"Max":16},
    "MemoryMiB":{"Min":8192,"Max":65536},
    "MemoryGiBPerVCpu":{"Min":4,"Max":4},
    "BurstablePerformance":"excluded",
    "InstanceGenerations":["current"]
  }' \
  --query 'InstanceTypes[].InstanceType' --output text
8. Observability and cost
You cannot manage Spot you cannot see. Three things to instrument:
Interruption signal. There is no clean CloudWatch counter for “this instance was interrupted,” so capture it from EventBridge. The interruption event is your source of truth for interruption rate per pool — the number you tune against:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}
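Fan that rule out to a Lambda or Firehose that records instance type, Availability Zone, and timestamp. The event itself carries only the instance ID and action, so the consumer has to look up the type and AZ (for example with DescribeInstances) before the instance is gone. A rising interruption rate concentrated in one or two pools is the signal to widen the type list or drop the bad pools.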
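A minimal Python sketch of the recording step, written so it can run without AWS access. The event's detail carries the instance ID, so the instance type and AZ come from a lookup function that the caller supplies (hypothetical here; in a Lambda it would typically wrap a DescribeInstances call).

```python
from collections import Counter

def interruption_record(event, lookup_instance):
    """Turn an interruption-warning event into a flat record.

    lookup_instance is a caller-supplied function (hypothetical here) that maps
    an instance ID to (instance_type, availability_zone).
    """
    instance_id = event["detail"]["instance-id"]
    instance_type, az = lookup_instance(instance_id)
    return {
        "instance_id": instance_id,
        "instance_type": instance_type,
        "availability_zone": az,
        "time": event["time"],
    }

def interruptions_per_pool(records):
    """Count interruptions by (instance type, AZ) pool."""
    return Counter((r["instance_type"], r["availability_zone"]) for r in records)

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "time": "2026-06-08T14:20:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}
fake_lookup = lambda instance_id: ("m6i.large", "us-east-1a")  # stand-in for a real lookup
record = interruption_record(event, fake_lookup)
print(interruptions_per_pool([record, record]))
```

Dividing these counts by the instance-hours each pool served gives the per-pool interruption rate you tune against.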
Savings tracking. Cost Explorer and the Cost and Usage Report (CUR) carry the Spot effective price per line item. The honest savings number is realized Spot cost vs the On-Demand cost of the same usage — query CUR rather than trusting the “up to 90%” headline. In Cost Explorer, group by Purchase Option to see the On-Demand / Spot / Reserved split at a glance.
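The realized-savings figure is a simple ratio once the line items are in hand. A Python sketch, assuming each line item has already been reduced to usage hours, the Spot rate actually paid, and the On-Demand rate for the same type; the rates below are made up for illustration.

```python
def realized_savings(line_items):
    """Return (spot_cost, on_demand_equivalent, savings_fraction).

    Each line item is (usage_hours, spot_rate_per_hour, on_demand_rate_per_hour).
    """
    spot_cost = sum(hours * spot for hours, spot, _ in line_items)
    od_cost = sum(hours * od for hours, _, od in line_items)
    savings = 0.0 if od_cost == 0 else 1 - spot_cost / od_cost
    return spot_cost, od_cost, savings

items = [(1000, 0.05, 0.20), (500, 0.10, 0.25)]  # made-up hours and rates
spot_cost, od_cost, savings = realized_savings(items)
print(f"{savings:.1%}")  # 69.2%
```

Weighting by usage hours matters: a pool with a deep discount but few hours moves the blended number much less than the headline rate suggests.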
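Fallback-to-On-Demand. A mixed instances policy does not fall back on its own: if Spot is unavailable, the group keeps retrying across the other Spot pools rather than launching On-Demand in place of the Spot portion. What you can rely on is the On-Demand side, which launches regardless of Spot availability; on_demand_allocation_strategy = "prioritized" fills it in override order. So keep a sane base, diversify widely enough that some pool nearly always has capacity, and if you need a true fallback, raise on_demand_percentage_above_base_capacity (by hand or from an alarm) during a drought. The base plus broad diversification is what lets you say “Spot saves us 75% and we never drop below floor capacity” and mean it.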
Enterprise scenario
A video-pipeline team I worked with ran a stateless transcoding fleet — pull job from SQS, transcode a segment, write to S3, ack — on a single On-Demand ASG of c6i.4xlarge, burning roughly $48k/month at peak. Pure Spot was the obvious win, but their first attempt failed badly: they flipped the group to 100% Spot on lowest-price across just c6i.4xlarge and c5.4xlarge in two AZs. During a regional capacity crunch both pools were reclaimed within minutes, ~60% of workers died at once, SQS backed up for an hour, and in-flight segments had to be retried because workers were killed mid-transcode with no drain.
The constraint: jobs took up to 8 minutes, and a worker killed mid-job wasted that work (no checkpointing, and adding it was out of scope). They needed Spot economics without ever losing a large fraction of workers at once, and in-flight jobs had to finish or hand back cleanly.
The fix had three parts. First, real diversification: ABIS bounded to 12-24 vCPU compute-optimized types across all three AZs — roughly 30 pools instead of 4. Second, price-capacity-optimized with capacity_rebalance = true, placing workers in deep pools and proactively replacing any that got a rebalance recommendation. Third — the piece that actually saved the jobs — they moved acknowledgement to the end of processing and used the SQS visibility timeout as the drain mechanism: on the interruption notice a worker stops pulling new jobs and finishes its current segment; if it dies first, the message reappears after the visibility timeout and another worker picks it up. No lifecycle-hook gymnastics, no checkpointing.
instances_distribution {
  on_demand_base_capacity                  = 2 # tiny floor; queue tolerates depth
  on_demand_percentage_above_base_capacity = 0 # 100% Spot above base
  spot_allocation_strategy                 = "price-capacity-optimized"
}
# + capacity_rebalance = true on the ASG
# + SQS VisibilityTimeout = 600 (> max 8-min job), ack only after S3 write
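The worker side of that design can be sketched in Python with the queue and the interruption check passed in as plain functions. All four callables are stand-ins for illustration; a real worker would back them with SQS receive and delete calls and a check of the instance-action endpoint.

```python
def run_worker(receive, process, ack, interrupted):
    """Process jobs until interrupted; ack only after the work is durable.

    receive() returns a job or None, process(job) does the work, ack(job)
    deletes the message, and interrupted() reports whether an interruption
    notice has been seen. All four are caller-supplied stand-ins.
    """
    done = []
    while not interrupted():  # stop pulling new jobs once a notice arrives
        job = receive()
        if job is None:
            break
        process(job)          # if the instance dies here, nothing was acked,
        ack(job)              # so the message reappears after the timeout
        done.append(job)
    return done

queue = ["seg-1", "seg-2", "seg-3"]
acked = []
finished = run_worker(
    receive=lambda: queue.pop(0) if queue else None,
    process=lambda job: None,
    ack=acked.append,
    interrupted=lambda: len(acked) >= 2,  # pretend the notice lands after two jobs
)
print(finished, queue)  # ['seg-1', 'seg-2'] ['seg-3']
```

Because the ack comes last, the only cost of a worker dying mid-job is a redelivery, which is why processing has to be idempotent.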
Result: ~78% compute cost reduction (about $48k/month down to roughly $11k), and a single capacity event now trims a handful of workers instead of 60% of the fleet — the queue absorbs the blip and reclaimed jobs are redelivered. The lesson: for queue-driven work, the cheapest and most reliable “drain” is idempotency plus a visibility timeout, not bespoke lifecycle handling.
Verify
- Confirm pool diversity. Resolve your ABIS requirements and sanity-check the count:
  aws ec2 get-instance-types-from-instance-requirements \
    --architecture-types x86_64 --virtualization-types hvm \
    --instance-requirements file://reqs.json \
    --query 'length(InstanceTypes)'
- Confirm the actual purchase-option mix of running instances in the group (InstanceLifecycle is "spot" for Spot and empty for On-Demand):
  aws ec2 describe-instances \
    --filters "Name=tag:aws:autoscaling:groupName,Values=web" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,InstanceLifecycle,Placement.AvailabilityZone]' \
    --output table
- Verify Capacity Rebalance is on:
  aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names web \
    --query 'AutoScalingGroups[0].CapacityRebalance' # => true
- Verify the drain window fits the notice: confirm deregistration_delay is comfortably under 120s. For queue workers, confirm instead that the SQS visibility timeout exceeds the longest job.
- Force a drill. Trigger an interruption in a non-prod copy with the FIS Spot interruption action and watch the drain run end to end:
  aws fis start-experiment --experiment-template-id EXTxxxxxxxx
  # template uses aws:ec2:send-spot-instance-interruptions to fire a real 2-min notice
- Confirm interruptions are being recorded by checking the EventBridge target (Lambda logs / Firehose) after the drill.