Spot is the single largest lever on an EC2 bill — routinely 70–90% off On-Demand for the exact same hardware — and most teams either avoid it in production out of reclaim anxiety or run it so naively that one capacity event takes out a meaningful slice of the fleet. Neither is necessary. EC2 Spot sells you spare Amazon capacity at a steep discount with one catch: AWS can take it back with a two-minute notice when it needs that capacity for On-Demand. Whether a reclaim is a non-event or an outage is entirely a function of two things you control — how many distinct capacity pools you draw from, and how you react to the notice. Get those right and Spot stops being scary; get them wrong and you relearn why people fear it.
This is the playbook I use to put interruption-tolerant production workloads on Spot at scale: diversified mixed instances policies, the allocation strategy that actually minimizes interruptions, the On-Demand floor that keeps you safe during a drought, and the drain machinery that turns a reclaim into a non-event. It builds on the Auto Scaling fundamentals — launch templates, lifecycle hooks, instance refresh — covered in the Advanced EC2 Auto Scaling: Warm Pools, Lifecycle Hooks, and Zero-Downtime Instance Refresh article; here the focus narrows to purchase options and resilience: the handful of settings that decide your interruption rate and your blast radius.
By the end you will stop guessing about Spot. You will know why price-capacity-optimized beats lowest-price for almost everything, how to size an On-Demand base to “survive a total Spot drought,” why the ALB deregistration_delay default of 300 seconds silently breaks your drain, and how queue-driven work gets the cheapest possible “drain” for free. Because this doubles as a reference you will return to mid-incident, the allocation strategies, the distribution fields, the interruption signals, the limits, and the failure modes are all laid out as scannable tables — read the prose once, then keep the tables open.
What problem this solves
The pain is concrete and expensive on both sides. On the cost side: a stateless fleet running pure On-Demand is leaving the largest discount AWS offers on the table — for a fleet burning ₹40 lakh/month, that is often ₹25–30 lakh/month of pure waste. On the resilience side: the naive fix — flipping the group to 100% Spot on the two instance types you happen to use — concentrates the whole fleet into two or three capacity pools, so a single regional capacity crunch reclaims 50–70% of your workers in minutes, your queue backs up, and in-flight work is lost because nobody drained.
What breaks without this knowledge: teams run Spot on lowest-price (highest interruption rate), in two AZs (too few pools), with no On-Demand base (no floor when Spot dries up), and with the ALB’s default 300-second deregistration delay (longer than the entire two-minute warning, so drains never finish). Each of those is a single setting away from correct, but the failure only shows up under a real capacity event — which, by definition, is the worst time to be learning this.
Who hits this: anyone running horizontally scalable, interruption-tolerant tiers — stateless web/API fleets behind a load balancer, queue-driven workers (SQS/Kafka consumers), batch and CI fleets, data-processing and transcoding pools, and Kubernetes/ECS data planes. It bites hardest on teams that adopted Spot for the savings headline without designing for the reclaim. The fix is never “hope Spot stays available” — it is diversify across many pools, place by capacity not just price, keep a survivable On-Demand floor, and make the drain idempotent and fast.
To frame the whole field before the deep dive, here is every lever this article covers, what it controls, and the one-line “get it right” rule:
| Lever | What it controls | Naive default that bites | Get-it-right rule |
|---|---|---|---|
| Pool diversity | How many (type, AZ) pools the fleet can use | 2 types × 2 AZs = 4 pools | ≥ 10 pools (4+ types × 3 AZs), or ABIS |
| Allocation strategy | Which pools Spot launches into | lowest-price (cheapest only) | price-capacity-optimized |
| On-Demand base | The floor that survives a Spot drought | 0 (all Spot) | One AZ’s worth of capacity units |
| % On-Demand above base | Smoothing the curve above the floor | 0 or 100 chosen blindly | 0% for stateless; 20–30% if reclaim-sensitive |
| Capacity Rebalance | Proactive replacement before the notice | false (race the 120 s clock) | true + a terminating hook |
| Drain window | Time to deregister + finish in-flight work | ALB deregistration_delay = 300 | < 120 s, or SQS visibility-timeout redelivery |
| Interruption visibility | Per-pool interruption rate to tune against | nothing instrumented | EventBridge → per-pool metric |
Learning objectives
By the end of this article you can:
- Explain Spot capacity pools, the rebalance recommendation vs the two-minute interruption notice, and read the notice from IMDSv2 on the instance.
- Design a diversified mixed instances policy across families, sizes, and AZs, and reason in capacity units with weighted capacities instead of instance count.
- Choose the right Spot allocation strategy for a workload, and articulate exactly why price-capacity-optimized is the default and lowest-price is almost never correct for production.
- Size on_demand_base_capacity and on_demand_percentage_above_base_capacity to a guaranteed floor plus a Spot-heavy remainder, and configure graceful On-Demand fallback during a drought.
- Compose Capacity Rebalance, a lifecycle hook, and a drain handler (NTH) into one idempotent drain path that covers both scale-in and interruption, and set deregistration_delay < 120s.
- Run Spot safely in containerized fleets: ECS capacity providers with a Spot/On-Demand split, FARGATE_SPOT, and Karpenter disruption budgets and consolidation controls on EKS.
- Use attribute-based instance selection (ABIS) to future-proof the type list, and instrument per-pool interruption rate plus realized savings from the Cost and Usage Report.
Prerequisites & where this fits
You should already understand the Auto Scaling fundamentals: a launch template captures the AMI, instance profile, security groups, and user data; an Auto Scaling group (ASG) maintains a desired capacity across subnets/AZs; lifecycle hooks pause an instance in Pending:Wait or Terminating:Wait to run automation; and instance refresh rolls a fleet to a new template. Those mechanics are the subject of the EC2 Auto Scaling, In Depth: Launch Templates, ASGs, Scaling Policies & Lifecycle Hooks and the warm-pools deep dive — this article assumes them and layers purchase options on top. You should also know your way around the EC2 instance families, AMIs, and IMDS, and have aws CLI v2 plus Terraform available.
This sits in the Cost Optimization & Resilience track of the AWS Zero-to-Hero path. It is downstream of the load-balancing fundamentals — your fleet almost always sits behind an Application or Network Load Balancer, and the target group’s drain behaviour is half of safe Spot. It pairs tightly with Resilient Messaging with SQS and SNS (the cheapest drain for queue work is a visibility timeout), with the Graviton arm64 migration guide (arm64 Spot pools are deep and cheap — diversify across architectures too), and with the FinOps Showback and Chargeback Platform on AWS for tracking realized Spot savings. Observability of the interruption signal lives in CloudWatch, CloudTrail & EventBridge.
A quick map of who owns what when you adopt Spot, so the right person tunes the right knob:
| Layer | What lives here | Who usually owns it | What it decides for Spot |
|---|---|---|---|
| Purchase policy | Mixed instances, OD base, %-above-base | Platform / FinOps | Cost split and the survivable floor |
| Allocation strategy | price-capacity-optimized vs others | Platform | Interruption rate and scale-out speed |
| Type list / ABIS | Families, sizes, vCPU/memory bounds | App + platform | Pool count (the whole game) |
| Load balancer | Target group, deregistration_delay | Network / platform | Whether the drain finishes in time |
| Drain handler | NTH / lifecycle hook / SQS visibility | App team | Whether in-flight work survives |
| Orchestrator | ECS capacity providers / Karpenter | Platform | Reschedule speed; container-level drain |
| Observability | EventBridge rule, CUR, Cost Explorer | FinOps / SRE | Per-pool tuning and honest savings |
Core concepts
Five mental models make every later decision obvious.
A capacity pool is one (instance type, Availability Zone) in a Region. m6i.large in us-east-1a is a different pool from m6i.large in us-east-1b, and from m6a.large in us-east-1a. Spot prices and availability are set per pool, and EC2 reclaims Spot capacity in that pool when it needs it back. This single fact drives the entire diversification strategy: if your whole fleet sits in one pool, one reclaim hits everything; spread across twenty pools, a reclaim trims a few percent.
You get two warnings, and they are different. A rebalance recommendation is an early, best-effort heads-up that an instance is at elevated risk of interruption — it can arrive minutes before any termination notice and is your cue to launch a replacement and drain proactively; it is advisory, and not every recommendation is followed by an interruption. The Spot interruption notice is the hard two-minute warning: you have ~120 seconds before the instance is stopped or terminated. Both arrive via instance metadata (IMDS) and via EventBridge.
You do not prevent interruptions — you make them rare and boring. Diversification plus capacity-optimized allocation makes reclaims statistically rare (the fleet lives in deep pools and any single reclaim is a small fraction); a fast, idempotent drain plus proactive replacement makes each reclaim operationally boring (the old instance bleeds off traffic, a replacement is already warm). Design for the reclaim and it stops being an incident.
The fleet thinks in capacity units, not instances. When you mix instance sizes, you assign each a weighted capacity so the ASG reasons in, say, vCPU units. desired_capacity = 24 then means 24 units — satisfiable as twelve larges (weight 2) or three 2xlarges (weight 8) or any mix — and on_demand_base_capacity is also expressed in units. This lets the group satisfy demand from whatever pools are cheap and available without skewing per-instance load behind the load balancer (as long as weights are proportional to real capacity).
The drain must fit inside two minutes. Once the interruption notice fires you have ~120 seconds, full stop. Every drain mechanism — ALB connection draining, a lifecycle hook heartbeat, an SQS visibility timeout — must be sized to complete within that window, or you will be reclaimed mid-drain and lose in-flight work. The ALB’s default deregistration_delay of 300 seconds is the classic trap: it is longer than the entire warning.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Capacity pool | One (instance type, AZ) in a Region | EC2 capacity layer | Reclaims happen per pool; diversify across many |
| Spot price | The discounted per-pool price (≤ On-Demand) | Per pool | You pay the market price, capped at On-Demand |
| Rebalance recommendation | Early “elevated risk” advisory | IMDS + EventBridge | Proactive replacement before the hard notice |
| Interruption notice | The hard 2-minute warning | IMDS + EventBridge | Last chance to drain |
| Mixed instances policy | One ASG drawing many types + OD/Spot | ASG config | The container for all Spot tuning |
| Allocation strategy | Which pools Spot launches into | instances_distribution | The biggest single lever on interruption rate |
| Weighted capacity | A size’s “units” toward desired capacity | override per type | Lets the group reason in vCPU, not count |
| OD base capacity | Guaranteed On-Demand floor (in units) | instances_distribution | Survives a total Spot drought |
| Capacity Rebalance | ASG acts on rebalance recommendations | ASG flag | Proactive replacement, not racing the clock |
| Lifecycle hook | Pause in Terminating:Wait to drain | ASG hook | The drain window for scale-in + interruption |
| NTH | AWS Node Termination Handler | On the instance / EKS | Watches signals, drives the drain |
| ABIS | Attribute-based instance selection | instance_requirements | Describe needs; EC2 expands to all matching types |
Spot mechanics: pools, the two-minute notice, and rebalance
A capacity pool is one combination of (instance type, Availability Zone) in a Region. Spot prices and availability are set per pool, and EC2 reclaims Spot instances in that pool when it needs the capacity back. This is why diversification is the whole game.
Two signals warn you before an instance dies, both delivered through instance metadata (IMDS) and EventBridge. You read the interruption notice from IMDS on the instance itself. With IMDSv2 (which you should be enforcing), that is a token-authenticated request:
# On the instance. Returns 200 + JSON only when a notice is pending; 404 otherwise.
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
-H "X-aws-ec2-metadata-token-ttl-seconds: 30")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/spot/instance-action
# => {"action":"terminate","time":"2026-06-08T14:22:00Z"}
The rebalance recommendation lives at a sibling path (/latest/meta-data/events/recommendations/rebalance) and surfaces earlier. The two signals differ in timing, reliability, and what you should do with each — confusing them is a common design error:
| Property | Rebalance recommendation | Interruption notice |
|---|---|---|
| Timing | Minutes before (best-effort) | ~120 s before |
| Reliability | Advisory; may not be followed by an interruption | Guaranteed; instance will go |
| IMDS path | /events/recommendations/rebalance | /spot/instance-action |
| EventBridge detail-type | EC2 Instance Rebalance Recommendation | EC2 Spot Instance Interruption Warning |
| ASG behaviour | Acted on iff capacity_rebalance = true | Always; instance enters termination |
| Right reaction | Launch a replacement; begin draining proactively | Stop pulling work; finish in-flight; deregister |
| Risk if ignored | You race the 120 s clock to find capacity | You lose in-flight work at T+120 s |
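To make the interruption-notice column concrete, here is a minimal polling sketch for the instance side. It assumes `curl` and `sed` are available; the function names `json_field` and `wait_for_spot_notice`, the 5-second poll interval, and the `--watch` flag are illustrative choices, not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# Sketch: poll IMDSv2 for the Spot interruption notice and report the deadline.
IMDS="http://169.254.169.254/latest"

# Pull one string field out of the small, flat JSON document IMDS returns.
json_field() {  # usage: json_field <json> <key>
  printf '%s' "$1" | sed -n "s/.*\"$2\":\"\([^\"]*\)\".*/\1/p"
}

# Block until a notice is pending, then print its JSON body.
wait_for_spot_notice() {
  local token body
  while true; do
    token=$(curl -s -X PUT "$IMDS/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || true
    # -f turns the 404 "no notice yet" response into a non-zero exit.
    if body=$(curl -sf -H "X-aws-ec2-metadata-token: $token" \
      "$IMDS/meta-data/spot/instance-action"); then
      printf '%s\n' "$body"
      return 0
    fi
    sleep 5  # hypothetical interval; keep it well under the 120 s notice
  done
}

# Only meaningful on a Spot instance, so the loop is opt-in.
if [ "${1:-}" = "--watch" ]; then
  notice=$(wait_for_spot_notice)
  echo "action=$(json_field "$notice" action) deadline=$(json_field "$notice" time)"
fi
```

Run with `--watch` on the instance, the script blocks until a notice appears and then prints the action and deadline, which is the point at which a drain handler would stop pulling work.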
The interruption behaviour itself is configurable per Spot request and decides what “reclaim” actually does to the instance. For an ASG you almost always want terminate; stop/hibernate are for single Spot requests with attached state:
| Behaviour | What happens on reclaim | Restart cost | Use when | Constraint |
|---|---|---|---|---|
| terminate | Instance terminated; ASG launches a fresh one | Full boot on replacement | ASG fleets, stateless/queue workers | Default for ASG; only sane choice for diversified fleets |
| stop | Instance stopped; EBS preserved; restarts later | Boot from stopped state | Single Spot request with local state | Needs persistent root EBS; not for ASG |
| hibernate | RAM flushed to EBS; resumes in-memory state | Resume (faster than cold) | Long warm-up apps on single Spot | Limited instance/AMI support; not for ASG |
A few hard limits and facts about Spot itself are worth pinning down before you design against them:
| Fact / limit | Value | Why it matters |
|---|---|---|
| Interruption notice lead time | ~120 seconds | Every drain mechanism must finish inside this |
| Spot price cap (empty spot_max_price) | Capped at the On-Demand price | The correct default; you never pay more than OD |
| Spot vCPU service quota | Separate from On-Demand vCPU quota | Raise the All Standard Spot quota before scaling |
| Rebalance recommendation guarantee | None (best-effort) | Treat as a bonus, not a contract |
| Reclaim granularity | Per (type, AZ) pool | Diversify across pools to shrink blast radius |
| Free-tier interaction | Spot is already discounted; no extra free tier | Savings come from the discount, not free tier |
| Spot price volatility | Smoothed; changes gradually, not per-bid | You rarely get priced out mid-run with an OD-capped max |
| Block duration (defined-duration Spot) | Retired; no longer available | Don’t design around fixed Spot blocks |
| Persistent vs one-time request | ASG uses one-time requests it re-creates | The ASG, not a persistent request, maintains capacity |
Mental model: you do not “prevent” Spot interruptions. You make them statistically rare (diversification + capacity-optimized allocation) and operationally boring (rebalance + a fast, idempotent drain). Design for the reclaim and it stops being scary.
Designing a diversified mixed instances policy
The mixed instances policy lets one group pull from many instance types and blend On-Demand with Spot. Diversification is the whole game: more pools means lower interruption rate and faster scale-out, because capacity-optimized allocation has somewhere to go when a pool dries up. Build the type list across three core axes (families, sizes, and AZs), then widen it with architecture, generation, and network tier where the workload allows:
| Axis | What to vary | Why it multiplies pools | Watch-out |
|---|---|---|---|
| Families | m6i (Intel), m6a (AMD), m5, m5n | AMD and Intel variants are near-identical for most workloads and double pool count for free | Don’t mix m/c/r if the app is memory-bound |
| Sizes | large, xlarge, 2xlarge of equivalent total capacity | Each size is its own pool; weights let the group blend them | Keep weights proportional to real vCPU |
| AZs | Every AZ your subnets cover (≥ 3) | The subnet list multiplies every type into a new pool per AZ | Some types aren’t in every AZ; ABIS handles this |
| Architecture | x86_64 and arm64 (Graviton) | A whole parallel set of deep, cheap pools | Needs a multi-arch AMI/build |
| Generations | Current + one prior gen (e.g. m6i + m5) | Older gens add pools that are often deeper | Don’t reach back so far that perf drops |
| Network/IO tiers | m5 and m5n (enhanced network) | Sibling variants are extra pools | Only if the workload is indifferent to the difference |
Here is a diversified policy in Terraform. Note price-capacity-optimized, the small On-Demand base, and the weighted overrides:
resource "aws_autoscaling_group" "web" {
  name                      = "web"
  min_size                  = 6
  max_size                  = 120
  desired_capacity          = 6
  vpc_zone_identifier       = var.private_subnet_ids # spread across >= 3 AZs
  health_check_type         = "ELB"
  health_check_grace_period = 90
  capacity_rebalance        = true # proactive replacement

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "price-capacity-optimized"
      spot_max_price                           = "" # empty = cap at On-Demand price (correct default)
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.web.id
        version            = "$Latest"
      }

      # Weighted by vCPU so the group thinks in capacity units, not instance count.
      # A single-line HCL block may hold only one argument, so each override spans lines.
      override {
        instance_type     = "m6i.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m6a.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m5.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m6i.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m6a.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m5n.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m6i.large"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m6a.large"
        weighted_capacity = "2"
      }
    }
  }
}
With weights, desired_capacity = 6 means six units, not six instances: the group can satisfy it with three larges (3 × 2), or one xlarge plus one large (4 + 2), whichever pools are cheap and available. Keep weights proportional to real capacity (a 2xlarge is 4× a large) or per-instance load behind the load balancer will skew.
How weighted capacity resolves, worked out so the math is unambiguous:
| desired_capacity (units) | Type chosen | Weight | Instances launched | Notes |
|---|---|---|---|---|
| 8 | m6i.2xlarge | 8 | 1 | One big instance satisfies it |
| 8 | m6i.large | 2 | 4 | Four small instances satisfy it |
| 8 | mix: 2xlarge + 2×large | 8 + 2 + 2 | 3 (= 12 units) | Group may slightly overshoot to fill |
| 24 | m6i.xlarge | 4 | 6 | Even split |
| 24 | mix across pools | various | several | Capacity-aware placement picks deep pools |
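The single-type rows of the table reduce to a ceiling division, sketched here in shell. The function names `instances_needed` and `units_delivered` are hypothetical, and the sketch deliberately ignores mixed-type fills, which depend on what the allocation strategy picks.

```shell
#!/usr/bin/env bash
# Instances of a single weight needed to cover a desired number of capacity units.
# Rounds up, because a partial instance cannot be launched.
instances_needed() {  # usage: instances_needed <desired_units> <weight>
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Units actually delivered, which can overshoot the target.
units_delivered() {  # usage: units_delivered <desired_units> <weight>
  echo $(( $(instances_needed "$1" "$2") * $2 ))
}

for weight in 2 4 8; do
  echo "desired=24 weight=$weight -> $(instances_needed 24 "$weight") instances"
done
echo "desired=6 weight=4 -> $(units_delivered 6 4) units delivered"
```

The last line shows the overshoot case: six units cannot be met exactly by weight-4 instances, so two are launched and eight units are delivered.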
Four traps to design around:
| Trap | What goes wrong | Fix |
|---|---|---|
| Too few pools | < 10 pools → capacity-optimized has nothing to optimize; interruption rate stays high | ≥ 10 pools (4+ types × 3 AZs), or ABIS |
| Mixed memory:vCPU ratios | m/c/r aren’t interchangeable for a JVM with a fixed heap; the group grabs a starved type | Diversify within a resource profile; bound mem:vCPU via ABIS |
| Weights not proportional | A 2xlarge weighted 1 gets the same LB traffic as a large → overload | Weight by real vCPU (large=2, xlarge=4, 2xlarge=8) |
| Wildly different sizes | A single huge instance carries too much of the fleet | Keep the size spread within ~4× |
Rule of thumb: target at least 10 distinct pools (roughly four types across three AZs) before tuning anything else. Below that, capacity-optimized allocation has nothing to optimize and your interruption rate stays high.
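The rule of thumb is simple enough to check in a pre-deploy script. This is a rough sketch: `pool_count` and `enough_pools` are hypothetical helper names, and it assumes every type is offered in every AZ, which is not always true in practice.

```shell
#!/usr/bin/env bash
# Count (instance type, AZ) capacity pools and compare against the 10-pool floor.
# Assumes every type is offered in every AZ; real availability can be lower.
pool_count() {  # usage: pool_count <num_types> <num_azs>
  echo $(( $1 * $2 ))
}

enough_pools() {  # usage: enough_pools <num_types> <num_azs> [min_pools]
  local min="${3:-10}"
  [ "$(pool_count "$1" "$2")" -ge "$min" ]
}

# The naive fleet: 2 types x 2 AZs.
if enough_pools 2 2; then echo "ok"; else echo "too few pools: $(pool_count 2 2)"; fi

# The eight overrides from the Terraform policy, across 3 AZs.
if enough_pools 8 3; then echo "ok: $(pool_count 8 3) pools"; fi
```

Treat the result as an upper bound: a type missing from one AZ quietly removes a pool, so it is worth confirming offerings per AZ before relying on the count.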
Allocation strategies compared
The Spot allocation strategy decides which pools the group draws from when it launches — the most consequential single setting for interruption rate. There are four, and for production the choice is almost always price-capacity-optimized:
| Strategy | Optimizes for | Interruption rate | Honors priority? | When to use |
|---|---|---|---|---|
| lowest-price | Cheapest pools only | Highest | No | Almost never for production. Short, fully fault-tolerant batch only |
| capacity-optimized | Deepest-capacity pools | Lowest | No | Stateful-ish or long-running Spot where a reclaim is expensive |
| capacity-optimized-prioritized | Deepest capacity, honoring your order | Low | Yes (override order) | Strong type preference (e.g. one family benchmarks best for the app) but still capacity-aware |
| price-capacity-optimized | Best balance of low price and deep capacity | Low | No | Default for almost everything. Cheap Spot without parking in soon-to-be-reclaimed pools |
price-capacity-optimized is the right default and what AWS recommends for the general case: strictly better than lowest-price because it weights spare capacity, and better than pure capacity-optimized for most workloads because it doesn’t ignore price to chase the single deepest pool.
Reach for capacity-optimized-prioritized only when priority genuinely matters: say one family benchmarks measurably better for your workload, and you want the group to prefer it while still respecting real capacity. Your override order then becomes the priority list:
instances_distribution {
spot_allocation_strategy = "capacity-optimized-prioritized"
}
# override order = priority (first = most preferred), but capacity still gates the choice
override { instance_type = "m6i.xlarge" } # preferred (best fit for this workload)
override { instance_type = "m6a.xlarge" }
override { instance_type = "m5.xlarge" }
A decision table to pick the strategy from the workload’s properties:
| If the workload is… | …and a reclaim is… | Choose | Because |
|---|---|---|---|
| Stateless web/API behind an LB | Cheap (LB reroutes in seconds) | price-capacity-optimized | Cheapest Spot with low churn |
| Queue-driven, idempotent | Cheap (message redelivered) | price-capacity-optimized | Same; queue absorbs the blip |
| Long-running job, no checkpoint | Expensive (work lost) | capacity-optimized | Maximize time-to-reclaim |
| Tuned for one preferred family | Moderate | capacity-optimized-prioritized | Prefer that family, stay capacity-aware |
| Short, fully fault-tolerant batch | Trivial | lowest-price (or PCO) | Only case lowest-price is defensible |
One gotcha: lowest-price accepts a spot_instance_pools count (how many of the cheapest pools to spread across); the capacity-aware strategies ignore it because they evaluate all pools by capacity signal. Don’t set it and expect it to do anything under price-capacity-optimized:
| Setting | Applies to | Default | Effect | Gotcha |
|---|---|---|---|---|
| spot_allocation_strategy | All | lowest-price (legacy default) | Picks the pool-selection algorithm | Set it explicitly; the legacy default is the worst one |
| spot_instance_pools | lowest-price only | 2 | Spread across N cheapest pools | Silently ignored by capacity-aware strategies |
| spot_max_price | All | “” (= On-Demand) | Cap on the per-pool price you’ll pay | Empty is correct; a low cap shrinks your pools |
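| on_demand_allocation_strategy | On-Demand portion | prioritized for an explicit type list; lowest-price with ABIS | How OD instances are placed | prioritized follows your override order for predictable fallback; it is not available with ABIS |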
Splitting On-Demand base from Spot
Two fields carve the fleet into a guaranteed floor and a Spot-heavy remainder. Understanding exactly what each does — and that they operate on units when you use weights — is the difference between a safe floor and an accidental all-Spot fleet:
| Field | Type | What it guarantees | Sizing guidance |
|---|---|---|---|
| on_demand_base_capacity | Absolute count (capacity units) | A floor of On-Demand that survives a total Spot drought | The minimum capacity that must always serve, often one AZ’s worth |
| on_demand_percentage_above_base_capacity | Percent (0–100) | Of capacity above the base, the OD/Spot split | 0% for stateless+drain; 20–30% if reclaim-sensitive |
| on_demand_allocation_strategy | lowest-price or prioritized | How the OD portion is placed | prioritized for predictable fallback during a drought |
Worked example of how the split resolves:
desired = 20 units, on_demand_base_capacity = 4, on_demand_percentage_above_base = 20
base: 4 units -> On-Demand (always)
above base: 16 units -> 20% OD = ~3 units OD, ~13 units Spot
-----------------------------------------------------------------
total: ~7 units On-Demand, ~13 units Spot
The arithmetic across a range of settings, so you can pick numbers with intent:
| desired | base | %-above | OD units (base + above) | Spot units | OD share |
|---|---|---|---|---|---|
| 20 | 0 | 0 | 0 | 20 | 0% (all Spot) |
| 20 | 4 | 0 | 4 | 16 | 20% |
| 20 | 4 | 20 | ~7 | ~13 | ~35% |
| 20 | 4 | 100 | 20 | 0 | 100% (no Spot above floor) |
| 40 | 8 | 25 | ~16 | ~24 | ~40% |
| 100 | 10 | 10 | ~19 | ~81 | ~19% |
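The rows above can be reproduced with a few lines of integer arithmetic. This is a sketch under one stated assumption: the On-Demand share above the base is rounded to the nearest unit, which matches the approximate figures in the table but is not a documented guarantee of how the ASG rounds. The helper name `od_split` is hypothetical.

```shell
#!/usr/bin/env bash
# Split desired capacity units into On-Demand and Spot.
# Prints "<od_units> <spot_units>".
od_split() {  # usage: od_split <desired_units> <base_units> <pct_above_base>
  local desired=$1 base=$2 pct=$3
  # The base can never exceed the desired capacity.
  if [ "$base" -gt "$desired" ]; then base=$desired; fi
  local above=$(( desired - base ))
  # Round to the nearest unit (assumption; see the lead-in).
  local od_above=$(( (above * pct + 50) / 100 ))
  local od=$(( base + od_above ))
  echo "$od $(( desired - od ))"
}

# Reproduce a few rows of the table.
for row in "20 4 0" "20 4 20" "40 8 25" "100 10 10"; do
  # shellcheck disable=SC2086  # intentional word splitting of the row
  read -r od spot <<< "$(od_split $row)"
  echo "desired/base/pct = $row -> on-demand=$od spot=$spot"
done
```

Keeping the arithmetic in a helper like this makes it easy to sanity-check a proposed base and percentage before changing the group.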
Size the base to the minimum capacity that must survive a worst-case Spot event — for a customer-facing tier, often “enough to serve degraded but non-zero traffic,” e.g. one AZ’s worth. The percentage above base is a dial between cost and steadiness:
| Profile | base sizing | %-above | Net effect |
|---|---|---|---|
| Stateless web, good drain | One AZ’s worth | 0% | Max savings; floor survives drought; LB reroutes reclaims |
| Reclaim-sensitive tier | One AZ’s worth | 20–30% | Smooths a wave of simultaneous reclaims at modest extra cost |
| Queue workers, idempotent | Tiny (queue tolerates depth) | 0% | Cheapest; queue absorbs reclaim blips |
| Latency-critical, thin margins | Larger floor | 30–50% | More steady-state OD; smaller Spot upside |
A sound pattern for a web fleet: small On-Demand base sized to one AZ, 100% Spot above it, price-capacity-optimized, a wide type list, and capacity_rebalance = true. The base guarantees you never hit zero; Spot does the bulk of the work at a fraction of the cost.
Handling interruptions gracefully
Three mechanisms compose into a clean drain. Use all three. Here is how they relate before the detail:
| Mechanism | Trigger it acts on | What it does | Without it… |
|---|---|---|---|
| Capacity Rebalance | Rebalance recommendation | Launches a replacement before the hard notice | You race the 120 s clock to find capacity |
| Lifecycle hook | Termination (scale-in + interruption) | Pauses in Terminating:Wait for your drain | The instance vanishes the moment it’s marked for death |
| Drain handler (NTH) | IMDS/EventBridge signals | Deregisters from the LB, waits, releases the hook | Nothing actually drains; in-flight requests drop |
Capacity Rebalance (proactive replacement)
Setting capacity_rebalance = true tells the ASG to act on the rebalance recommendation — it proactively launches a replacement before the two-minute notice, so you are not racing a 120-second clock to find capacity. Pair it with a termination lifecycle hook so the old instance drains rather than vanishing the moment its replacement is healthy.
Lifecycle hook (the drain window)
An EC2_INSTANCE_TERMINATING hook puts the instance into Terminating:Wait and gives your automation a window to deregister and drain before the kill. The mechanics are covered in the warm pools article; the Spot-specific point is that this same hook fires for reclaims, so one drain path covers scale-in and interruption.
aws autoscaling put-lifecycle-hook \
--lifecycle-hook-name drain-on-terminate \
--auto-scaling-group-name web \
--lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
--heartbeat-timeout 120 \
--default-result CONTINUE
For Spot, keep heartbeat-timeout at or under 120 s — you do not actually get more than two minutes once the interruption fires, so a longer timeout buys nothing and risks the hook outliving the instance. default-result CONTINUE is correct: if drain logic wedges, let the instance die rather than pinning it. The hook knobs and their Spot-correct values:
| Hook setting | What it controls | ASG default | Spot-correct value | Why |
|---|---|---|---|---|
| lifecycle-transition | When the hook fires | none (required) | EC2_INSTANCE_TERMINATING | Covers scale-in and interruption |
| heartbeat-timeout | Max wait in Terminating:Wait | 3600 s | ≤ 120 s | You don’t get more than 2 min anyway |
| default-result | What happens on timeout | ABANDON | CONTINUE | Let the instance die rather than pin it |
| notification-target-arn | Where the hook event goes (optional) | none | SNS/SQS if you fan out | For centralized drain orchestration |
The drain handler
The most robust pattern for VM fleets is the open-source AWS Node Termination Handler (NTH), which watches IMDS and EventBridge for rebalance recommendations and interruption notices and triggers a drain. On a plain EC2 + ALB fleet the logic is straightforward — deregister from the target group, wait out the deregistration delay, then release the hook:
#!/usr/bin/env bash
# Runs on the instance; triggered by the interruption/rebalance signal.
set -euo pipefail
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123"
# Resolve this instance's ID from IMDSv2.
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
# 1. Stop new traffic. Connection draining honors deregistration_delay.
aws elbv2 deregister-targets --target-group-arn "$TG_ARN" \
  --targets "Id=$INSTANCE_ID"
# 2. Wait (bounded) for in-flight requests to finish. The waiter's own polling
#    limit is far longer than the notice, so cap it with timeout.
timeout 100 aws elbv2 wait target-deregistered --target-group-arn "$TG_ARN" \
  --targets "Id=$INSTANCE_ID" || true
# 3. Release the ASG hook so termination proceeds without waiting out the timeout.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name web \
  --lifecycle-action-result CONTINUE \
  --instance-id "$INSTANCE_ID"
Crucial constraint: the target group’s deregistration_delay.timeout_seconds must fit inside two minutes. The ALB default is 300 s, which is longer than the entire Spot warning. Set it to 90 s for Spot fleets so the drain actually completes before the instance is reclaimed:
resource "aws_lb_target_group" "web" {
name = "web"
port = 8080
protocol = "HTTP"
vpc_id = var.vpc_id
deregistration_delay = 90 # MUST be < 120 for Spot
}
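For a target group that already exists, the same attribute can be changed from the CLI. The wrapper below is a sketch: `set_drain_delay` is a hypothetical helper that refuses any value that would not fit inside the notice before it calls `aws elbv2 modify-target-group-attributes`, and the ARN shown is the placeholder used earlier in this article.

```shell
#!/usr/bin/env bash
SPOT_NOTICE_SECONDS=120

# Set deregistration_delay on a target group, refusing values that cannot
# finish inside the Spot interruption notice.
set_drain_delay() {  # usage: set_drain_delay <target_group_arn> <seconds>
  local tg_arn=$1 seconds=$2
  if [ "$seconds" -ge "$SPOT_NOTICE_SECONDS" ]; then
    echo "refusing: ${seconds}s does not fit in the ${SPOT_NOTICE_SECONDS}s notice" >&2
    return 1
  fi
  aws elbv2 modify-target-group-attributes \
    --target-group-arn "$tg_arn" \
    --attributes "Key=deregistration_delay.timeout_seconds,Value=${seconds}"
}

# The ALB default of 300 s is rejected before any API call is made.
set_drain_delay \
  "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123" 300 \
  || echo "pick a value under ${SPOT_NOTICE_SECONDS}s, e.g. 90"
```

Calling `set_drain_delay <arn> 90` with valid credentials applies the change; the guard exists so that a copy-pasted default never reaches a Spot fleet.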
NTH runs in two modes; pick by whether you operate VMs or Kubernetes:
| NTH mode | Runs as | Watches | Drains by | Best for |
|---|---|---|---|---|
| IMDS mode | A daemon on each instance | Local IMDS (/spot/instance-action, rebalance) | Your hook script (deregister, complete-lifecycle) | Plain EC2 + ALB/NLB fleets |
| Queue-processor mode | A central deployment | An SQS queue fed by EventBridge | Cordoning/draining the K8s node | EKS clusters (managed node groups) |
The time budget inside the two-minute notice, so every component fits:
| Step | Typical duration | Runs in | Must finish by |
|---|---|---|---|
| Signal received (IMDS/EventBridge) | < 1 s | NTH | T+0 |
| Stop pulling new work / deregister | 1–3 s | Drain handler | T+5 s |
| Connection draining (deregistration_delay) | 30–90 s | ALB | < T+120 s |
| In-flight requests complete | within drain window | App | < T+120 s |
| complete-lifecycle-action CONTINUE | 1–2 s | Drain handler | before T+120 s |
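The budget above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not part of any AWS tooling: the function name `drain_budget` and the default overheads (taken from the upper end of the table's typical durations) are assumptions for illustration.

```python
NOTICE_SECONDS = 120  # length of the Spot interruption notice


def drain_budget(deregistration_delay_s, signal_s=1, deregister_call_s=3,
                 complete_hook_s=2):
    """Return (total_seconds, slack_seconds) for one drain inside the notice.

    The overhead defaults are assumed worst-case values from the table above.
    """
    total = signal_s + deregister_call_s + deregistration_delay_s + complete_hook_s
    return total, NOTICE_SECONDS - total


for delay in (90, 300):
    total, slack = drain_budget(delay)
    verdict = "fits" if slack >= 0 else "does NOT fit"
    print(f"deregistration_delay={delay}s: total {total}s, slack {slack}s ({verdict})")
```

Under these assumed overheads a 90 s delay leaves 24 s of slack, while the 300 s default overshoots the notice by more than three minutes.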
Spot in containerized fleets
Containers make Spot dramatically safer: the scheduler reschedules a reclaimed task/pod onto surviving capacity in seconds, and you already have health checks and rolling deploys. The container layer changes who handles the drain:
| Platform | Who handles interruption | OD/Spot split mechanism | Drain primitive |
|---|---|---|---|
| ECS on EC2 | ECS + capacity providers | Two capacity providers with base + weight | Managed termination protection drains tasks |
| Fargate Spot | AWS-managed | FARGATE_SPOT capacity provider | 2-min SIGTERM then stop; your container drains |
| EKS (Karpenter) | Karpenter | karpenter.sh/capacity-type requirement | Cordon + drain + provision replacement |
| EKS (managed node groups + NTH) | NTH queue-processor | Separate Spot/OD node groups | NTH cordons/drains the node |
ECS capacity providers
For ECS on EC2, attach a capacity provider backed by a mixed-instances ASG and let ECS managed scaling drive it. Run two providers — Spot and On-Demand — and split via a strategy with a base (always On-Demand) and a weight ratio above it. This mirrors on_demand_base_capacity at the ECS layer:
aws ecs put-cluster-capacity-providers \
  --cluster prod \
  --capacity-providers cp-spot cp-ondemand \
  --default-capacity-provider-strategy \
    capacityProvider=cp-ondemand,base=2,weight=1 \
    capacityProvider=cp-spot,weight=4
Set managedTerminationProtection: ENABLED on the providers so ECS drains tasks off an instance before the ASG terminates it during scale-in. The capacity-provider strategy fields map cleanly onto the ASG distribution concepts:
| Strategy field | Meaning | ASG analogue | Example value |
|---|---|---|---|
| base | Minimum tasks always on this provider | on_demand_base_capacity | 2 (on cp-ondemand) |
| weight | Relative share of tasks above the base | inverse of %-above-base | 1 OD : 4 Spot = 20% OD |
| managedScaling | ECS drives ASG capacity to fit tasks | (ECS-managed) | ENABLED |
| managedTerminationProtection | Drain tasks before scale-in termination | lifecycle hook | ENABLED |
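To see how base and weight interact, the split can be worked through by hand. The sketch below is an illustrative model only: `split_tasks` is a hypothetical helper, it returns fractional shares, and it does not model how ECS rounds task counts that do not divide evenly.

```python
def split_tasks(total_tasks, strategy):
    """Model a capacity-provider strategy: fill each base first, then split
    the remainder by weight. Returns {provider_name: share_of_tasks}."""
    shares = {}
    remaining = total_tasks
    for provider in strategy:
        base = min(provider.get("base", 0), remaining)
        shares[provider["name"]] = float(base)
        remaining -= base
    total_weight = sum(provider["weight"] for provider in strategy)
    for provider in strategy:
        shares[provider["name"]] += remaining * provider["weight"] / total_weight
    return shares


# Same values as the put-cluster-capacity-providers example above.
strategy = [
    {"name": "cp-ondemand", "base": 2, "weight": 1},
    {"name": "cp-spot", "weight": 4},
]
print(split_tasks(12, strategy))
```

With 12 tasks, two land on cp-ondemand as the base and the remaining ten split 1:4, so the On-Demand share of the whole service is higher than the 20% the weights alone suggest.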
For Fargate, the equivalent is FARGATE_SPOT — same strategy syntax, no instances to manage, ~70% off Fargate On-Demand, and the same two-minute SIGTERM-then-drain contract for your container.
Karpenter consolidation and disruption controls
On EKS, Karpenter handles Spot natively and is best-in-class. You request spot (and optionally on-demand) in the NodePool requirements; Karpenter uses price-capacity-optimized internally and bin-packs aggressively. The controls that matter for stability are the disruption settings — how aggressively it consolidates and replaces nodes, which is where teams accidentally cause churn:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # nodeClassRef is required in a real v1 NodePool; omitted here for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # spot preferred; on-demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%" # cap voluntary disruption to 10% of nodes at once
Karpenter also subscribes to interruption events via an SQS queue, cordons and drains the doomed node, then provisions a replacement — the same proactive-replacement idea as Capacity Rebalance, at the node level. The disruption controls and what each protects against:
| Control | What it does | Default-ish value | Protects against |
|---|---|---|---|
| consolidationPolicy | When Karpenter consolidates nodes | WhenEmptyOrUnderutilized | Wasted spend on idle nodes |
| consolidateAfter | Idle period before consolidating | 1m–15m | Thrashing on brief dips |
| disruption.budgets | Cap on voluntary disruption at once | 10% | Consolidation stampeding workloads |
| karpenter.sh/do-not-disrupt (pod) | Exempt a pod from voluntary disruption | none | Long jobs killed by consolidation |
| Pod Disruption Budget (PDB) | Minimum available replicas during drains | per workload | Voluntary drains breaching availability |
The budgets block is the seatbelt: it caps how many nodes Karpenter voluntarily disrupts at once so consolidation never stampedes your workloads. Protect anything that cannot tolerate sudden node loss with karpenter.sh/do-not-disrupt: "true" on the pod, and use Pod Disruption Budgets so voluntary drains respect minimum availability.
Attribute-based instance selection (ABIS)
Hand-maintaining a list of fifteen instance types rots: a new generation ships (m7i) and your overrides are stale. Attribute-based instance selection (ABIS) flips it — you describe the requirements (vCPU range, memory range, exclusions) and EC2 expands them into every matching current and future type. New generations are picked up automatically, which future-proofs the policy and maximizes pool count.
mixed_instances_policy {
  instances_distribution {
    on_demand_base_capacity                  = 2
    on_demand_percentage_above_base_capacity = 0 # 100% Spot above base
    spot_allocation_strategy                 = "price-capacity-optimized"
  }
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.web.id
      version            = "$Latest"
    }
    override {
      instance_requirements {
        # HCL single-line blocks take one argument, so min/max go on separate lines.
        vcpu_count {
          min = 4
          max = 16
        }
        memory_mib {
          min = 8192
          max = 65536 # absolute bounds; add memory_gib_per_vcpu for a strict ratio
        }
        cpu_manufacturers     = ["intel", "amd"]
        burstable_performance = "excluded" # no t-family for steady prod load
        instance_generations  = ["current"]
        # accelerator_types, local_storage, network bandwidth, etc. all expressible
      }
    }
  }
}
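Memory bounds do real work here, with one caveat. memory_mib caps absolute memory, so it keeps out very small and very large instances, but it only loosely constrains memory per vCPU: a 4-vCPU c-type with 8 GiB and an 8-vCPU r-type with 64 GiB both pass the bounds above. If your app needs m-family balance, add memory_gib_per_vcpu bounds (around 4 GiB per vCPU) so the selector cannot grab a c-family (low memory-per-vCPU) or r-family (high) type. That is the diversification trap, solved declaratively. The attributes you will reach for most, and what each prevents: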
| Attribute | Purpose | Example | Prevents |
|---|---|---|---|
| vcpu_count (min/max) | Bound instance size | 4–16 | Tiny or oversized instances skewing LB load |
| memory_mib (min/max) | Bound absolute memory (pair with memory_gib_per_vcpu to bound the mem:vCPU ratio) | 8192–65536 | Grabbing starved c or bloated r types |
| cpu_manufacturers | Limit to Intel/AMD/AWS (Graviton) | ["intel","amd"] | Accidentally pulling an unsupported arch |
| burstable_performance | Include/exclude t-family | excluded | Credit-throttled CPUs under steady load |
| instance_generations | Current vs previous gen | ["current"] | Old, less efficient hardware |
| accelerator_types | Require/exclude GPUs/inference | excluded | Paying for accelerators you don’t use |
| local_storage / local_storage_types | Require NVMe instance store | excluded | Mismatched storage assumptions |
| allowed_instance_types / excluded_instance_types | Allow/deny by pattern | ["m*"] | Whole families you don’t want |
Preview exactly which types a requirement set resolves to before shipping it — this is the single most important ABIS habit:
aws ec2 get-instance-types-from-instance-requirements \
  --architecture-types x86_64 \
  --virtualization-types hvm \
  --instance-requirements '{
    "VCpuCount":{"Min":4,"Max":16},
    "MemoryMiB":{"Min":8192,"Max":65536},
    "BurstablePerformance":"excluded",
    "InstanceGenerations":["current"]
  }' \
  --query 'InstanceTypes[].InstanceType' --output text
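Once the preview returns a list, it is worth checking the memory-per-vCPU spread of what matched, because absolute vCPU and memory bounds do not by themselves pin the ratio. The sketch below assumes a small hand-typed sample of published instance sizes (`SPECS`) and hypothetical helper names; it is not a substitute for running the preview command against your real requirements.

```python
# Sample of (vCPU, memory GiB) per type; a hand-typed illustration, not an API result.
SPECS = {
    "c6i.xlarge": (4, 8),
    "m6i.xlarge": (4, 16),
    "r6i.xlarge": (4, 32),
    "c6i.4xlarge": (16, 32),
    "m6i.4xlarge": (16, 64),
    "r6i.2xlarge": (8, 64),
    "r6i.4xlarge": (16, 128),
}


def matches(vcpu, mem_gib, vcpu_range=(4, 16), mem_mib_range=(8192, 65536)):
    """Apply the same absolute vCPU and memory bounds as the preview above."""
    mem_mib = mem_gib * 1024
    return (vcpu_range[0] <= vcpu <= vcpu_range[1]
            and mem_mib_range[0] <= mem_mib <= mem_mib_range[1])


def ratios(specs):
    """GiB per vCPU for every type that passes the absolute bounds."""
    return {name: mem / vcpu for name, (vcpu, mem) in specs.items()
            if matches(vcpu, mem)}


selected = ratios(SPECS)
# Emulate an additional per-vCPU bound of roughly 4 GiB per vCPU.
balanced = sorted(name for name, ratio in selected.items() if 3.5 <= ratio <= 4.5)
print(selected)
print(balanced)
```

In this sample the absolute bounds still admit types at 2, 4, and 8 GiB per vCPU; only the extra ratio filter narrows the set to the balanced m-family shapes, which is the effect a memory_gib_per_vcpu bound is meant to give.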
ABIS vs a hand-maintained type list, so you choose deliberately:
| Dimension | Hand-maintained override list | ABIS (instance_requirements) |
|---|---|---|
| New generations | Manual edit when m7i ships | Picked up automatically |
| Pool count | Whatever you typed (often too few) | Every matching type → many pools |
| Precision | Exact, but rots | Declarative bounds; preview before ship |
| Memory:vCPU safety | You must curate | Enforced by memory_mib and memory_gib_per_vcpu bounds |
| Best for | A short, deliberate preference list | Maximizing diversity and future-proofing |
Observability and cost
You cannot manage Spot you cannot see. Three things to instrument — the interruption signal, realized savings, and fallback behaviour:
| Signal | Source | What it tells you | The number you act on |
|---|---|---|---|
| Interruption rate per pool | EventBridge EC2 Spot Instance Interruption Warning | Which (type, AZ) pools are churning | Rising rate in 1–2 pools → widen list / drop pools |
| Rebalance frequency | EventBridge EC2 Instance Rebalance Recommendation | Elevated-risk early warning volume | High volume → diversify more |
| Realized savings | Cost and Usage Report (CUR) | Spot effective price per line item | Realized Spot cost vs OD cost of same usage |
| Purchase-option mix | Cost Explorer (group by Purchase Option) | OD / Spot / Reserved split | Drift from your intended split |
| Fallback to On-Demand | ASG activity / instance purchase type | Spot drought filling as OD | Sustained OD fill → capacity problem |
Interruption signal. There is no clean CloudWatch counter for “this instance was interrupted,” so capture it from EventBridge. The interruption event is your source of truth for interruption rate per pool — the number you tune against:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}
Fan that rule out to a Lambda or Firehose that records instance-type, availability-zone, and timestamp. A rising interruption rate concentrated in one or two pools is the signal to widen the type list or drop the bad pools.
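A minimal sketch of the per-pool aggregation follows. It assumes each record has already been enriched with instance type and AZ (the raw interruption event identifies the instance by ID, so the recorder has to look those up); the record field names and the 50% threshold are hypothetical choices for illustration.

```python
from collections import Counter


def interruptions_per_pool(records):
    """Count interruptions per (instance_type, availability_zone) pool."""
    return Counter((r["instance_type"], r["availability_zone"]) for r in records)


def hot_pools(records, share_threshold=0.5):
    """Pools that account for at least share_threshold of all interruptions."""
    counts = interruptions_per_pool(records)
    total = sum(counts.values())
    return [pool for pool, n in counts.most_common()
            if total and n / total >= share_threshold]


sample = [
    {"instance_type": "c6i.4xlarge", "availability_zone": "us-east-1a"},
    {"instance_type": "c6i.4xlarge", "availability_zone": "us-east-1a"},
    {"instance_type": "c6i.4xlarge", "availability_zone": "us-east-1a"},
    {"instance_type": "c5.4xlarge", "availability_zone": "us-east-1b"},
]
print(hot_pools(sample))
```

A pool that keeps showing up in `hot_pools` over a rolling window is a candidate to drop from the type list or to dilute with more pools.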
Savings tracking. Cost Explorer and the Cost and Usage Report (CUR) carry the Spot effective price per line item. The honest savings number is realized Spot cost vs the On-Demand cost of the same usage — query CUR rather than trusting the “up to 90%” headline. In Cost Explorer, group by Purchase Option to see the On-Demand / Spot / Reserved split at a glance.
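The realized-savings figure is a simple ratio once the usage is exported. The sketch below assumes you have already reduced the CUR rows to tuples of usage hours, the Spot rate actually paid, and the On-Demand rate for the same type; the tuple layout and the sample rates are made up for illustration.

```python
def realized_savings(line_items):
    """line_items: list of (usage_hours, spot_rate, on_demand_rate) tuples.

    Returns the fraction saved versus running the same usage On-Demand.
    """
    spot_cost = sum(hours * spot for hours, spot, _ in line_items)
    od_cost = sum(hours * od for hours, _, od in line_items)
    if od_cost <= 0:
        raise ValueError("On-Demand equivalent cost must be positive")
    return 1 - spot_cost / od_cost


# Hypothetical usage: two line items with different realized discounts.
items = [(100, 0.25, 1.0), (50, 0.5, 1.0)]
print(f"realized savings: {realized_savings(items):.1%}")
```

Weighting by usage hours matters: a deep discount on a lightly used type does little for the blended number.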
Fallback-to-On-Demand. A mixed instances policy already degrades gracefully — if Spot is unavailable, the Spot portion launches as On-Demand. To bias that fill during a drought, set on_demand_allocation_strategy = "prioritized" and keep a sane base. The base plus this fallback is what lets you say “Spot saves us 75% and we never drop below floor capacity” and mean it.
Architecture at a glance
The diagram traces the Spot fleet as it actually behaves, left to right, and maps each failure class onto the exact hop where it bites. Start at PURCHASE POLICY: a mixed instances policy declares an On-Demand base (the floor that survives a drought) plus a wide type list or ABIS requirement set — this is where pool count is born. That feeds ALLOCATION, where price-capacity-optimized chooses which of the available capacity pools ((type, AZ) combinations, ≥ 10 of them) to launch into; the red lowest-price node is the anti-pattern that chases the cheapest pool and concentrates the fleet. The chosen instances land in the RUNNING FLEET — roughly 80% Spot and 20% On-Demand spread across three AZs, sitting behind an ALB target group whose deregistration_delay must be under 120 seconds or the drain never finishes.
When AWS needs the capacity back, the INTERRUPTION SIGNAL zone fires: EventBridge delivers the rebalance recommendation and the two-minute notice, and Capacity Rebalance launches a replacement before the hard notice. That triggers the DRAIN MACHINERY: NTH or a lifecycle hook holds the instance in Terminating:Wait while it deregisters and bleeds off connections, or — for queue work — an SQS visibility timeout simply redelivers the message to another worker after the doomed one dies. The five numbered badges mark the failure points: choosing lowest-price (1), too few pools (2), an undersized On-Demand base (3), a too-long deregistration delay (4), and no idempotent drain path (5). Read the legend as symptom · how to confirm · fix — that is the whole operating model on one canvas.
Real-world scenario
Streamforge Media ran a stateless transcoding fleet — pull a job from SQS, transcode a video segment, write to S3, ack — on a single On-Demand ASG of c6i.4xlarge, burning roughly ₹40 lakh/month (about $48k) at peak across ~120 instances. Pure Spot was the obvious win, and a junior engineer shipped the “obvious” version first: flip the group to 100% Spot on lowest-price across just c6i.4xlarge and c5.4xlarge in two AZs. It looked fine for a week.
Then a regional capacity crunch hit on a Saturday during a customer’s big content drop. The fleet drew from only four pools (two types × two AZs), and all of them were reclaimed within minutes. About 60% of workers died at once, SQS backed up for an hour, and in-flight segments had to be retried because workers were killed mid-transcode with no drain. The customer’s content was late. The post-mortem was not fun.
The constraint that shaped the fix: jobs took up to 8 minutes, and a worker killed mid-job wasted that work — there was no checkpointing, and adding it was out of scope. They needed Spot economics without ever losing a large fraction of workers at once, and in-flight jobs had to either finish or hand back cleanly.
The fix had three parts. First, real diversification: ABIS bounded to 12–24 vCPU compute-optimized types across all three AZs — roughly 30 pools instead of 4. Second, the right allocation: price-capacity-optimized with capacity_rebalance = true, placing workers in deep pools and proactively replacing any that got a rebalance recommendation. Third — the piece that actually saved the jobs — they moved the acknowledgement to the end of processing and used the SQS visibility timeout as the drain mechanism: on the interruption notice a worker stops pulling new jobs and finishes its current segment; if it dies first, the message reappears after the visibility timeout and another worker picks it up. No lifecycle-hook gymnastics, no checkpointing.
instances_distribution {
  on_demand_base_capacity                  = 2 # tiny floor; queue tolerates depth
  on_demand_percentage_above_base_capacity = 0 # 100% Spot above base
  spot_allocation_strategy                 = "price-capacity-optimized"
}
# + capacity_rebalance = true on the ASG
# + SQS VisibilityTimeout = 600 (> max 8-min job), ack only after the S3 write
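The redelivery behaviour is easy to reason about with a toy model. The sketch below is an in-memory stand-in, not the SQS API: `FakeQueue` and its methods are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class FakeQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> (body, invisible_until)

    def send(self, msg_id, body):
        self._messages[msg_id] = (body, 0)

    def receive(self, now):
        """Hand out the first visible message and hide it for the timeout."""
        for msg_id, (body, invisible_until) in self._messages.items():
            if now >= invisible_until:
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        """The ack: only called after the work has succeeded."""
        self._messages.pop(msg_id, None)


queue = FakeQueue(visibility_timeout=600)
queue.send("seg-1", "transcode segment 1")

first = queue.receive(now=0)       # worker A takes the job, then is reclaimed
hidden = queue.receive(now=300)    # still invisible: nobody else can take it
second = queue.receive(now=600)    # timeout elapsed: redelivered to worker B
queue.delete("seg-1")              # worker B finishes and acks
after_ack = queue.receive(now=2000)
print(first, hidden, second, after_ack)
```

Because worker A never acked, the job came back on its own; the only requirement on the worker is that processing the same segment twice is harmless.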
The change as a before/after, because the contrast is the lesson:
| Dimension | Before (naive Spot) | After (diversified) |
|---|---|---|
| Allocation strategy | lowest-price | price-capacity-optimized |
| Type list | 2 types | ABIS, 12–24 vCPU compute-optimized |
| AZs | 2 | 3 |
| Capacity pools | ~4 | ~30 |
| Capacity Rebalance | off | on |
| Drain for in-flight work | none (killed mid-job) | SQS visibility-timeout redelivery |
| Blast radius of one event | ~60% of fleet | a handful of workers |
| Monthly compute | ~₹40 lakh (all OD) | ~₹9 lakh (~78% off) |
Result: ~78% compute cost reduction (about ₹40 lakh down to roughly ₹9 lakh / $11k), and a single capacity event now trims a handful of workers instead of 60% of the fleet — the queue absorbs the blip and reclaimed jobs are redelivered. The lesson on the wall: “For queue-driven work, the cheapest and most reliable ‘drain’ is idempotency plus a visibility timeout, not bespoke lifecycle handling.”
Advantages and disadvantages
Spot at scale is a genuine trade-off, not a free lunch — it trades a small, manageable operational burden for a very large discount. Weigh it honestly:
| Advantages (why Spot wins) | Disadvantages (why it bites) |
|---|---|
| 70–90% off On-Demand for identical hardware — the largest single lever on an EC2 bill | AWS can reclaim with a 2-minute notice; you must design for it, not wish it away |
| Diversification + price-capacity-optimized make interruptions statistically rare | Naive config (lowest-price, 2 AZs, no base) concentrates risk and causes mass reclaims |
| One mixed-instances policy blends OD floor + Spot bulk with no extra moving parts | More settings to get right; the failure only shows under a real capacity event |
| Graceful On-Demand fallback means you never drop below floor capacity during a drought | OD fallback costs more during a drought — your bill is variable, not flat |
| Capacity Rebalance + NTH/Karpenter turn a reclaim into a proactive, drained replacement | Requires a working, idempotent drain path; bolted-on later it’s painful |
| Containers/queues make Spot nearly transparent (reschedule/redeliver in seconds) | Long, non-idempotent, un-checkpointed jobs are a poor fit and lose work on reclaim |
| Per-pool interruption signal lets you tune continuously | No clean CloudWatch counter; you must instrument EventBridge yourself |
Spot is the right default for stateless web/API tiers behind a load balancer, queue-driven and batch workers, CI fleets, and Kubernetes/ECS data planes — anywhere a reclaimed unit of work is cheap to redo. It is the wrong default for stateful singletons (a primary database), long un-checkpointed jobs where a reclaim wastes hours, and licence-bound workloads pinned to one instance type (which collapses your pool count). The disadvantages are all manageable — but only if you know they exist, which is the point of this article.
Hands-on lab
Stand up a diversified Spot ASG behind an ALB, confirm the purchase-option mix, force a real interruption with AWS Fault Injection Service (FIS), and watch the drain — then tear it all down. Free-tier-friendly in the sense that you run it for an hour on small instances and delete everything. Run in a region with ≥ 3 AZs (e.g. ap-south-1). Assumes a VPC with private subnets and a launch template already exist (from the Auto Scaling deep-dive lab).
Step 1 — Variables.
export AWS_REGION=ap-south-1
ASG=lab-spot-web
LT_ID=lt-0abc123def456 # your existing launch template
SUBNETS="subnet-aaa,subnet-bbb,subnet-ccc" # one per AZ
Step 2 — Create the mixed-instances ASG with a diversified policy.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name $ASG \
  --min-size 4 --max-size 20 --desired-capacity 6 \
  --vpc-zone-identifier "$SUBNETS" \
  --capacity-rebalance \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {"LaunchTemplateId":"'$LT_ID'","Version":"$Latest"},
      "Overrides": [
        {"InstanceType":"m6i.large","WeightedCapacity":"2"},
        {"InstanceType":"m6a.large","WeightedCapacity":"2"},
        {"InstanceType":"m5.large","WeightedCapacity":"2"},
        {"InstanceType":"m6i.xlarge","WeightedCapacity":"4"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 20,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }'
Expected: no error; the ASG begins launching instances across your AZs.
Step 3 — Add the terminating lifecycle hook (the drain window).
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name $ASG \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 120 --default-result CONTINUE
Step 4 — Confirm the purchase-option mix. This is the proof the OD base and Spot split landed:
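# The ASG API does not report purchase option; EC2's InstanceLifecycle does
# ("spot" for Spot, empty for On-Demand).
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=$ASG" "Name=instance-state-name,Values=pending,running" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType,InstanceLifecycle,Placement.AvailabilityZone]' \
  --output table
# Expect a mix of instance types across AZs; ~2 units On-Demand, the rest Spot.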
Confirm Capacity Rebalance is on:
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names $ASG \
--query 'AutoScalingGroups[0].CapacityRebalance' # => true
Step 5 — Wire an EventBridge rule to capture interruptions.
aws events put-rule --name lab-spot-interruptions \
--event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'
# Then put-targets to a Lambda/CloudWatch Logs group to record type + AZ + time.
Step 6 — Force a real interruption with FIS and watch the drain. FIS fires a genuine two-minute notice so you can validate the whole path end-to-end:
# Template uses aws:ec2:send-spot-instance-interruptions to fire a real 2-min notice.
aws fis start-experiment --experiment-template-id EXTxxxxxxxx
# Watch: the targeted instance gets a notice, enters Terminating:Wait,
# deregisters from the target group, then terminates; a replacement launches.
Validation checklist. You created a diversified mixed-instances ASG, confirmed a real OD/Spot split across AZs, attached a drain hook sized to the notice, captured the interruption signal, and triggered a real reclaim to watch the drain run. No production traffic was harmed. The steps mapped to what each proves:
| Step | What you did | What it proves |
|---|---|---|
| 2 | Diversified mixed-instances ASG | Pool diversity + price-capacity-optimized are one API call |
| 3 | Terminating lifecycle hook | One drain path covers scale-in and interruption |
| 4 | Inspect instances | The OD base + Spot split actually landed |
| 5 | EventBridge rule | The interruption signal is captured for per-pool tuning |
| 6 | FIS interruption | The drain completes inside the 2-minute notice |
Teardown (avoid lingering instance charges).
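# If you attached targets in Step 5, remove them first (delete-rule fails while targets exist):
# aws events remove-targets --rule lab-spot-interruptions --ids <your-target-id>
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name $ASG --force-delete
aws events delete-rule --name lab-spot-interruptions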
Cost note. Six small Spot instances for an hour is a few rupees; force-deleting the ASG terminates everything immediately. FIS charges per action-minute — negligible for a single experiment.
Common mistakes & troubleshooting
This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest expanded with the full confirm-command detail.
| # | Symptom | Root cause | Confirm (exact cmd / console path) | Fix |
|---|---|---|---|---|
| 1 | A capacity event reclaims a huge slice of the fleet at once | lowest-price + too few pools concentrates the fleet | EventBridge interruption events spike in 1–2 pools; describe-auto-scaling-groups shows few distinct types | price-capacity-optimized; widen the type list / AZs |
| 2 | Chronic high interruption rate even when diversified | < 10 pools; capacity-optimized has nowhere to go | aws ec2 get-instance-types-from-instance-requirements --query 'length(InstanceTypes)' < 10 | Add families/sizes, span 3 AZs, or switch to ABIS |
| 3 | Fleet drops below survivable capacity during a Spot drought | on_demand_base_capacity = 0 (no floor) | aws ec2 describe-instances for the group shows InstanceLifecycle = spot on every instance (no On-Demand) | Set base to one AZ’s worth; on_demand_allocation_strategy = prioritized |
| 4 | Instances reclaimed mid-drain; in-flight requests dropped | deregistration_delay >= 120s (ALB default 300) | describe-target-group-attributes shows deregistration_delay.timeout_seconds = 300 | Set it to 90; keep hook heartbeat ≤ 120 |
| 5 | Reclaim kills a worker mid-job; the job is lost | No drain path / non-idempotent ack | No EC2_INSTANCE_TERMINATING hook; queue ack happens before processing | NTH/hook; for queues ack after success + visibility-timeout redelivery |
| 6 | Per-instance load skewed behind the LB | Weighted capacities not proportional to real vCPU | Compare WeightedCapacity in the policy to instance vCPU | Weight large=2, xlarge=4, 2xlarge=8 |
| 7 | Scale-out is slow / “InsufficientInstanceCapacity” | Too few pools, or a Spot quota cap | ASG activity history shows capacity failures; check Spot vCPU quota | Diversify; raise the All Standard Spot quota |
| 8 | OOM/throttling after the group grabbed the “wrong” type | Mixed memory:vCPU ratios (c/r pulled in) | Instance types in the group span families with different ratios | Bound memory_mib in ABIS; diversify within a profile |
| 9 | spot_instance_pools seems to do nothing | It’s ignored by capacity-aware strategies | Strategy is price-capacity-optimized but spot_instance_pools is set | Remove it; it only applies to lowest-price |
| 10 | ECS scale-in kills instances with running tasks | Managed termination protection off | Capacity provider managedTerminationProtection ≠ ENABLED | Enable it so ECS drains tasks first |
| 11 | Karpenter churns nodes / stampedes workloads | No disruption budget or PDBs | NodePool has no disruption.budgets; workloads lack PDBs | Set budgets: 10%; add PDBs; do-not-disrupt on long jobs |
| 12 | “Spot saves 90%” but the bill barely moved | Trusting the headline, not realized savings | CUR/Cost Explorer by Purchase Option shows little Spot usage | Increase Spot %, fix fallback-to-OD drought, track CUR |
| 13 | New m7i/c7i generation never used | Hand-maintained override list is stale | Policy lists only older generations | Switch to ABIS with instance_generations = ["current"] |
| 14 | Spot instances never launch; all fill as On-Demand | Sustained Spot unavailability or a too-low spot_max_price | ASG instances all OnDemand; spot_max_price set low | Clear spot_max_price (= OD cap); widen pools; check quota |
| 15 | FIS drill doesn’t drain; instance just disappears | Lifecycle hook missing or NTH not running on the instance | describe-lifecycle-hooks empty; NTH service not active in user data | Add the EC2_INSTANCE_TERMINATING hook; install/start NTH at boot |
The expanded form for the entries that bite hardest:
1. A capacity event reclaims a huge slice of the fleet at once.
Root cause: lowest-price parks every Spot instance in the same one or two cheapest (type, AZ) pools, so when that pool is reclaimed, most of your fleet goes with it.
Confirm: Your EventBridge interruption rule shows a burst of EC2 Spot Instance Interruption Warning events all carrying the same instance-type + availability-zone; describe-auto-scaling-groups shows few distinct types running.
Fix: Switch spot_allocation_strategy to price-capacity-optimized and widen the type list across families/sizes and all AZs so placement weights spare capacity instead of chasing price.
2. Chronic high interruption rate even when you think you diversified.
Root cause: You have fewer than ~10 pools — four types in one AZ, or two types in three AZs — so capacity-optimized allocation has nothing to optimize.
Confirm: aws ec2 get-instance-types-from-instance-requirements ... --query 'length(InstanceTypes)' returns a small number, or your override list × AZ count is < 10.
Fix: Add AMD/Intel variants and adjacent sizes, span all three AZs, or move to ABIS with sane vCPU/memory bounds (often ~30 pools).
3. Fleet drops below survivable capacity during a Spot drought.
Root cause: on_demand_base_capacity = 0, so when Spot is broadly unavailable there is no guaranteed floor and capacity can fall to zero.
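Confirm: aws ec2 describe-instances --filters "Name=tag:aws:autoscaling:groupName,Values=<asg>" --query 'Reservations[].Instances[].InstanceLifecycle' during a drought returns spot for every instance. The ASG API does not expose purchase option, and On-Demand instances carry no InstanceLifecycle value, so an all-spot result means no On-Demand floor.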
Fix: Set on_demand_base_capacity to one AZ’s worth of capacity units, and on_demand_allocation_strategy = "prioritized" so the floor fills predictably.
4. Instances reclaimed mid-drain; in-flight requests dropped.
Root cause: The target group’s deregistration_delay.timeout_seconds is the ALB default 300 s — longer than the entire two-minute notice — so connection draining never finishes before the instance is gone.
Confirm: aws elbv2 describe-target-group-attributes --target-group-arn <arn> shows deregistration_delay.timeout_seconds = 300.
Fix: Set it to 90 s (< 120), and keep the lifecycle-hook heartbeat-timeout at or under 120 s.
5. A reclaim kills a worker mid-job and the job is lost.
Root cause: There is no drain path (no EC2_INSTANCE_TERMINATING hook, no NTH), or — for queue work — the worker acks the message before processing, so a reclaim loses the in-flight job with no redelivery.
Confirm: aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name <asg> is empty; or a code review shows DeleteMessage before the work completes.
Fix: Run NTH (or Karpenter on EKS) for VM/K8s fleets. For queue-driven work, ack only after success and set the SQS visibility timeout greater than the max job duration so a reclaimed message is redelivered to another worker.
Best practices
- Diversify to ≥ 10 pools before tuning anything else. Four-plus types across three AZs, or ABIS. Pool count is the single biggest determinant of your interruption rate.
- Use price-capacity-optimized for production. It is strictly better than lowest-price and better than pure capacity-optimized for most workloads. Reserve capacity-optimized-prioritized for a genuine Savings-Plan preference.
- Keep an On-Demand base sized to “survivable” capacity. One AZ’s worth is a good default for customer-facing tiers; the base is your blast-radius floor during a total Spot drought.
- Make weights proportional to real vCPU. Use large=2, xlarge=4, 2xlarge=8; otherwise per-instance load behind the load balancer skews and your smallest instances melt.
- Turn on Capacity Rebalance and pair it with a terminating hook. Proactive replacement plus a drain window beats racing the 120-second clock.
- Set deregistration_delay < 120s on Spot target groups. The default 300 is longer than the entire warning; 90 is a safe value.
- Make the drain idempotent and shared between scale-in and interruption. One EC2_INSTANCE_TERMINATING hook covers both; for queue work, prefer visibility-timeout redelivery over bespoke lifecycle logic.
- Prefer ABIS over a hand-maintained list. It picks up new generations automatically and maximizes pool count; always preview the resolved types before shipping.
- In containers, split via capacity providers / NodePool requirements. ECS: separate Spot/On-Demand providers with a base and managed termination protection. EKS: Karpenter with spot preferred, on-demand fallback.
- Set Karpenter disruption budgets and PDBs. Cap voluntary disruption (e.g. 10% of nodes) and protect intolerant pods with do-not-disrupt.
- Instrument the interruption signal from day one. An EventBridge rule feeding a per-pool interruption-rate metric is what you tune against; there is no built-in counter.
- Track realized savings from CUR, not the headline. Group Cost Explorer by Purchase Option and compare realized Spot cost to the On-Demand cost of the same usage.
Security notes
Spot instances are ordinary EC2 instances — the security posture is the same as any fleet, with a few Spot-specific angles around the drain automation’s permissions and the interruption signal:
- Least-privilege the drain automation. The NTH/instance role needs exactly elasticloadbalancing:DeregisterTargets, elasticloadbalancing:DescribeTargetHealth, and autoscaling:CompleteLifecycleAction, not a broad autoscaling:* or elasticloadbalancing:*. Scope by resource ARN where possible.
- Enforce IMDSv2 on the launch template. The interruption notice is read from IMDS; require token-authenticated IMDSv2 (http-tokens = required, hop limit 1) so a compromised process or SSRF can’t trivially read instance credentials or metadata.
- Don’t put secrets in user data. Diversified fleets relaunch instances constantly; pull secrets at boot from Secrets Manager/Parameter Store via the instance role, never bake them into the launch template’s user data.
- Lock down the EventBridge → Lambda/SQS path. The interruption-handling Lambda or queue-processor should run with a minimal role. An attacker who can publish fake interruption events shouldn’t be able to drive mass drains, so restrict who can PutEvents and validate event source.
- Use a dedicated instance profile per fleet. Don’t share one over-broad role across Spot and On-Demand fleets; scope each to what that workload actually needs so a reclaimed-and-relaunched instance never carries more privilege than required.
- Encrypt EBS and instance store. Reclaimed instances are wiped, but enable EBS encryption (and instance-store encryption where supported) so data at rest on the volume is never exposed.
The Spot-specific security controls and what each prevents:
| Control | Mechanism | Prevents |
|---|---|---|
| Least-privilege drain role | Scoped IAM (DeregisterTargets, CompleteLifecycleAction) | A compromised instance pinning/terminating the fleet |
| IMDSv2 required | Launch template http-tokens = required, hop limit 1 | SSRF/credential theft via the metadata endpoint |
| Secrets at boot, not in user data | Secrets Manager / Parameter Store + instance role | Plaintext secrets in a frequently-relaunched template |
| Restricted EventBridge target role | Minimal Lambda/queue-processor permissions | Forged interruption events driving mass drains |
| Per-fleet instance profile | Distinct scoped roles | Privilege creep across mixed Spot/OD fleets |
| EBS/instance-store encryption | KMS-backed volume encryption | Data-at-rest exposure on a reclaimed volume |
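The least-privilege drain role described above can be sketched as a small policy builder. This is a minimal illustration, not the article's own policy: the function names and the example ARNs are hypothetical, and leaving DescribeTargetHealth on `"*"` reflects my understanding that ELB Describe actions do not accept resource-level scoping, which is worth checking against the Service Authorization Reference before relying on it.

```python
import json


def build_drain_policy(target_group_arn: str, asg_arn: str) -> dict:
    """Return an IAM policy document with only the three drain actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DeregisterFromTargetGroup",
                "Effect": "Allow",
                "Action": "elasticloadbalancing:DeregisterTargets",
                "Resource": target_group_arn,
            },
            {
                # Assumption: Describe* calls cannot be scoped to one ARN.
                "Sid": "ReadTargetHealth",
                "Effect": "Allow",
                "Action": "elasticloadbalancing:DescribeTargetHealth",
                "Resource": "*",
            },
            {
                "Sid": "CompleteDrainHook",
                "Effect": "Allow",
                "Action": "autoscaling:CompleteLifecycleAction",
                "Resource": asg_arn,
            },
        ],
    }


def wildcard_actions(policy: dict) -> list:
    """List every action containing a wildcard, e.g. 'autoscaling:*'."""
    found = []
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        found.extend(action for action in actions if "*" in action)
    return found


if __name__ == "__main__":
    # Hypothetical ARNs, for illustration only.
    policy = build_drain_policy(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/0123456789abcdef",
        "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:*:autoScalingGroupName/web-spot",
    )
    print(json.dumps(policy, indent=2))
```

A check like `wildcard_actions` is cheap to run in CI against every drain-role policy, so a later "just add autoscaling:*" change fails review instead of shipping.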
Cost & sizing
Spot is a cost strategy, so “sizing” here means sizing the savings against the risk. The bill drivers:
- The Spot discount itself dominates the upside — 70–90% off On-Demand for the same hardware. Realized savings depend on how much of the fleet runs Spot vs the On-Demand base and the drought-driven fallback.
- The On-Demand base is the cost of safety. A larger base means a steadier fleet during a drought but a smaller discount; a tiny base maximizes savings but leans entirely on Spot availability above the floor. Size it to “survivable,” not “comfortable.”
- Fallback-to-On-Demand makes the bill variable. During a broad Spot drought the Spot portion fills as On-Demand at full price — your bill is not flat, and you should budget for the occasional drought premium rather than the best-case discount.
- Diversification is free and reduces cost variance. More pools means more time in cheap, deep pools and fewer expensive fallbacks — pool count improves both resilience and realized savings.
A rough monthly picture for a ~6-unit steady fleet (numbers illustrative, region-dependent):
| Configuration | OD : Spot mix | Rough monthly (₹) | vs all-On-Demand | Risk profile |
|---|---|---|---|---|
| All On-Demand | 100 : 0 | ~₹40,000 | baseline | No reclaim risk; no savings |
| Spot, naive (lowest-price, 2 AZ) | 0 : 100 | ~₹6,000 | ~85% off | Mass reclaim risk — not production |
| Spot, diversified, no base | 0 : 100 | ~₹7,000 | ~82% off | Cheapest safe; floor only via fallback |
| Spot, diversified, 20% OD base | ~20 : 80 | ~₹12,000 | ~70% off | Production default; survivable floor |
| Spot, diversified, 30% OD | ~30 : 70 | ~₹15,000 | ~62% off | Reclaim-sensitive tiers |
What each cost lever buys you:
| Lever | Cost effect | What it buys | Watch-out |
|---|---|---|---|
| Larger OD base | +cost | Survives a longer/total Spot drought | Diminishing savings past “survivable” |
| Higher Spot % above base | −cost | More of the discount | More exposure to reclaim waves |
| More pools (diversify/ABIS) | ~free | Lower variance, more time in cheap pools | Needs memory bounds to stay correct |
| price-capacity-optimized | ~free | Cheap Spot at low churn | Slightly pricier than lowest-price by design |
| Graviton (arm64) pools | −cost | Cheaper, deep pools | Needs a multi-arch build |
There is no separate free tier for Spot — the savings are the discount. The honest way to report them is realized Spot cost vs the On-Demand cost of the same usage, queried from the Cost and Usage Report, not the “up to 90%” headline.
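The arithmetic behind the bill drivers in this section can be sketched as a blended-cost function. It is a rough model under stated assumptions, not a billing formula: a single flat Spot discount across the fleet, and a `drought_fraction` parameter (hypothetical, introduced here) for the share of Spot-hours that fall back to On-Demand at full price. With a flat 85% discount it gives about ₹12,800 for a 20% On-Demand base on a ₹40,000 baseline, near the table's illustrative ~₹12,000 but not identical to it.

```python
def blended_monthly_cost(
    od_baseline: float,
    od_fraction: float,
    spot_discount: float,
    drought_fraction: float = 0.0,
) -> float:
    """Estimate the monthly bill for a mixed On-Demand/Spot fleet.

    od_baseline:      cost of running the whole fleet On-Demand.
    od_fraction:      share of the fleet that is On-Demand by design (0..1).
    spot_discount:    assumed flat Spot discount, e.g. 0.85 for 85% off.
    drought_fraction: share of Spot-hours that run as On-Demand fallback.
    """
    for name, value in (
        ("od_fraction", od_fraction),
        ("spot_discount", spot_discount),
        ("drought_fraction", drought_fraction),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    spot_share = 1.0 - od_fraction
    od_cost = od_baseline * od_fraction
    # Spot-hours that could not be filled are billed at the full price.
    fallback_cost = od_baseline * spot_share * drought_fraction
    spot_cost = od_baseline * spot_share * (1.0 - drought_fraction) * (1.0 - spot_discount)
    return od_cost + fallback_cost + spot_cost


def realized_savings(od_baseline: float, actual_cost: float) -> float:
    """Savings as a fraction of the all-On-Demand cost of the same usage."""
    return 1.0 - actual_cost / od_baseline


if __name__ == "__main__":
    baseline = 40_000  # illustrative all-On-Demand monthly cost
    for drought in (0.0, 0.10):
        cost = blended_monthly_cost(baseline, 0.20, 0.85, drought)
        print(f"drought={drought:.0%}: cost={cost:,.0f}, savings={realized_savings(baseline, cost):.0%}")
```

Running the same configuration with a 10% drought fraction shows why the bill should be budgeted as a range: the 20% base case moves from roughly 68% savings to roughly 61%.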
Interview & exam questions
1. What is an EC2 Spot capacity pool, and why does it drive diversification? A capacity pool is one (instance type, Availability Zone) combination in a Region; Spot prices and availability are set per pool, and AWS reclaims capacity within a pool. If your whole fleet lives in one pool, a single reclaim takes it all; spread across many pools, a reclaim trims a small fraction. So the core strategy is to draw from as many pools as possible.
2. Compare lowest-price, capacity-optimized, and price-capacity-optimized. Which is the production default? lowest-price picks the cheapest pools and has the highest interruption rate. capacity-optimized picks the deepest-capacity pools (lowest interruptions) but ignores price. price-capacity-optimized balances low price with deep capacity and is the recommended default for almost everything — cheaper than pure capacity-optimized for most workloads, and far more stable than lowest-price.
3. What do on_demand_base_capacity and on_demand_percentage_above_base_capacity do? The base is an absolute count (in capacity units if you use weights) of On-Demand instances the group always maintains — your floor that survives a total Spot drought. The percentage governs, of everything launched above the base, what fraction is On-Demand vs Spot. Together they carve the fleet into a guaranteed floor and a Spot-heavy remainder.
4. Difference between a rebalance recommendation and a Spot interruption notice? The rebalance recommendation is an early, best-effort advisory that an instance is at elevated risk — it can arrive minutes ahead and may not be followed by an interruption. The interruption notice is the hard ~2-minute warning that the instance will be reclaimed. Capacity Rebalance acts on the former to replace proactively; you drain on the latter.
5. Why must deregistration_delay be under 120 seconds for a Spot fleet? Once the interruption notice fires you have ~120 seconds before the instance is gone. The ALB default deregistration delay is 300 seconds — longer than the whole warning — so connection draining never completes and in-flight requests are dropped. Setting it to ~90 seconds lets the drain finish inside the notice.
6. How does Capacity Rebalance change interruption handling? With capacity_rebalance = true, the ASG launches a replacement instance when it receives a rebalance recommendation — before the hard two-minute notice — so you’re not racing the 120-second clock to find capacity. Paired with a terminating lifecycle hook, the old instance drains gracefully while the replacement warms.
7. What is attribute-based instance selection (ABIS) and why use it? Instead of hand-listing instance types, you declare requirements (vCPU range, memory range, exclusions like burstable) and EC2 expands them into every matching current and future type. It future-proofs the policy (new generations like m7i are picked up automatically) and maximizes pool count, while memory bounds keep the selector from grabbing starved or bloated families.
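8. How do you run Spot safely on ECS? Create two capacity providers, one On-Demand and one Spot, each backed by its own ASG (an ASG can be associated with only one capacity provider; the Spot one uses a diversified mixed-instances policy), and split via a default strategy with a base (always On-Demand) and a weight ratio. Enable managedTerminationProtection so scale-in never terminates an instance that is still running tasks, and managed draining so ECS drains tasks off an instance before the ASG terminates it. For serverless, FARGATE_SPOT gives the same split with no instances to manage.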
9. What’s the cheapest reliable “drain” for queue-driven Spot workers? Idempotency plus the SQS visibility timeout. Acknowledge a message only after the work succeeds, and set the visibility timeout greater than the maximum job duration. On a reclaim, the worker stops pulling new messages and finishes the current one; if it dies first, the message reappears after the timeout and another worker picks it up — no lifecycle-hook gymnastics needed.
10. Why does naive Spot pass in testing but fail catastrophically in production? Under low test load and normal conditions, even a poorly diversified fleet (e.g. lowest-price, two pools) runs fine. The failure only manifests during a real regional capacity crunch, when the one or two pools you concentrated into are reclaimed at once — taking a large fraction of the fleet with no warning-driven drain. The fix is diversification + price-capacity-optimized + a drain path, validated with FIS.
11. How do you measure your real Spot savings? Not from the “up to 90%” headline. Query the Cost and Usage Report (CUR) for the Spot effective price per line item and compare realized Spot cost to the On-Demand cost of the same usage; in Cost Explorer, group by Purchase Option to see the On-Demand / Spot / Reserved split and watch for drift.
12. On EKS with Karpenter, what controls prevent Spot-driven churn? The disruption settings: consolidationPolicy and consolidateAfter govern how aggressively Karpenter consolidates, and disruption.budgets cap how many nodes it voluntarily disrupts at once (e.g. 10%). Add Pod Disruption Budgets and karpenter.sh/do-not-disrupt on intolerant pods so consolidation never breaches availability.
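The base-plus-percentage split from question 3 can be made concrete with a few lines of arithmetic. This is a sketch of the idea rather than the ASG's exact algorithm: the function name is hypothetical, and rounding the On-Demand share up when the percentage does not divide evenly is an assumption made here to stay on the conservative side.

```python
def split_capacity(desired: int, base: int, pct_above_base: int) -> tuple:
    """Split desired capacity into (on_demand, spot) units.

    base:           on_demand_base_capacity, filled first with On-Demand.
    pct_above_base: on_demand_percentage_above_base_capacity (0..100).
    """
    if desired < 0 or base < 0:
        raise ValueError("desired and base must be non-negative")
    if not 0 <= pct_above_base <= 100:
        raise ValueError("pct_above_base must be between 0 and 100")

    od_base = min(desired, base)      # the floor is satisfied first
    above = desired - od_base         # everything launched above the floor
    # Ceiling division; rounding direction is an assumption of this sketch.
    od_above = -(-above * pct_above_base // 100)
    on_demand = od_base + od_above
    return on_demand, desired - on_demand


if __name__ == "__main__":
    for desired, base, pct in [(20, 10, 50), (6, 2, 0), (5, 10, 0)]:
        on_demand, spot = split_capacity(desired, base, pct)
        print(f"desired={desired} base={base} pct={pct}: OD={on_demand} Spot={spot}")
```

The third case shows the floor doing its job: when desired capacity is at or below the base, the group is entirely On-Demand and a Spot drought cannot touch it.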
These map to the AWS Certified Solutions Architect – Associate (SAA-C03) — design cost-optimized and resilient architectures — and AWS Certified Solutions Architect – Professional (SAP-C02) and DevOps Engineer – Professional (DOP-C02) for the deeper Auto Scaling, ECS/EKS, and FIS content. A compact cert mapping:
| Question theme | Primary cert | Exam objective area |
|---|---|---|
| Pools, allocation strategies, OD base | SAA-C03 | Design cost-optimized & resilient architectures |
| Capacity Rebalance, lifecycle drain, FIS | DOP-C02 | Resilient cloud solutions; fault injection |
| ECS capacity providers, Karpenter | SAP-C02 / DOP-C02 | Continuous delivery; container platforms |
| ABIS, weighted capacity | SAA-C03 / SAP-C02 | Cost optimization; compute selection |
| CUR / Cost Explorer savings tracking | SAA-C03 | Cost management & FinOps |
Quick check
- You flip a production ASG to 100% Spot on lowest-price across two instance types in two AZs. What is the specific risk, and what’s the one allocation-strategy change that most reduces it?
- Your ALB target group has the default deregistration_delay. Why does this silently break your Spot drain, and what value do you set?
- True or false: scaling out to more On-Demand instances is the right way to survive a total Spot drought.
- A queue-driven Spot worker loses jobs when it’s reclaimed mid-processing. Without adding checkpointing, how do you make the work survive?
- You hand-maintain a list of eight instance types and a new m7i generation ships. What feature avoids your list going stale, and what must you always do before shipping it?
Answers
- With only ~4 (type, AZ) pools and lowest-price, the whole fleet concentrates into the one or two cheapest pools; a regional capacity crunch reclaims a large fraction at once. The highest-leverage change is spot_allocation_strategy = "price-capacity-optimized" (and then widening the type list and AZs to ≥ 10 pools).
- The ALB default is 300 seconds, longer than the entire two-minute interruption notice, so connection draining never finishes and in-flight requests drop. Set deregistration_delay to 90 seconds (< 120).
- False. Scaling out adds more Spot capacity that the same drought can’t fill; the floor that survives a drought is the On-Demand base (on_demand_base_capacity), sized to one AZ’s worth, with on_demand_allocation_strategy = "prioritized".
- Move the acknowledgement to after the work succeeds and set the SQS visibility timeout greater than the maximum job duration. On a reclaim the message reappears after the timeout and another worker reprocesses it — idempotency plus visibility timeout is the drain.
- Attribute-based instance selection (ABIS) with instance_generations = ["current"] picks up m7i automatically. Always preview the resolved types first with aws ec2 get-instance-types-from-instance-requirements so you know exactly which pools the requirement set expands to.
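The pool arithmetic in the first answer can be checked with a short sketch. It assumes every instance type is offered in every listed AZ (not always true in practice) and that a capacity event reclaims exactly one pool; the instance types, AZ names, and function names are illustrative only.

```python
from itertools import product


def capacity_pools(instance_types, azs):
    """Enumerate (instance type, AZ) pools, assuming full availability."""
    return list(product(instance_types, azs))


def worst_single_pool_loss(instances_per_pool: dict) -> float:
    """Fraction of the fleet lost if the most-loaded pool is reclaimed."""
    total = sum(instances_per_pool.values())
    if total == 0:
        return 0.0
    return max(instances_per_pool.values()) / total


if __name__ == "__main__":
    # Naive: two types in two AZs gives four pools, and lowest-price
    # concentrates the fleet into the cheapest one or two of them.
    naive_pools = capacity_pools(["m5.large", "m5a.large"], ["az-a", "az-b"])
    naive = {pool: 0 for pool in naive_pools}
    naive[naive_pools[0]] = 6
    naive[naive_pools[1]] = 6

    # Diversified: four types in three AZs gives twelve pools, spread evenly.
    wide_pools = capacity_pools(
        ["m5.large", "m5a.large", "m6i.large", "m6a.large"],
        ["az-a", "az-b", "az-c"],
    )
    wide = {pool: 1 for pool in wide_pools}

    print(len(naive_pools), worst_single_pool_loss(naive))
    print(len(wide_pools), round(worst_single_pool_loss(wide), 3))
```

Both fleets run twelve instances, but the worst single-pool loss drops from half the fleet to one instance in twelve once the fleet is spread across the wider pool set.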
Glossary
- Spot Instance — spare EC2 capacity sold at a steep discount (70–90% off On-Demand) that AWS can reclaim with a ~2-minute notice.
- Capacity pool — one (instance type, Availability Zone) combination in a Region; Spot price and availability are set per pool, and reclaims happen per pool.
- Mixed instances policy — an ASG configuration that draws from many instance types and blends On-Demand with Spot in one group.
- Allocation strategy — the algorithm choosing which pools Spot launches into: lowest-price, capacity-optimized, capacity-optimized-prioritized, or price-capacity-optimized.
- price-capacity-optimized — the recommended default strategy; balances low price with deep capacity to minimize interruptions without ignoring cost.
- On-Demand base capacity — an absolute count (in capacity units) of On-Demand instances the group always maintains; the floor that survives a total Spot drought.
- Weighted capacity — a per-instance-type “units” value so the ASG reasons in vCPU (or similar) rather than instance count.
- Rebalance recommendation — an early, best-effort advisory that an instance is at elevated interruption risk; acted on when capacity_rebalance = true.
- Spot interruption notice — the hard ~2-minute warning that an instance will be reclaimed, delivered via IMDS (/spot/instance-action) and EventBridge.
- Capacity Rebalance — an ASG setting that launches a replacement on a rebalance recommendation, before the hard notice.
- Lifecycle hook — an ASG mechanism that pauses an instance in Terminating:Wait to run a drain before termination; one hook covers scale-in and interruption.
- AWS Node Termination Handler (NTH) — open-source software that watches IMDS/EventBridge for Spot signals and drives a graceful drain (IMDS mode for VMs, queue-processor mode for EKS).
- deregistration_delay — the ALB/NLB target-group connection-draining timeout; must be < 120s for Spot (the default 300 is too long).
- Attribute-based instance selection (ABIS) — declaring instance requirements (vCPU/memory ranges, exclusions) so EC2 expands them into every matching current and future type.
- Capacity provider (ECS) — an ECS construct backed by an ASG (or Fargate) that, with a base + weight strategy, splits tasks across On-Demand and Spot.
- FARGATE_SPOT — the serverless Spot capacity provider; ~70% off Fargate On-Demand with a 2-minute SIGTERM-then-drain contract.
- Karpenter — an EKS node-provisioning controller that handles Spot natively, bin-packs aggressively, and exposes disruption budgets and consolidation controls.
- Disruption budget (Karpenter) — a cap on how many nodes Karpenter voluntarily disrupts at once, preventing consolidation stampedes.
- Cost and Usage Report (CUR) — the granular AWS billing export carrying the Spot effective price per line item; the honest source for realized savings.
- AWS Fault Injection Service (FIS) — a managed chaos-engineering service that can fire a real Spot interruption (aws:ec2:send-spot-instance-interruptions) to validate your drain.
Next steps
You can now put interruption-tolerant production workloads on Spot, minimize the interruption rate, keep a survivable floor, and drain cleanly. Build outward:
- Next: Advanced EC2 Auto Scaling: Warm Pools, Lifecycle Hooks, and Zero-Downtime Instance Refresh — the lifecycle and refresh mechanics this article’s drain path builds on.
- Related: EC2 Auto Scaling, In Depth: Launch Templates, ASGs, Scaling Policies & Lifecycle Hooks — the foundations under every mixed-instances policy.
- Related: AWS Elastic Load Balancing, In Depth: ALB, NLB, GWLB & Target Groups — target-group draining is half of safe Spot.
- Related: Production Amazon ECS on Fargate: Task Networking, Auto Scaling, and Safe Rolling Deployments — run Spot under ECS with capacity providers and FARGATE_SPOT.
- Related: Migrating to Graviton: arm64 Builds, Multi-Arch Pipelines, and Performance Benchmarking — add deep, cheap arm64 Spot pools to your diversification.
- Related: Resilient Messaging with SQS and SNS: Fan-Out, FIFO Ordering, DLQs, and Poison-Message Handling — the visibility-timeout redelivery that makes queue-driven Spot nearly free to drain.
- Related: FinOps Showback and Chargeback Platform on AWS — track realized Spot savings from the CUR across teams.