
AWS Cloud Adoption Framework: Operations Perspective — Observability, AIOps Event Management, Incident/Problem, Change/Release/Config, Performance/Capacity, Availability/Continuity, and Patch Management

Where this fits

The AWS Cloud Adoption Framework organizes cloud transformation into six perspectives — Business, People, Governance, Platform, Security, and Operations — and the Operations perspective is the one that ensures cloud services are delivered at a level that meets the needs of your business. Where Platform builds the landing zone and the workloads, Operations is what keeps them running, observable, and recoverable in production; its stakeholders are the infrastructure and operations leaders, site-reliability engineers, IT service managers, and the platform/SRE teams who carry the pager. This is part 7 of the series, and it goes deep on seven of the Operations capabilities — observability, event management (AIOps), incident and problem management, change/release/configuration management, performance and capacity management, availability and continuity, and patch management — the disciplines that turn “we deployed it” into “we operate it to an SLO.” Operations is also where the abstract promise of operational resilience from the Business perspective becomes a measured, defended number, and where the Well-Architected Operational Excellence and Reliability pillars become day-2 practice rather than a design review checkbox.

[Figure: AWS Cloud Adoption Framework, animated overview]

Observability

What it is. Observability is the capability of gaining visibility into the state and behaviour of your workloads — not just whether a host is up, but whether the user-facing service is healthy and why it is or is not. AWS frames it around the three classic telemetry signals — metrics, logs, and traces — plus the curation of dashboards, service-level objectives (SLOs), and synthetic checks that turn raw signal into operational truth. The distinction from plain monitoring matters: monitoring answers known questions (“is CPU above 80%?”); observability lets you ask new questions of a system you did not anticipate (“why are only checkout requests from the EU edge slow, and only since the 14:05 deploy?”).

Why it matters. Every other capability in this perspective depends on it. You cannot manage an incident you cannot see, set a capacity threshold you cannot measure, or prove an availability SLO you do not instrument. Observability is also the single biggest determinant of MTTD (mean time to detect) and a major lever on MTTR (mean time to resolve) — and in a distributed, microservice, event-driven AWS estate, the failure modes are emergent and partial, so host-up/host-down monitoring is actively misleading. The goal is to detect degradation from the customer’s viewpoint before a customer reports it.

How to do it well. Standardize on structured, correlated telemetry and adopt OpenTelemetry so instrumentation is vendor-portable.
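As a rough illustration of structured, correlated telemetry, the sketch below emits one JSON object per log line and stamps every line with the same correlation ID. It uses only the Python standard library; the `correlation_id` context variable, the `fields` key, and the sample service and region values are assumptions made for this example, not part of any AWS or OpenTelemetry API.

```python
import contextvars
import json
import logging
import uuid

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id = contextvars.ContextVar("correlation_id", default="unset")


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Extra key/value pairs passed as extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint one."""
    cid = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("checkout started",
                    extra={"fields": {"service": "checkout", "region": "eu-west-1"}})
        logger.info("checkout finished",
                    extra={"fields": {"service": "checkout", "latency_ms": 182}})
    finally:
        correlation_id.reset(token)
    return cid
```

Because every line carries the same `correlation_id`, a log query can pull the full path of one request across services, which is what makes a question like "why are only EU checkout requests slow" answerable.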

Artifacts, decisions, and AWS tooling.

| Telemetry signal | What you produce | Primary AWS service |
| --- | --- | --- |
| Metrics | Golden-signal + business metrics, anomaly bands | Amazon CloudWatch, CloudWatch EMF, AMP/Prometheus |
| Logs | Structured JSON logs with correlation IDs, Logs Insights queries | CloudWatch Logs, OpenSearch Service |
| Traces | Service map, p99 segment latency | AWS X-Ray, AWS Distro for OpenTelemetry (ADOT) |
| SLOs / SLIs | Per-service SLO objects, error budgets, burn-rate alarms | CloudWatch Application Signals |
| Synthetic / RUM | Canaries on critical journeys, real-user metrics | CloudWatch Synthetics, CloudWatch RUM |
| Dashboards | Service health + roll-up dashboards | CloudWatch Dashboards, Amazon Managed Grafana |

The decisions to make explicit: your telemetry standard (OTel + structured logs + mandatory correlation ID), your retention/cost tiers (hot in CloudWatch, warm in OpenSearch, cold in S3 — observability cost is real and runs away silently), and what an SLO actually is for each tier-1 service, because that number anchors incident severity, change risk, and capacity planning downstream.
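To make "what an SLO actually is" concrete, here is a minimal sketch of the error-budget and burn-rate arithmetic behind a burn-rate alarm. The `Slo` class and the function names are hypothetical. The 14.4 threshold is a commonly cited fast-burn convention (about 2% of a 30-day budget consumed in one hour), used here as an assumption and not taken from this article or from CloudWatch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self):
        if not 0 < self.target < 1:
            raise ValueError("target must be strictly between 0 and 1")

    @property
    def error_budget(self) -> float:
        """Fraction of requests that are allowed to fail."""
        return 1.0 - self.target


def burn_rate(slo: Slo, total: int, failed: int) -> float:
    """Observed error rate divided by the budgeted error rate.

    1.0 means the budget is being spent exactly on schedule.
    """
    if total == 0:
        return 0.0
    return (failed / total) / slo.error_budget


Window = Tuple[int, int]  # (total requests, failed requests)


def should_page(slo: Slo, long_window: Window, short_window: Window,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn fast.

    The short window stops the alarm from firing long after recovery.
    """
    return (burn_rate(slo, *long_window) >= threshold
            and burn_rate(slo, *short_window) >= threshold)
```

For a 99.9% target, 2,000 failures in 100,000 requests is a burn rate of about 20, so the page fires only if the short window is also still burning above the threshold.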

Event management (AIOps)

What it is. Event management is detecting events, assessing their potential impact, and determining the appropriate control action — and at scale this becomes AIOps: using machine learning to reduce the operational noise that a large AWS estate generates so humans only see signal. An event is any observable change of state (an alarm firing, a config drift, a deployment, a quota breach); event management is the pipeline that ingests, deduplicates, correlates, enriches, prioritizes, and routes those events — and ideally auto-remediates the well-understood ones.

Why it matters. A mature account structure emits a torrent of events. Without correlation you drown: one root cause (a saturated NAT gateway) fans out into forty downstream CloudWatch alarms across ten services, and the on-call engineer pages on the symptoms, not the cause. AIOps attacks alert fatigue and shrinks MTTD/MTTR by collapsing that storm into a single, root-cause-tagged, actionable event. It is also the bridge between observability (which produces signal) and incident management (which responds to it).

How to do it well. Build an event-driven pipeline and apply ML where it earns its keep.
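The routing and enrichment stages of such a pipeline can be sketched in a few lines. The pattern below has the same shape as an EventBridge event pattern for CloudWatch alarm state changes, but `matches` is a deliberately simplified matcher written for this example (exact values only), not the EventBridge implementation. The alarm name, the `CATALOG` contents, and the runbook ID are hypothetical.

```python
from typing import Any, Dict, Optional

# Same shape as an EventBridge event pattern: leaf values are lists of accepted values.
ALARM_PATTERN: Dict[str, Any] = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Hypothetical ownership catalog used for enrichment.
CATALOG = {
    "nat-gw-port-exhaustion": {"owner": "network-sre", "runbook": "RB-NAT-001"},
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: every pattern key must exist and hold an accepted value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def enrich(event: Dict[str, Any], catalog: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Attach owner and runbook context so the responder does not have to look it up."""
    alarm = event["detail"]["alarmName"]
    context = catalog.get(alarm, {"owner": "unassigned", "runbook": None})
    return {**event, "context": context}


def route(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the enriched event if it should be forwarded, otherwise None."""
    return enrich(event, CATALOG) if matches(ALARM_PATTERN, event) else None
```

In a real pipeline the match would live in an EventBridge rule and the enrichment in an input transformer or a Lambda function; the sketch only shows the logic each stage carries.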

Artifacts, decisions, and AWS tooling.

| Stage | What it does | AWS service |
| --- | --- | --- |
| Event bus | Ingest and route all operational events | Amazon EventBridge |
| ML correlation | Root-cause insights, anomaly + related-event grouping | Amazon DevOps Guru, CloudWatch anomaly detection |
| Enrichment | Add change/owner/runbook context | EventBridge input transformer, Lambda |
| Auto-remediation | Deterministic fixes for known events | SSM Automation runbooks, AWS Config remediation, Lambda |
| Proactive | Maintenance, deprecation, limits, resilience | AWS Health, AWS Trusted Advisor |

Key decisions: which events are auto-remediated vs paged (start conservative, promote to auto only after the runbook is proven), your deduplication/correlation strategy (DevOps Guru insight as the primary alert unit, not the raw alarm), and a noise budget: track the share of alerts that lead to a real action, and treat a low share as a defect to fix, not background hum.
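The correlation and noise-budget ideas can be illustrated with a small, rule-based sketch. It is not how DevOps Guru works internally (that service uses ML); it only shows the effect being asked for: many alarms that share an upstream resource and fire close together collapse into one event. The `Alarm` type, the dependency map, and the 300-second window are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    name: str
    service: str
    fired_at: int  # epoch seconds


def correlate(alarms: List[Alarm], upstream: Dict[str, str],
              window_seconds: int = 300) -> List[dict]:
    """Group alarms whose services share an upstream resource.

    An alarm joins an existing group if it fires within window_seconds
    of that group's first alarm; otherwise it starts a new group.
    """
    groups: List[dict] = []
    open_groups: Dict[str, dict] = {}
    for alarm in sorted(alarms, key=lambda a: a.fired_at):
        # Services without a known upstream are grouped under their own name.
        resource = upstream.get(alarm.service, alarm.service)
        group = open_groups.get(resource)
        if group is None or alarm.fired_at - group["first_seen"] > window_seconds:
            group = {"suspected_cause": resource,
                     "first_seen": alarm.fired_at,
                     "alarms": []}
            open_groups[resource] = group
            groups.append(group)
        group["alarms"].append(alarm.name)
    return groups


def actionable_share(alerts_sent: int, alerts_acted_on: int) -> float:
    """Fraction of alerts that led to a real action; low values indicate noise."""
    return alerts_acted_on / alerts_sent if alerts_sent else 1.0
```

With forty alarms across ten services that all depend on one NAT gateway, `correlate` returns a single group tagged with that gateway, so the on-call engineer gets one page instead of forty.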

Incident and problem management

What it is. Two distinct disciplines AWS deliberately separates. Incident management restores service operation as quickly as possible (it is about recovery and is time-bounded by an SLA). Problem management identifies and addresses the root causes of incidents to prevent recurrence (it is about learning and is not time-bounded). One asks “how do we make it work again now?”; the other asks “why did it break, and how do we make sure it never breaks that way again?”

Why it matters. Conflating them is the classic operations failure: teams firefight the same incident monthly because the post-incident “problem” work never happens. Separating them lets you optimize incident response for speed (clear severities, defined on-call, a war room, a single incident commander) while running a parallel, blameless learning loop that permanently retires recurring failure classes — the only thing that bends the long-run incident curve downward.

How to do it well.

Artifacts, decisions, and AWS tooling.

| Artifact | What it captures | AWS service |
| --- | --- | --- |
| Severity matrix + SLAs | Sev levels, response targets, escalation, comms | Documented standard; tied to CloudWatch SLOs |
| Response plans | Auto-create incident, page, runbooks, chat | SSM Incident Manager + EventBridge + AWS Chatbot |
| On-call schedule | Rotations, contacts, escalation | Incident Manager engagement/escalation |
| Incident timeline | Auto-captured actions, decisions | Incident Manager |
| PIR / COE | Root cause, corrective actions, owners, dates | Documented; backlog-tracked |
| Runbook library | Versioned automation + human runbooks | SSM Automation / SSM Documents |

The decisions that matter: the severity definition (anchor it to customer/SLO impact, not internal opinion), who can declare a Sev-1 (anyone, no permission needed — false alarms are cheaper than delayed response), and a hard rule that problem-management corrective actions are sprint-committed work, not a wiki page nobody reads.
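A severity definition anchored to customer and SLO impact can be written down as a small decision function, which removes opinion from the declaration. Every threshold and response target below is a hypothetical placeholder; a real matrix would come from your own documented standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    tier: int                      # 1 = most critical service tier
    slo_breached: bool             # is the service currently outside its SLO?
    customers_affected_pct: float  # share of customers seeing the problem


# Hypothetical response targets per severity, in minutes.
RESPONSE_TARGET_MINUTES = {"SEV1": 5, "SEV2": 15, "SEV3": 60, "SEV4": 480}


def classify(impact: Impact) -> str:
    """Map measured impact to a severity level (illustrative thresholds)."""
    if impact.tier == 1 and impact.slo_breached and impact.customers_affected_pct >= 25:
        return "SEV1"
    if impact.slo_breached and impact.customers_affected_pct >= 5:
        return "SEV2"
    if impact.slo_breached or impact.customers_affected_pct > 0:
        return "SEV3"
    return "SEV4"


def response_target(impact: Impact) -> int:
    """Minutes within which a responder must acknowledge."""
    return RESPONSE_TARGET_MINUTES[classify(impact)]
```

The point of the sketch is that severity is computed from measured impact, so two people looking at the same incident reach the same level.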

Change, release, and configuration management

What it is. Three intertwined capabilities. Change management introduces, modifies, or removes anything that could affect production in a controlled way. Release management plans, schedules, and controls the build, test, and deployment of changes into production. Configuration management maintains an accurate record of the configuration of your resources and their relationships (your CMDB). Together they answer “what is allowed to change, how do we ship it safely, and do we know exactly what is running right now?”

Why it matters. In the cloud, the temptation is to swing between two failure modes: heavyweight ITIL change-advisory-board gates that throttle deployment frequency (and push people to make changes out-of-band), or a free-for-all where undocumented console changes cause drift, surprise outages, and an estate nobody can describe to an auditor. The mature answer is deployment safety through automation and progressive delivery — make the safe path the easy path — plus everything-as-code so the configuration record is the deployment mechanism, not a stale spreadsheet maintained by hand.
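Deployment safety through progressive delivery comes down to an automated promote-or-rollback decision. The sketch below compares a canary's error rate with the baseline fleet's. The function name and all thresholds are assumptions for illustration; in practice CodeDeploy makes this decision from CloudWatch alarms you attach to the deployment.

```python
def canary_verdict(baseline_errors: int, baseline_requests: int,
                   canary_errors: int, canary_requests: int,
                   max_ratio: float = 2.0, min_requests: int = 500,
                   error_floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for a canary deployment.

    max_ratio    how much worse than baseline the canary may be
    min_requests minimum canary traffic before any decision is made
    error_floor  tolerated error rate even when the baseline is near zero
    """
    if canary_requests < min_requests:
        return "wait"  # not enough traffic to judge
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, error_floor)
    return "rollback" if canary_rate > allowed else "promote"
```

The `error_floor` avoids rolling back on a single stray error when the baseline is almost error-free, and `min_requests` prevents a verdict on too little data.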

How to do it well.

Artifacts, decisions, and AWS tooling.

| Discipline | Artifact | AWS service |
| --- | --- | --- |
| Change | Change policy (standard/normal/emergency), approval gates | CodePipeline approvals, Service Catalog, SCPs |
| Release | Pipelines, canary/blue-green, auto-rollback | CodePipeline, CodeDeploy, CodeBuild |
| Configuration (IaC) | CloudFormation/CDK/Terraform repos, drift detection | CloudFormation, CDK, drift detection |
| Configuration record | Resource config + relationships, timeline, compliance | AWS Config, Config aggregator, conformance packs |
| In-guest inventory | OS/app/patch inventory across fleet | SSM Inventory |

DORA’s four metrics — deployment frequency, lead time for changes, change failure rate, time to restore — are the scoreboard here; AWS exposes the pipeline data to compute them. The decision to make: define your standard-change catalogue aggressively, because every change you can safely auto-approve is a change that ships faster and is fully recorded.
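The four DORA metrics are simple to compute once you have one record per deployment. The sketch below assumes a hypothetical `Deployment` record that you would populate from your pipeline's execution history; the field names are illustrative and are not a CodePipeline schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it degrade service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [(d.restored_at - d.deployed_at).total_seconds()
                for d in failures if d.restored_at is not None]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_minutes": median(restores) / 60 if restores else None,
    }
```

Medians are used instead of means so that one unusually slow change does not dominate the scoreboard.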

Performance and capacity management

What it is. Performance management ensures cloud services meet performance expectations (latency, throughput, the SLIs you committed to), while capacity management ensures sufficient capacity is available to meet demand, neither starving the service (breaching SLOs) nor over-provisioning (burning money). In the cloud the two are deeply linked because capacity is elastic and billed by usage (per second for many compute services), so the question is rarely “do we own enough servers?” and almost always “are we scaling the right resource on the right signal at the right cost?”

Why it matters. Get capacity wrong upward and you waste a fortune on idle reserved fleet; get it wrong downward and you breach your latency SLO during the exact peak (sale, launch, quarter-end) that matters most. Performance management is also where you catch the slow regression — the deploy that quietly added 40ms to p99 — before it compounds into an incident. And capacity in AWS has a non-obvious dimension: service quotas and account limits, which silently cap you long before your code does.
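Because quotas cap you silently, it helps to treat them like any other capacity metric and alert on utilization. The sketch below works on plain dictionaries; the quota labels and the 80% warning level are hypothetical, and in practice the limits would be read from Service Quotas and the usage from CloudWatch.

```python
from typing import Dict, List, Tuple


def quota_alerts(usage: Dict[str, float], quotas: Dict[str, float],
                 warn_at: float = 0.8) -> List[Tuple[str, float]]:
    """Return (quota name, utilization) pairs at or above warn_at, worst first."""
    alerts = []
    for name, limit in quotas.items():
        if limit <= 0:
            continue  # skip quotas without a meaningful limit
        utilization = usage.get(name, 0) / limit
        if utilization >= warn_at:
            alerts.append((name, round(utilization, 3)))
    return sorted(alerts, key=lambda item: item[1], reverse=True)
```

Run before a known peak, a check like this turns "the failover region was missing a quota" from an outage discovery into a ticket raised weeks earlier.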

How to do it well.

Artifacts, decisions, and AWS tooling.

| Concern | What you produce | AWS service |
| --- | --- | --- |
| Elastic scaling | Scaling policies (target-tracking/predictive) | EC2 Auto Scaling, Application Auto Scaling, Karpenter/KEDA |
| Right-sizing | Utilization-based resize recommendations | AWS Compute Optimizer, Cost Explorer |
| Load/perf testing | Pre-peak load test results, latency budgets | Distributed Load Testing, AWS FIS, X-Ray, App Signals |
| Quota/capacity | Quota dashboard, capacity reservations for peaks | Service Quotas, On-Demand Capacity Reservations, Capacity Blocks |
| Forecast | Capacity forecast tied to business calendar | CloudWatch metrics + business-event calendar |

The decisions worth pinning: the scaling signal per service (CPU is often the wrong one — scale on queue depth, concurrency, or p99 latency), the headroom target (how much spare capacity above forecast peak you carry), and where you deliberately trade cost for guaranteed capacity (Capacity Reservations) versus accept autoscaling risk.
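Scaling on queue depth instead of CPU, with an explicit headroom target, can be sketched as a sizing calculation. It follows the common "acceptable backlog per worker" approach: how many messages one worker can hold while still meeting the latency target. The function name, parameters, and default values are assumptions for illustration, not an Auto Scaling API.

```python
import math


def desired_workers(backlog: int, msgs_per_worker_per_sec: float,
                    max_latency_sec: float, headroom_pct: int = 20,
                    min_workers: int = 1, max_workers: int = 100) -> int:
    """Size a worker fleet from queue backlog and a latency target.

    headroom_pct is spare capacity carried on top of the computed need;
    max_workers stands in for a quota or budget ceiling.
    """
    backlog_per_worker = msgs_per_worker_per_sec * max_latency_sec
    if backlog_per_worker <= 0:
        raise ValueError("throughput and latency target must be positive")
    needed = backlog / backlog_per_worker
    padded = math.ceil(needed * (100 + headroom_pct) / 100)
    return max(min_workers, min(max_workers, padded))
```

For example, a backlog of 1,500 messages, 10 messages per second per worker, and a 10-second latency target needs 15 workers, or 18 with 20% headroom. If the result hits `max_workers`, that is the signal to revisit the quota or reserve capacity.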

Availability and continuity

What it is. Availability management ensures cloud services are available as needed (designing and operating to meet an availability/SLO target), and continuity management ensures business operations continue during and after a disruption — disaster recovery and business continuity. This is where you commit to numbers: an availability target (e.g., 99.95%), a Recovery Time Objective (RTO), and a Recovery Point Objective (RPO), and then architect and prove you meet them.
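An availability percentage is easier to reason about once it is converted into allowed downtime. The helper below is a minimal sketch of that arithmetic; the function name and the 30-day default period are assumptions.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100 - availability_pct) / 100 * period_days * 24 * 60
```

A 99.95% target over 30 days allows about 21.6 minutes of downtime, so a single 30-minute outage spends more than the whole month's budget.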

Why it matters. Availability and continuity are the operational expression of the Well-Architected Reliability pillar and the entire reason the Operations perspective exists — “delivered at a level that meets the needs of the business” is an availability commitment. The trap is treating DR as a binder that is never tested; an untested DR plan is a hypothesis, and you discover during the real outage that the runbook is stale, the backups don’t restore, or the failover region was missing a quota. Continuity is only real if it is rehearsed.

How to do it well.

| DR strategy | RTO / RPO | What runs in the recovery region | Typical use |
| --- | --- | --- | --- |
| Backup & restore | Hours / hours | Nothing; restore from backups | Tier-3 workloads, cost-sensitive |
| Pilot light | 10s of minutes / minutes | Core data replicated, servers off | Tier-2 workloads |
| Warm standby | Minutes / seconds | Scaled-down full stack, always on | Tier-1 business apps |
| Multi-site active/active | Near-zero / near-zero | Full stack live in 2+ regions | Tier-0, can’t-go-down |
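The four strategies above can be turned into a selection rule: pick the cheapest pattern whose worst-case recovery still meets the workload's targets. The numeric ceilings in this sketch are illustrative assumptions that put numbers on the qualitative ranges ("hours", "10s of minutes"); replace them with figures measured in your own DR tests.

```python
from typing import List, Tuple

# (pattern, assumed worst-case RTO in minutes, assumed worst-case RPO in minutes),
# ordered from cheapest to most expensive. The numbers are illustrative only.
DR_PATTERNS: List[Tuple[str, float, float]] = [
    ("backup-and-restore", 24 * 60, 24 * 60),
    ("pilot-light", 60, 15),
    ("warm-standby", 15, 1),
    ("multi-site-active-active", 1, 0),
]


def choose_pattern(target_rto_min: float, target_rpo_min: float) -> str:
    """Return the cheapest pattern whose assumed RTO and RPO meet the targets."""
    for name, rto, rpo in DR_PATTERNS:
        if rto <= target_rto_min and rpo <= target_rpo_min:
            return name
    # Nothing meets the targets on paper; fall back to the strongest pattern.
    return DR_PATTERNS[-1][0]
```

Ordering the list by cost is the design choice that matters: the loop never recommends active/active for a workload that a pilot light would satisfy.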

Artifacts, decisions, and AWS tooling.

| Artifact | What it captures | AWS service |
| --- | --- | --- |
| Availability/SLO targets | Per-tier availability %, error budget | CloudWatch Application Signals |
| RTO/RPO + DR strategy | Per-workload tier → DR pattern | AWS Resilience Hub (policy + score) |
| HA architecture | Multi-AZ, ELB, failover routing | Route 53 ARC, ELB, Auto Scaling |
| Backup policy | Schedules, retention, cross-region, immutability | AWS Backup, Backup Vault Lock |
| Replication | Cross-region data continuity | Aurora Global Database, DynamoDB global tables, S3 CRR |
| Resilience testing | Game-day results, chaos experiments, achieved RTO/RPO | AWS FIS, DR game days |

The core decisions: a workload-to-tier-to-DR-pattern map (not everything needs active/active — match the pattern to the business cost of downtime), the immutability/ransomware posture on backups (Vault Lock is non-negotiable for tier-1), and a mandatory DR test cadence with achieved-vs-target RTO/RPO as a reported KPI.
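Reporting achieved-vs-target RTO/RPO as a KPI needs little more than a record per game day. The `GameDayResult` type and the report fields below are hypothetical, shown only to make the KPI concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GameDayResult:
    workload: str
    target_rto_min: float
    achieved_rto_min: float
    target_rpo_min: float
    achieved_rpo_min: float

    @property
    def passed(self) -> bool:
        """A test passes only if both RTO and RPO targets were met."""
        return (self.achieved_rto_min <= self.target_rto_min
                and self.achieved_rpo_min <= self.target_rpo_min)


def dr_kpi(results: List[GameDayResult]) -> dict:
    """Summarize DR test results for a periodic report."""
    if not results:
        raise ValueError("no game-day results to report")
    passed = sum(1 for r in results if r.passed)
    worst = max(results, key=lambda r: r.achieved_rto_min - r.target_rto_min)
    return {
        "tested": len(results),
        "passed": passed,
        "pass_rate_pct": 100.0 * passed / len(results),
        "worst_rto_gap_workload": worst.workload,
    }
```

Surfacing the workload with the worst RTO gap gives the next game day an obvious target.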

Patch management

What it is. Patch management is distributing and applying software updates — OS and application patches, AMI refreshes, and runtime/dependency updates — to keep the estate secure, compliant, and stable. It sits in Operations (delivery and stability) but is joined at the hip with the Security perspective’s vulnerability management; an unpatched fleet is both an availability risk (known crash bugs) and the most common breach vector.

Why it matters. Patching is where good intentions go to die at scale: a handful of servers is trivial, but a few thousand instances across dozens of accounts, with maintenance windows, blast-radius concerns, and “we can’t reboot the payments host during business hours” constraints, is a genuine operational program. Drift here is silent and cumulative — six months of skipped patches is how a single CVE turns into a ransomware event. The mature posture is automated, reported, exception-managed patching with immutable replacement preferred over in-place patching wherever the architecture allows.

How to do it well.

Artifacts, decisions, and AWS tooling.

| Concern | What you produce | AWS service |
| --- | --- | --- |
| In-place patching | Baselines, patch groups, maintenance windows | SSM Patch Manager, State Manager, Fleet Manager |
| Immutable patching | Scheduled golden-AMI / image rebuild + roll-out | EC2 Image Builder + deployment pipeline |
| Vulnerability feed | Continuous CVE scan of EC2/ECR/Lambda | Amazon Inspector |
| Managed-service patching | Planned engine-version upgrade calendar | RDS/Aurora/EKS/ElastiCache version upgrades |
| Compliance & exceptions | % compliant, exception register with expiry, patch SLA | Patch Manager compliance, AWS Config, Security Hub |

The decisions: in-place vs immutable per workload class (immutable for everything autoscaled/containerized; in-place only for true pets), your patch SLA by severity (tie it to Inspector’s scoring and your Security perspective’s vuln-management policy), and the maintenance-window strategy that respects business-critical hours while still hitting the SLA.
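A patch SLA by severity, combined with an exception register that expires, can be checked mechanically. In this sketch the `Finding` record and the SLA days per severity are hypothetical placeholders; real findings would come from Amazon Inspector or Patch Manager compliance data.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical patch SLA: days allowed from first detection, per severity.
PATCH_SLA_DAYS = {"CRITICAL": 7, "HIGH": 14, "MEDIUM": 30, "LOW": 90}


@dataclass(frozen=True)
class Finding:
    instance_id: str
    severity: str
    first_seen: date
    exception_expires: Optional[date] = None  # approved exception, if any


def is_overdue(finding: Finding, today: date) -> bool:
    """True if the finding is past its SLA and not covered by a live exception."""
    if finding.exception_expires is not None and finding.exception_expires >= today:
        return False
    deadline = finding.first_seen + timedelta(days=PATCH_SLA_DAYS[finding.severity])
    return today > deadline


def compliance_pct(findings: List[Finding], today: date) -> float:
    """Share of open findings that are still within SLA or under exception."""
    if not findings:
        return 100.0
    within = sum(1 for f in findings if not is_overdue(f, today))
    return 100.0 * within / len(findings)
```

Because an exception carries an expiry date, it drops out of the compliant count automatically once it lapses, which keeps the register from becoming a permanent waiver list.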

Real-world enterprise scenario

Aurelius Health Systems is a fictional pan-India digital-health provider: ~₹9,400 crore revenue, a patient-facing app and clinician portal, a telemedicine platform, and a claims/billing back office, all on AWS across 42 accounts under a Control Tower landing zone, organized into a Patient Apps OU, a Clinical Platform OU, and a Corporate/Claims OU. The platform/SRE org runs roughly 2,600 EC2 instances, 180 ECS/EKS services, and a mix of Aurora, DynamoDB, and OpenSearch. The VP of Engineering sponsors an Operations-perspective uplift after a 3-hour telemedicine outage (a saturated NAT gateway that took 90 minutes just to diagnose) breached their patient-facing SLA.

Measurable outcome (9 months in): MTTD on the NAT-class failure dropped from 90 minutes to under 5 (single DevOps Guru insight + correlation IDs); telemedicine availability moved from 99.7% to 99.96%, inside SLO; change-failure-rate fell from 19% to 7% via blue/green + auto-rollback; a Q3 DR game day hit RTO 12 min against a 15-min target; compute spend dropped 14% from right-sizing; and patch compliance for critical CVEs reached 98% within the 7-day SLA. The 3-hour outage class has not recurred — the corrective action (NAT-gateway autoscaling + a DevOps Guru insight + a proven runbook) permanently retired it.

Deliverables & checklist

Common pitfalls

What’s next

This is the final perspective in the series — with Operations covered alongside Business, People, Governance, Platform, and Security, the next part returns to the Envision–Align–Launch–Scale loop to assemble these capabilities into a single, sequenced transformation roadmap and CAF Action Plan.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
