This is the final lesson in the Tier 5 wave, and it deliberately closes the loop. Across nine deep-dives we’ve built up an automation platform that is compliant (D1), disaster-resilient (D2), capable of bulk migrations (D3), able to operate in air-gapped enclaves (D4), integrated with complex stacks like SAP (D5), capable of fleet operation at edge scale (D6), governed by ITSM (D7), backed up by tested immutable backups (D8), and able to migrate databases without downtime (D9). All of that is meaningless without one final ingredient: the system has to know whether it is healthy.
A platform that runs correctly 99.97% of the time but cannot tell you which 0.03% failed is not a platform — it is a black box that occasionally surprises everyone. The thesis of this lesson is that automation observability is not a “nice to have” added later; it is the foundation that makes everything else trustworthy at scale. When your CHG-gated, evidence-bundled, ServiceNow-tracked, SLA-verified automation has 50,000 runs per quarter, the only way to know it is working is metrics, logs, and traces that aggregate into a single view answerable in seconds.
The four pillars of automation observability:
- Metrics — counters, gauges, histograms about playbook runs, AAP control plane, hosts, and ITSM/CHG flow
- Logs — structured stdout/stderr from every play, every task, indexed and queryable
- Traces — distributed traces through multi-step orchestrations (workflow → job → host → task)
- Events — discrete state-change events (CHG opened, job launched, EDA rule fired) correlated with the above
The toolchain we will assemble:
| Pillar | Tool | Why this choice |
|---|---|---|
| Metrics ingestion | Prometheus + AAP /api/v2/metrics/ | AAP exposes a Prometheus endpoint natively |
| Metrics storage | Mimir (or Cortex/Thanos) | Long-term, multi-tenant, queryable |
| Logs | Loki + promtail / Vector | Aligned with Grafana stack; cheap; label-based |
| Traces | Tempo + OpenTelemetry | The community.general collection ships an OpenTelemetry callback plugin for Ansible |
| Visualization | Grafana | Single pane of glass across metrics, logs, traces |
| Alerting | Alertmanager + Grafana Alerting | Routes to Slack/Teams/PagerDuty/ServiceNow |
| Closed loop | EDA rulebooks subscribed to alerts | Alerts auto-trigger remediation playbooks |
This is the canonical CNCF observability stack, with the caveat that you can substitute Datadog, New Relic, Splunk, or Elastic at the storage layer without changing the patterns in this lesson. The instrumentation contract (what to emit) is the durable part; the storage choice is replaceable.
1. The four golden signals, applied to automation
Google’s SRE book defines four golden signals for any service: latency, traffic, errors, saturation. Translated to an automation platform:
| Signal | What it means for AAP/Ansible | What you measure |
|---|---|---|
| Latency | How long playbooks take | p50/p95/p99 job duration; per-template, per-host |
| Traffic | How many jobs run | jobs/hour, jobs/template, jobs/inventory |
| Errors | How many fail | failure rate, error class breakdown, time-to-failure |
| Saturation | How busy the control plane is | execution-environment queue depth, capacity utilisation |
These four signals at the platform level give you the macro view. But automation has a fifth signal that conventional services don’t: convergence. Did the automation actually achieve its desired state, or did it merely “complete”?
A playbook that “succeeds” but fails to change anything because it was misconfigured is a successful run that produced a wrong outcome. Convergence means measuring not just job.status == 'successful' but job.changed_count > 0 AND desired_state == observed_state after the run. We’ll wire this in.
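As a minimal sketch of that convergence check, assuming the job status, changed count, and a post-run state comparison are already available as plain values (the JobOutcome shape and the category names are hypothetical, not an AAP API):

```python
from dataclasses import dataclass


@dataclass
class JobOutcome:
    status: str           # job status as reported by the controller, e.g. "successful"
    changed_count: int    # number of task results that reported "changed"
    desired_state: dict   # what the run was meant to achieve (hypothetical shape)
    observed_state: dict  # what a post-run verification actually found


def classify_convergence(outcome: JobOutcome) -> str:
    """Return 'failed', 'drifted', 'no_op', or 'converged' for one run."""
    if outcome.status != "successful":
        return "failed"
    if outcome.desired_state != outcome.observed_state:
        # The job "completed" but the target is not in the desired state.
        return "drifted"
    if outcome.changed_count == 0:
        # State matches but nothing changed: fine for an idempotent re-run,
        # suspicious for a run that was supposed to deploy something.
        return "no_op"
    return "converged"


if __name__ == "__main__":
    run = JobOutcome("successful", 0, {"nginx": "1.24"}, {"nginx": "1.22"})
    print(classify_convergence(run))
```

Separating "no_op" from "converged" lets a dashboard distinguish harmless idempotent re-runs from deployments that silently did nothing.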
2. The Ansible callback plugin: the foundation of instrumentation
Every metric, log, and trace in this lesson originates from the same place: a callback plugin that fires on every Ansible event. The community.general collection ships an OpenTelemetry callback plugin (profile_tasks comes from ansible.posix):
# ansible.cfg or AAP execution environment env
[defaults]
callbacks_enabled = community.general.opentelemetry, ansible.posix.profile_tasks
[callback_opentelemetry]
otel_service_name = ansible-aap
# Names an environment variable that must be set to "true" for the plugin to activate
# (ANSIBLE_OTEL_ENABLED is a placeholder name)
enable_from_environment = ANSIBLE_OTEL_ENABLED
hide_task_arguments = true
# Environment variables (set in execution environment)
ANSIBLE_OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.kv.local:4318
OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
OTEL_SERVICE_NAME=ansible-aap
When this is enabled, every Ansible run produces a complete OpenTelemetry trace with this hierarchy:
playbook span (root, named after the playbook)
└── play span (one per play)
    └── task span (one per task per host)
        ├── attributes: ansible.task.module=template, host=foo, status=ok/changed/failed
        └── events: stderr lines as span events
An nginx_install.yml playbook with 12 tasks running across 8 hosts will produce roughly 1 + 1 + 12*8 = 98 spans, all linked. In Tempo this is queryable as “show me all task failures for nginx_install in the last 7 days,” and you get back exact module name, host, task, exception text, and parent context.
The tradeoff is volume. A workflow that orchestrates 200 playbooks across 5,000 hosts produces hundreds of thousands of spans per run. Sample aggressively in production:
# OTel collector config
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces-always
        type: latency
        latency: { threshold_ms: 30000 }
      - name: sample-others
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
This keeps every error trace and every slow trace, and samples 5% of the remaining traces. When errors and slow runs are rare, trace storage drops by close to 95% with little loss of debugging value.
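To reason about how much such a policy set actually retains, the decision can be sketched in a few lines of Python. This is an illustration of the "keep if any policy matches" logic only, not the collector's implementation; the function names and the exact boundary handling at the latency threshold are assumptions.

```python
import random


def keep_trace(has_error: bool, duration_ms: float, rng: random.Random,
               threshold_ms: float = 30000.0,
               sampling_percentage: float = 5.0) -> bool:
    """Mimic the three policies: a trace is kept if any one of them matches."""
    if has_error:                      # errors-always
        return True
    if duration_ms > threshold_ms:     # slow-traces-always (boundary is an assumption)
        return True
    # sample-others: keep roughly sampling_percentage percent of what is left
    return rng.random() * 100.0 < sampling_percentage


def expected_kept_fraction(error_share: float, slow_share: float,
                           sampling_percentage: float = 5.0) -> float:
    """Expected share of traces kept, assuming error and slow traces do not overlap."""
    always_kept = error_share + slow_share
    return always_kept + (1.0 - always_kept) * sampling_percentage / 100.0


if __name__ == "__main__":
    # Hypothetical mix: 2% of traces contain an error, 1% are slow.
    print(f"kept: {expected_kept_fraction(0.02, 0.01):.2%}")
```

The second function makes the storage estimate explicit: the saving approaches 95% only when the always-kept categories are a small share of total traffic.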
2.1 Custom callback for non-OTel workflows
Sometimes you need metrics or logs that don’t naturally fit into trace span attributes — for example, a fleet-wide compliance score, or the count of hosts behind on patches. For these, write a thin custom callback that emits to Prometheus pushgateway or directly to a metrics endpoint:
# callback_plugins/kv_metrics.py
import os

from ansible.plugins.callback import CallbackBase
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'kv_metrics'

    def __init__(self):
        super().__init__()
        # Private registry so only this run's metrics are pushed
        self.registry = CollectorRegistry()
        self.task_counter = Counter(
            'ansible_task_total', 'Tasks executed',
            ['template', 'play', 'task', 'status'],
            registry=self.registry,
        )
        self.task_duration = Histogram(
            'ansible_task_duration_seconds', 'Task duration',
            ['template', 'task'],
            buckets=[0.1, 0.5, 1, 5, 10, 30, 60, 300, 1800],
            registry=self.registry,
        )

    def v2_runner_on_ok(self, result):
        self._record(result, 'ok')

    def v2_runner_on_failed(self, result, ignore_errors=False):
        self._record(result, 'failed')

    def v2_runner_on_skipped(self, result):
        self._record(result, 'skipped')

    def _record(self, result, status):
        labels = {
            'template': os.environ.get('TOWER_JOB_TEMPLATE_NAME', 'cli'),
            # The role name stands in for the 'play' label here
            'play': result._task._role._role_name if result._task._role else 'no-role',
            'task': result._task.get_name(),
            'status': status,
        }
        self.task_counter.labels(**labels).inc()
        # ...

    def v2_playbook_on_stats(self, stats):
        # Push once at the end of the run, grouped by job template name
        push_to_gateway(
            os.environ['PROMETHEUS_PUSHGATEWAY'],
            job=os.environ.get('TOWER_JOB_TEMPLATE_NAME', 'cli'),
            registry=self.registry,
        )
Drop this in your execution environment’s callback_plugins/, list it in callbacks_enabled, set PROMETHEUS_PUSHGATEWAY=https://pushgateway.kv.local:9091, and every playbook run will emit task-level counters and histograms. This is the foundation for any custom metric you want.
3. AAP control plane metrics
AAP exposes a Prometheus-compatible metrics endpoint at /api/v2/metrics/. A minimal scrape config:
# prometheus.yml
scrape_configs:
  - job_name: aap-controller
    metrics_path: /api/v2/metrics/
    bearer_token: '{{ aap_metrics_token }}'
    scheme: https
    static_configs:
      - targets: ['aap.kv.local:443']
    relabel_configs:
      - source_labels: [__address__]
        target_label: aap_instance
The metrics AAP exposes natively (excerpts; metric and label names differ between AWX/AAP releases, so confirm them against your own endpoint’s output):
- awx_status_total{state="successful|failed|canceled|error"}: job result counters
- awx_running_jobs: current parallelism
- awx_pending_jobs: queued jobs (saturation signal)
- awx_instance_consumed_capacity: how busy each control node is
- awx_database_connections_total: Postgres connection pool usage
- awx_inventory_total / awx_organization_total: counts
- awx_subscription_total{type="hosts|nodes"}: license utilisation
The two metrics most worth alerting on:
# alert: control plane saturation
- alert: AAPControlPlaneSaturated
  expr: |
    avg by (aap_instance) (awx_instance_consumed_capacity)
      / avg by (aap_instance) (awx_instance_total_capacity)
      > 0.85
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "AAP control plane > 85% capacity for 15 minutes"
    description: "Schedule capacity scale-up; queue depth is rising."
# alert: job failure rate spike
- alert: AAPJobFailureRateSpike
  expr: |
    sum(rate(awx_status_total{state="failed"}[5m]))
      / sum(rate(awx_status_total[5m]))
      > 0.10
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "AAP job failure rate > 10% over 10 minutes"
    description: "Investigate template or environment regression."
These are the only two AAP-level alerts most teams need. Per-template alerts are usually too noisy and end up disabled within a quarter.
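Before committing these rules, it can help to sanity-check the two ratios against a raw scrape. The sketch below parses a simplified subset of the Prometheus text format with the standard library; the sample payload is hypothetical, the metric and label names simply follow this lesson's examples, and it compares lifetime totals rather than the 5-minute rates the alerts use.

```python
import re

# Hypothetical scrape output; names follow this lesson's examples and may
# differ on your controller version.
SAMPLE = """\
# HELP awx_status_total Status of Job launched
awx_status_total{state="successful"} 940
awx_status_total{state="failed"} 60
awx_instance_consumed_capacity{hostname="ctl-1"} 90
awx_instance_total_capacity{hostname="ctl-1"} 100
"""

SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple 'name{labels} value' lines; comments and blanks are skipped."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # e.g. lines carrying timestamps, which this sketch ignores
        name, labels, value = match.groups()
        samples.append((name, dict(LABEL_PAIR.findall(labels or "")), float(value)))
    return samples


def total(samples, name, **wanted):
    """Sum every sample of `name` whose labels include all `wanted` pairs."""
    return sum(value for metric, labels, value in samples
               if metric == name
               and all(labels.get(k) == v for k, v in wanted.items()))


def failure_ratio(samples):
    all_jobs = total(samples, "awx_status_total")
    return total(samples, "awx_status_total", state="failed") / all_jobs if all_jobs else 0.0


def capacity_ratio(samples):
    capacity = total(samples, "awx_instance_total_capacity")
    return total(samples, "awx_instance_consumed_capacity") / capacity if capacity else 0.0


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(failure_ratio(parsed), capacity_ratio(parsed))
```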
4. The unified dashboard taxonomy
A common failure mode is “we have 200 Grafana dashboards and nobody knows which one to open during an incident.” The fix is a strict three-layer dashboard taxonomy:
| Layer | Audience | Question it answers | Example |
|---|---|---|---|
| L1 — Platform health | Platform team, SRE | Is the automation platform itself healthy? | AAP control plane saturation, EDA rulebook activations, queue depth |
| L2 — Workload domain | Domain owners | Is my application’s automation healthy? | Per-business-app SLO dashboards, per-team failure rates |
| L3 — Investigation | On-call during incident | Why did this specific run fail? | Job-detail drill-down, traces, logs |
Every alert routes to a specific dashboard. The Slack message format is rigid:
🔴 AAP job failure rate > 10%
Severity: critical | Triggered: 14:03 | Active: 12m
Dashboard: https://grafana.kv.local/d/aap-l1-health (L1)
Investigation: https://grafana.kv.local/d/aap-l3-jobs (L3)
Runbook: https://wiki.kv.local/runbooks/aap-failure-spike
Every responder gets the same starting point. No “where do I look?” question.
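One way to keep that format rigid is to generate it from a single function rather than hand-writing it per alert. The sketch below is an assumption about how such a formatter might look; the function name, parameters, and the icon chosen for non-critical severities are all hypothetical.

```python
def format_alert_message(summary, severity, triggered, active_minutes,
                         l1_dashboard, l3_dashboard, runbook):
    """Render the fixed five-line alert message used for every notification."""
    icons = {"critical": "🔴", "warning": "🟠"}  # icon mapping is an assumption
    icon = icons.get(severity, "⚪")
    lines = [
        f"{icon} {summary}",
        f"Severity: {severity} | Triggered: {triggered} | Active: {active_minutes}m",
        f"Dashboard: {l1_dashboard} (L1)",
        f"Investigation: {l3_dashboard} (L3)",
        f"Runbook: {runbook}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_alert_message(
        "AAP job failure rate > 10%", "critical", "14:03", 12,
        "https://grafana.kv.local/d/aap-l1-health",
        "https://grafana.kv.local/d/aap-l3-jobs",
        "https://wiki.kv.local/runbooks/aap-failure-spike",
    ))
```

Because every field is a required argument, an alert without a dashboard or runbook link fails at definition time instead of reaching a responder incomplete.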
4.1 The L1 platform-health dashboard
Twelve panels:
- Job rate (5m): rate(awx_status_total[5m]) stacked by status, so traffic and errors are visible at once
- p95 job duration by template family: 95th percentile from the task-level histogram, histogram_quantile(0.95, sum by (le, template) (rate(ansible_task_duration_seconds_bucket[5m])))
- Control node CPU/memory: node_cpu_seconds_total, node_memory_* filtered to AAP nodes
- Postgres health (AAP database): connection pool, replication lag, slow queries
- Receptor mesh state: AAP’s internal mesh (node count, peer health, message queue depth)
- EDA rulebook activations: count of running rulebooks; the up metric for each rulebook process
- Inventory sync health: time since last successful sync per inventory; alert > 4h
- Webhook receivers: HTTP rates and error rates on AAP webhook endpoints
- Subscription / license: awx_subscription_total vs limit
- Top 10 slowest jobs (last 24h): tabular drill-down
- Top 10 most-failing templates (last 7d): tabular drill-down
- CHG-compliance metric: % of production jobs that ran with a change_request_number extra var
That last panel is the most underrated. It’s a single number that answers “is governance actually working?” Healthy organisations keep this at 100% (excluding the read-only template list). Drop below 99% and it’s a P2 incident — someone has bypassed the gate.
5. Logs: Loki and structured AAP output
AAP emits two distinct log streams:
- Job execution logs — stdout/stderr of every playbook run; written to disk and accessible via API
- Service logs — control plane internals (web tier, task scheduler, callback receiver)
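Both should land in Loki via promtail or Vector. The crucial discipline is structured logging: rather than free-form prose, use a callback such as community.general.log_plays to write one record per task with a JSON-encoded result:
[defaults]
callbacks_enabled = community.general.opentelemetry, community.general.log_plays
# ansible.cfg is not Jinja-templated, so {{ tower_job_id }} would be taken literally;
# for a per-job file, set ANSIBLE_LOG_PATH in the job's environment instead
log_path = /var/log/ansible/play.json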
Each JSON record contains ts, host, task, module, status, result.changed, result.msg. Loki labels include job_id, template_name, inventory, severity. The query language (LogQL) becomes precise:
# Find all failed `template` module tasks across all jobs in last hour
{template_name=~".+"}
| json
| status="failed"
| module="template"
| line_format "{{.task}} on {{.host}}: {{.result_msg}}"
That single query, displayed in a Grafana log panel beside the L1 metrics, lets on-call instantly see “what’s failing right now” without clicking into individual jobs.
For ad-hoc operator debugging, a pre-built saved query for “show me everything from job 12345”:
{job_id="12345"} | json | line_format "{{.host}} | {{.task}} | {{.status}} | {{.result_msg}}"
Same data, different filter. The point is that every log query from operators should use the structured fields, not full-text grep. Free-text search of multi-GB log streams is prohibitively expensive at scale; field-indexed query is fast.
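To make the contrast with full-text grep concrete, here is a rough Python equivalent of the first LogQL query above, run over JSON lines. It assumes each record has the fields described earlier (host, task, module, status, and a nested result.msg); the helper names are hypothetical, and the flattening mirrors how the json stage turns result.msg into result_msg.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into underscore-joined keys (result.msg -> result_msg)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def failed_tasks(lines, module):
    """Return formatted lines for failed tasks of one module, skipping bad input."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured record
        if not isinstance(record, dict):
            continue
        fields = flatten(record)
        if fields.get("status") == "failed" and fields.get("module") == module:
            matches.append(
                f"{fields.get('task')} on {fields.get('host')}: {fields.get('result_msg')}"
            )
    return matches


if __name__ == "__main__":
    sample = [
        '{"host": "web-01", "task": "Render vhost", "module": "template", '
        '"status": "failed", "result": {"changed": false, "msg": "undefined variable"}}',
        '{"host": "web-02", "task": "Render vhost", "module": "template", '
        '"status": "ok", "result": {"changed": true, "msg": ""}}',
    ]
    for entry in failed_tasks(sample, "template"):
        print(entry)
```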
5.1 Log retention discipline
Operations logs typically need 30-90 days of hot storage; compliance often mandates 1-7 years for production change records. Structure your Loki tiering accordingly:
- 0-7 days: hot storage, full indexing, sub-second query
- 7-90 days: warm storage, ~1s query, retained for incident investigation
- 90 days - 7 years: cold storage in S3/object lock, queryable but slow, primarily for compliance
Most environments use Loki’s built-in compactor to handle this; Mimir / Cortex have the same pattern for metrics. Without these tiers, observability storage cost runs away within a quarter.
6. Traces and the multi-host correlation problem
The single most useful capability traces unlock is per-host execution timeline visualisation. AAP’s UI shows a job’s task list serially. A trace shows the same job as a Gantt chart across hosts: host1 ran task A from 14:03:00 to 14:03:08, then waited 4 seconds, then ran task B; host2 ran task A from 14:03:01 to 14:03:25 (slow!) — and immediately you can see which host is the long pole.
The default Tempo + Grafana visualisation gives you this for free once OTel is wired. The skills to use it well:
Find slow tasks across a fleet: trace search with service.name = ansible-aap AND duration > 30s shows every task that took longer than 30 seconds, grouped by task name. Discover that template render on host group X is consistently slow → investigate filesystem latency on those hosts.
Find failures correlated by module: trace search with service.name = ansible-aap AND status = error AND ansible.task.module = systemd shows all systemd-related failures across the fleet, last 24h. Discover a pattern (specific service name on specific OS version) and fix it once.
Cross-system correlation: this is where traces really earn their keep. Wire OTel into your AAP webhook receiver, into Event-Driven Ansible, into the application code that triggered the workflow. Now a single trace shows: “User clicked Slack button → Slack webhook received by EDA → EDA fired remediation rulebook → AAP launched job → Ansible ran on host → host’s metric came back to normal.” That entire causal chain in one trace, linked by traceparent headers passed at every boundary.
Implementing the full chain requires:
- AAP webhook receiver propagates incoming traceparent into the launched job’s extra vars
- The Ansible callback plugin reads traceparent from extra vars and uses it as the parent context
- EDA’s rulebook engine, when triggering a job via API, propagates its current trace context
- Slack/Teams bots, when invoking AAP, set traceparent from their incoming request
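The propagation steps above all rely on the W3C traceparent header, which has the form version-traceid-parentid-flags. The sketch below validates an incoming header and prepares the value a hop would forward in a job's extra vars. It handles only the 55-character version 00 layout, and the helper names and the "traceparent" extra-var key are assumptions taken from this lesson, not an AAP or EDA API.

```python
import re
import secrets

TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's parts as a dict, or None if it is malformed."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(header):
    """Keep the trace ID and flags; use a fresh span ID for the forwarding hop."""
    parts = parse_traceparent(header)
    if parts is None:
        return None
    return f"{parts['version']}-{parts['trace_id']}-{secrets.token_hex(8)}-{parts['flags']}"


def with_trace_context(extra_vars, header):
    """Copy extra_vars and add the propagated context under a 'traceparent' key."""
    merged = dict(extra_vars)
    child = child_traceparent(header)
    if child is not None:
        merged["traceparent"] = child
    return merged


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(with_trace_context({"target_host": "prod-app-04"}, incoming))
```

A malformed header is dropped rather than forwarded, so a broken upstream client starts a new trace instead of corrupting an existing one.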
This is fiddly to set up but transformative once running. Mean time to root cause for “why did remediation fail?” drops from 30 minutes of cross-system investigation to one Grafana click.
7. The ServiceNow event correlation
Linking ITSM to observability is the final capstone wire. Two integration directions:
ServiceNow → metrics: Every CHG, INC, and PRB record event posts to a webhook that updates Prometheus metrics. You get metrics like:
- servicenow_chg_total{state, type, environment}: change rate by state
- servicenow_inc_total{priority, assignment_group, category}: incident rate
- servicenow_chg_lead_time_seconds_histogram: time from CHG.opened to CHG.implement
Metrics → ServiceNow: Alertmanager’s webhook receiver creates ServiceNow incidents directly:
# alertmanager.yml
receivers:
  - name: servicenow
    webhook_configs:
      - url: 'https://aap.kv.local/api/v2/job_templates/snow-create-inc/launch/'
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/aap-token
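The “snow-create-inc” job template runs a playbook that takes the Alertmanager payload, derives priority/category/assignment, and creates an INC via servicenow.itsm.incident. Because the controller’s launch endpoint addresses templates by ID or named URL and reads variables only from an extra_vars key, a thin relay (or an EDA rulebook with an Alertmanager event source) usually reshapes the payload first. Now every operationally significant alert has a ticket; every ticket auto-resolves when the alert clears.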
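The "derives priority/category/assignment" step might look like the following sketch. It reads the standard Alertmanager webhook shape (a top-level alerts list, each with status, labels, and annotations); the severity mapping, the category label, and the output field names are assumptions for illustration, not the module's required schema.

```python
# Hypothetical mapping from alert severity to incident impact/urgency.
SEVERITY_TO_LEVEL = {"critical": "high", "warning": "medium", "info": "low"}


def incidents_from_alertmanager(payload):
    """Build one incident dict per firing alert in an Alertmanager webhook payload."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts are handled by the close path instead
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        level = SEVERITY_TO_LEVEL.get(labels.get("severity"), "low")
        incidents.append({
            "short_description": annotations.get(
                "summary", labels.get("alertname", "unknown alert")),
            "description": annotations.get("description", ""),
            "impact": level,
            "urgency": level,
            "category": labels.get("category", "infrastructure"),  # assumed label
            "cmdb_ci": labels.get("instance", ""),
        })
    return incidents


if __name__ == "__main__":
    example = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "DiskSpaceLow", "severity": "critical",
                       "category": "disk", "instance": "prod-app-04"},
            "annotations": {"summary": "Disk below 10% free on /"},
        }],
    }
    print(incidents_from_alertmanager(example))
```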
The closed-loop pattern, end-to-end:
1. Host metric crosses threshold (e.g. disk > 90%)
2. Prometheus fires alert → Alertmanager → AAP webhook
3. AAP creates ServiceNow INC, priority computed from severity
4. EDA rulebook subscribed to "INC created with category=disk" fires
5. EDA launches "INC: Disk cleanup" job template
6. Job runs cleanup, verifies disk now < 80%
7. Job posts work note + resolves INC
8. Host metric returns to normal → Alertmanager fires resolved
9. AAP webhook closes any matching open INCs (idempotent)
In a healthy organisation this loop runs hundreds of times a day, with humans involved only on the long tail of cases the automation cannot handle. The metric to track is the auto-resolution rate — the percentage of incidents that closed without human intervention. Healthy mature platforms reach 60-80%; the remaining 20-40% are the genuinely novel issues humans should focus on.
8. SLOs as the contract
The thread that holds the whole observability story together is the Service Level Objective. For an automation platform, the SLOs that matter:
| SLO | Target | Measurement |
|---|---|---|
| Platform availability | 99.9% (≈ 8h downtime/year) | AAP /health/ returns 200 |
| Job success rate | 99% (excluding intentional failures) | awx_status_total{state="successful"} / awx_status_total |
| p95 job latency by template family | varies (e.g. patching < 30 min, config-drift < 5 min) | OTel-derived histogram |
| Mean time to resolution (auto-remediation) | < 5 min p95 | INC opened → INC resolved (where assignment_group matches automation) |
| Auto-resolution rate | > 60% | INC auto-resolved / INC total |
| CHG-compliance | 100% on production templates | jobs-with-chg / production-jobs |
| Backup restore drill success | 100% | drill-passed / drill-total (rolling 90d) |
The discipline: every SLO has an error budget. If your platform availability SLO is 99.9%, you have 0.1% of “budget” to spend on outages, deploys, and changes per quarter. When the budget is consumed, you must freeze risky changes until the budget recovers (typically over the next 30 days).
This is the SRE playbook applied to automation. The error budget aligns the platform team’s incentives: they want to ship features, but every failure consumes budget; therefore quality and reliability work earn the right to ship the next feature. Without this, the platform team always picks features over reliability, and the platform degrades over time.
A Grafana SLO dashboard panel:
Availability SLO: 99.9% (target)
Last 30 days: 99.94% (above target ✅)
Error budget remaining: 40% (17m 17s of 43m 12s)
Burn rate (1h): 0.4x (sustainable)
Burn rate (24h): 1.1x (sustainable)
When burn rate exceeds 14.4x for an hour or 6x for six hours (Google’s recommended thresholds), page on-call. Otherwise the SLO panel is just a quiet, daily health check.
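The arithmetic behind error budgets and burn rates is small enough to sketch directly. The function names below are hypothetical; the formulas are the standard ones (budget is one minus the SLO, burn rate is the observed error ratio divided by the budget), and the paging thresholds are the two quoted above.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime in the window, e.g. 0.999 over 30 days."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_fraction(slo: float, observed_availability: float) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    return 1.0 - (1.0 - observed_availability) / (1.0 - slo)


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def should_page(burn_1h: float, burn_6h: float) -> bool:
    """Page when the 1h burn rate exceeds 14.4x or the 6h burn rate exceeds 6x."""
    return burn_1h > 14.4 or burn_6h > 6.0


if __name__ == "__main__":
    slo = 0.999
    print(f"budget: {error_budget_minutes(slo, 30):.1f} min per 30 days")
    print(f"remaining: {budget_remaining_fraction(slo, 0.9994):.0%}")
    print(f"burn rate at 0.5% errors: {burn_rate(0.005, slo):.1f}x")
    print(f"page: {should_page(burn_rate(0.02, slo), 0.4)}")
```

A burn rate of 1.0x means the budget would be exactly exhausted at the end of the window; 14.4x sustained for one hour consumes about 2% of a 30-day budget.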
9. The closed feedback loop in practice
The thing that makes all of this actually transformative is when alerts trigger automation that closes the alert — and the loop is observable end-to-end. The full lifecycle:
Step 1: A node_exporter metric on prod-app-04 shows node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10.
Step 2: Prometheus fires DiskSpaceLow alert. Alertmanager routes to webhook receiver.
Step 3: AAP Create-INC template runs. Creates ServiceNow INC with category=disk, priority=2, cmdb_ci=prod-app-04. Trace ID abc123.
Step 4: EDA rulebook incident-remediation polls ServiceNow every 30s, sees the new INC, matches the rule for “disk full,” calls AAP API to launch INC-disk-cleanup template. Propagates abc123 as traceparent.
Step 5: AAP runs the cleanup playbook on prod-app-04. The OTel callback plugin uses abc123 as the trace root. Tasks run, logs flow to Loki tagged with traceparent=abc123.
Step 6: Cleanup succeeds. df shows disk now at 67%. Playbook posts work note, resolves INC, sets close_code=“Solved (Permanently)”.
Step 7: 60 seconds later, node_exporter scrape shows disk back below threshold. Prometheus fires DiskSpaceLow resolved.
Step 8: Alertmanager sends resolved notification. AAP webhook receives it, looks for any open INCs matching this CI + alert; the INC is already resolved, so this is a no-op (idempotent).
The Grafana dashboard for this single incident shows:
- A trace tree starting at the alert webhook, through INC creation, through job launch, through every Ansible task, ending in the resolved INC update, all under abc123
- Metric panels showing the disk utilisation falling exactly when the cleanup tasks ran
- Log panels filtered by traceparent=abc123 showing every line of every task
- ServiceNow INC linked from the trace root span
That single screen tells the story of one auto-healed incident with zero ambiguity. Operationally, this is gold — but it is also the evidence artefact an auditor wants when asking “show me an example of how your automation responds to incidents.” One trace ID, one Grafana link, one minute to walk through the full chain.
10. Operational rituals that keep observability healthy
A surprising failure mode: organisations build great observability, then it decays over months. Three rituals prevent this:
Weekly observability review (15 min): Platform team reviews:
- Top 10 noisiest alerts by volume — are they actionable? If not, fix or delete.
- Any dashboards that haven’t been opened in 30 days — delete (yes, really).
- Any metrics with cardinality > 100k — investigate; usually a label runaway bug.
- SLO burn rates — anything above 1.0x merits investigation.
Monthly chaos game day: Pick one specific failure mode (control plane node down, Postgres replica lagging, Loki ingest backlogged) and verify your alerting catches it within target latency. Failures here mean the alert is misconfigured; fix it before the real outage.
Quarterly dashboard pruning: Each domain owner reviews their L2 dashboards. If a panel has fired no useful insight in 90 days, either rewrite it to be useful or remove it. Dashboards bloat over time; aggressive pruning keeps them readable.
The principle: observability is a product, not a one-time build. It needs roadmap, ownership, and continuous quality work. The orgs that get this wrong end up with massive observability bills, dashboards no one looks at, alerts that fire constantly and are universally muted — and they then re-build the whole thing every two years. The orgs that get it right have a stable, slowly-evolving observability layer that sustains for a decade.
11. Cardinality discipline (the silent killer)
A specific failure mode worth its own section: metric cardinality explosion. Prometheus is fast and cheap if you keep cardinality bounded; it falls over hard if you don’t.
Examples of cardinality bombs:
# BAD — user_id has unbounded cardinality
rate(api_request_total{user_id="$user"}[5m])
# BAD — unique trace IDs as label
counter.add(traceparent=$traceparent)
# BAD — host names without bucketing
ansible_task_total{host="host-12345.cluster.local"}
Every unique label value combination is a separate time series. 10 templates × 100 hosts × 50 tasks × 4 statuses = 200k series. Add host_ip_address as a label and now it’s 200k × 1k IPs = 200M series. Prometheus crashes.
The discipline:
- Bucket high-cardinality fields: template_family (10 values) instead of template_name (1000 values); record the full template_name in the log, not the metric
- Never use IDs as labels: trace IDs, request IDs, user IDs, transaction IDs go in logs and traces, not metrics
- Drop labels you don’t query on: every label you add costs forever; if no PromQL query in your repo references it, delete it
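A quick way to apply this discipline at review time is to estimate the worst-case series count for a metric before it ships. The sketch below multiplies the number of distinct values per label, which is an upper bound that assumes the labels vary independently; the function name and the example label sets are hypothetical.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict) -> int:
    """Worst-case number of time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    per_host = {"template": 10, "host": 100, "task": 50, "status": 4}
    print(series_upper_bound(per_host))  # 10 * 100 * 50 * 4

    # Adding one more high-cardinality label multiplies the bound again.
    with_ip = dict(per_host, host_ip_address=1000)
    print(series_upper_bound(with_ip))

    # Bucketing hosts into a handful of groups shrinks it instead.
    bucketed = {"template": 10, "host_group": 5, "task": 50, "status": 4}
    print(series_upper_bound(bucketed))
```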
A continuous monitoring metric:
# alert if any single metric exceeds 100k series
- alert: HighCardinalityMetric
  expr: |
    count by (__name__) ({__name__=~".+"}) > 100000
  for: 30m
This catches cardinality bombs within an hour of introduction, before they impact ingestion performance.
12. Budgeting and capacity
Final practical concerns. Observability is not free, and ungoverned observability costs grow faster than the workload they observe.
Rough annual costs at scale (industry typical, varies by vendor and region):
- Metrics: $0.15-1.00 per active series per year. 1M series ≈ $150k-1M/year. Aggressive aggregation cuts this 5-10x.
- Logs: $0.50-2.00 per GB ingested. 5TB/day ≈ $1-4M/year. Tiered storage and structured-log compression cut this 3-5x.
- Traces: $0.10-0.50 per million spans. With 10% sampling, 1B spans/day leaves about 36.5B retained spans/year ≈ $4k-18k/year at these unit rates.
The most common pattern: an organisation spends $X on the cloud workload they’re observing and $0.5-2X on observing it. That’s normal. When it exceeds 2X you have a quality problem (label explosion, log spam, no sampling) — fix the discipline, not the budget.
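To check a budget against unit prices like the ones listed above, the arithmetic can be kept in a small helper. This is only a back-of-the-envelope sketch: the function names are hypothetical, the unit prices are inputs rather than quotes from any vendor, and real bills include retention, query, and egress charges that are not modelled here.

```python
def annual_metrics_cost(active_series: float, usd_per_series_year: float) -> float:
    """Metrics are priced per active series per year."""
    return active_series * usd_per_series_year


def annual_logs_cost(gb_per_day: float, usd_per_gb: float) -> float:
    """Logs are priced per GB ingested."""
    return gb_per_day * 365 * usd_per_gb


def annual_traces_cost(spans_per_day: float, sampling_ratio: float,
                       usd_per_million_spans: float) -> float:
    """Traces are priced per million spans retained after sampling."""
    retained_per_year = spans_per_day * sampling_ratio * 365
    return retained_per_year / 1_000_000 * usd_per_million_spans


def observability_ratio(observability_usd: float, workload_usd: float) -> float:
    """Observability spend as a multiple of the workload spend it observes."""
    return observability_usd / workload_usd


if __name__ == "__main__":
    total = (annual_metrics_cost(1_000_000, 0.15)
             + annual_logs_cost(5_000, 0.50)
             + annual_traces_cost(1e9, 0.10, 0.10))
    print(f"low-end annual estimate: ${total:,.0f}")
    # Hypothetical workload spend of $2M/year
    print(f"ratio to workload: {observability_ratio(total, 2_000_000):.2f}x")
```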
Capacity planning for the observability stack itself:
- Prometheus / Mimir: scale by active series (target < 5M per node)
- Loki: scale by ingest rate (target < 50MB/s per ingester)
- Tempo: scale by spans per second (target < 100k spans/s per ingester)
- Grafana: scale by concurrent dashboards (target < 100 concurrent users per node)
Run synthetic load tests quarterly to verify the headroom. Never let any component exceed 70% utilisation in steady state — the spike on the day of an incident will push it over.
13. Where this leaves you
You have just completed Tier 5 of this Ansible course. Across ten lessons we’ve covered:
- D1 — Compliance: STIG, CIS, OpenSCAP, signed evidence
- D2 — Disaster recovery: hybrid topology, DR drills, RTO/RPO discipline
- D3 — Migrations: P2V, V2V, leapp, RHEL major upgrades
- D4 — Air-gap: soft / sneakernet / data-diode archetypes
- D5 — SAP: HANA, Netweaver, redhat.sap collections
- D6 — Edge/IoT: pull-mode, bootc, k3s+fleet
- D7 — ITSM/ChatOps: ServiceNow CMDB, CHG-gating, Slack/Teams approval
- D8 — Backup: 3-2-1-1-0, immutability, restore drills
- D9 — Database migrations: expand-contract, blue-green, online DDL
- D10 — Observability (this lesson): the closed feedback loop
What you should walk away with is the conviction that mature automation in a regulated enterprise is not a single tool or playbook. It is an interoperable system of disciplines: governance via ITSM, content via Ansible, evidence via signed bundles, recovery via tested DR, scale via fleet patterns, and visibility via observability. None of these alone is sufficient; together they form a platform that auditors trust, executives can defend, and engineers actually want to use.
The next steps from here depend on your role:
- Platform engineer: pick the lesson with the biggest gap in your environment. For most orgs that is D8 (backup automation) or D7 (ITSM integration). Ship one of these in the next quarter.
- SRE / Operations: implement D10 (this lesson) end-to-end. The other lessons amplify their value once the observability layer exists.
- Architect / Tech lead: use Tier 5 as a maturity assessment. For each pillar, ask “what is our current state? What is our 12-month target? What does the gap require?”
- Engineering leader: this curriculum maps to a 12-24 month maturity transformation. Treat it as a programme, not a project. The teams that deliver each pillar in 3-month iterations succeed; the teams that try to do all ten simultaneously fail.
The hardest lesson across this whole course is also the simplest: automation is a cultural and organisational artefact as much as a technical one. The playbooks are the easy part. The disciplines — change-management, evidence-bundling, restore-testing, SLO-budgeting — are what separate organisations whose automation actually works from those whose automation is a slide deck. This curriculum exists to give you the patterns. Whether they take root depends on the people, leadership, and engineering culture you build around them.
That’s what makes the journey worth it.