Ansible for Security Compliance, In Depth — STIG, CIS, OpenSCAP and Policy-as-Code
Compliance work is the part of operations that breaks otherwise good engineers. The frameworks themselves are not the problem; the problem is that they are written for humans, in PDFs, with thousands of pages of conditional rules, and the auditor expects you to prove every rule on every host on a quarterly cadence. That is impossible with checklists. It is straightforward with Ansible, OpenSCAP, and a little discipline.
This lesson is the specialist-tier playbook for operationalising security compliance: how to translate DISA STIG content, the CIS Benchmarks, and PCI/HIPAA/SOC 2 control families into Ansible roles that harden, scan, remediate, and re-scan automatically — and how to produce the evidence packages your auditors actually accept.
We will cover RHEL 9 (the canonical STIG target), Ubuntu 22.04 LTS (the canonical CIS target), and Windows Server 2022 (which has both DISA STIG content and a Microsoft baseline). We will not cover the dozen vendor “compliance scanners” that pretend Ansible does not exist; we will use the open-source toolchain (openscap, ansible-hardening, ansible-lockdown) that the regulators themselves consume.
Position in the curriculum. This lesson assumes you are comfortable with everything from Tier 1 through Tier 4 — roles, dynamic inventory, AAP workflows, vault, no_log discipline, and at least one cloud collection. Compliance work touches every layer, so we will reference earlier lessons frequently.
What “compliance” really means in the Ansible context
A common mistake is to treat compliance as a one-time hardening pass. It is not. Compliance is a control loop, and Ansible is the actuator. The loop has four moving parts:
- Policy — the authoritative document. STIG, CIS, PCI-DSS, HIPAA Security Rule, SOC 2 CC6.x. You do not edit these; you encode them.
- Configuration — the desired state of every host, derived from the policy. This lives in your Ansible roles and host/group vars.
- Drift detection — the scan. Run OpenSCAP (oscap) or vendor tooling against each host on a schedule and compare actual to desired.
- Remediation — the playbook that reasserts desired state when drift is found. This is just ansible-playbook with the right roles loaded; the same code that originally hardened the host re-hardens it.
The whole loop is the deliverable. Auditors do not ask “is this host compliant today?” They ask “show me how you know it is compliant every day, who can break it, and how fast you re-converge.” The answer is your AAP workflow, your scan reports, your RBAC, and your remediation history — not a PDF.
Module 1 — The standards landscape
Before any code, you need a working mental model of the documents you are encoding.
| Standard | Maintainer | Format | Coverage | Best for |
|---|---|---|---|---|
| DISA STIG | US Department of Defense | XCCDF + OVAL XML, plus .zip of supporting docs | RHEL 7/8/9, Ubuntu 20/22, Win 2019/2022, AIX, Solaris, networking gear | Federal/DoD, FedRAMP High, regulated contractors |
| CIS Benchmarks | Center for Internet Security | PDF + XCCDF (for members) | Hundreds of products: every major OS, K8s, AWS/Azure/GCP, M365, browsers | Commercial sector, ISO 27001, PCI-DSS |
| ANSSI BP-028 | French ANSSI | PDF + XCCDF | RHEL/Debian | EU public sector |
| Microsoft Security Baseline | Microsoft | GPO + XML + scripts | Windows Server / Client / M365 | Pure Microsoft estates |
| OpenSCAP / SCAP Security Guide (ssg) | upstream open source | XCCDF + OVAL + ansible playbook generator | RHEL/CentOS/Fedora/Ubuntu/Debian | Anyone — this is the engine that runs the others |
You will use two layers of tooling no matter which standard you pick:
- The standard’s content pack — XCCDF + OVAL XML, the machine-readable rules.
- OpenSCAP (oscap CLI) — the scanner that consumes that XML and emits a report.
OpenSCAP is the single piece that ties the world together. RHEL ships it in the base repos (dnf install scap-security-guide openscap-scanner). Ubuntu ships it as the ssg-base, ssg-debderived, and libopenscap8 packages. Windows uses OVAL directly through PowerShell DSC or oscap-windows.
The canonical artefact pair
Every compliance project produces two artefacts, and the auditor wants both:
- The remediation playbook — Ansible code that brings the host into compliance. You either write it, or generate it from the SCAP profile with oscap xccdf generate fix --fix-type ansible.
- The scan report — the HTML/XML output of oscap xccdf eval showing every rule and pass/fail. This is the evidence.
These two artefacts must reference the same profile ID. If they don’t, the auditor will reject the package. The profile ID is the key — for example xccdf_org.ssgproject.content_profile_stig is the RHEL 9 STIG profile shipped by scap-security-guide.
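The "same profile ID" requirement can be checked mechanically before a package goes out. The following is a minimal sketch, assuming the results file is XCCDF 1.2 (the namespace used by SSG content) and that the TestResult element carries a profile child with an idref attribute; the function name scan_profile_id and the sample XML are illustrative, not part of any tool.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: XCCDF 1.2 namespace; adjust if your content uses 1.1.
XCCDF_NS = {"x": "http://checklists.nist.gov/xccdf/1.2"}

def scan_profile_id(results_xml: str) -> Optional[str]:
    """Return the profile idref recorded in an XCCDF TestResult, if any."""
    root = ET.fromstring(results_xml)
    profile = root.find(".//x:TestResult/x:profile", XCCDF_NS)
    return None if profile is None else profile.get("idref")

# Tiny hand-written sample standing in for a real results file.
SAMPLE = """<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.2">
  <TestResult id="xccdf_org.open-scap_testresult_demo">
    <profile idref="xccdf_org.ssgproject.content_profile_stig"/>
  </TestResult>
</Benchmark>"""

EXPECTED = "xccdf_org.ssgproject.content_profile_stig"
print(scan_profile_id(SAMPLE) == EXPECTED)
```

Running the same comparison against the profile named in the role's README or vars file is one way to catch a mismatched pair before the auditor does.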
What Ansible owns vs. what OpenSCAP owns
Keep this division of labour clear or you will fight your tooling:
- OpenSCAP scans. It reads the XCCDF/OVAL definitions and emits per-rule pass/fail. It can also emit a remediation script (Bash, Ansible, or Puppet), but the remediation it generates is shallow and rule-by-rule. Do not ship that remediation to production.
- Ansible enforces. Your hand-curated roles (informed by the SCAP rules but written for your environment) are what actually configure the host. Ansible owns the desired state.
Why hand-curate instead of oscap-generate? Because compliance rules conflict with operational reality. STIG V-230229 says “audit daemon must drop logs to disk every 50 events”; your Splunk forwarder needs audispd running. STIG V-230502 says “remove nfs-utils”; your shared storage runs over NFS. The generated playbook will happily delete nfs-utils and break production. You write the role; OpenSCAP only reports.
Module 2 — Building a hardening role: the structural pattern
This is the canonical layout I use for every compliance role I have shipped in the last seven years. It scales from a single host to ten thousand and from one standard to a dozen overlapping ones.
roles/
└── kv.compliance_rhel9/
    ├── defaults/
    │   └── main.yml          # all togglable controls, on by default
    ├── handlers/
    │   └── main.yml          # service reloads, sysctl reapply
    ├── meta/
    │   └── main.yml          # min ansible_version, dependencies
    ├── tasks/
    │   ├── main.yml          # dispatcher — imports per-section files
    │   ├── 01_account.yml    # account & password policy (STIG V-230x)
    │   ├── 02_audit.yml      # auditd & audit rules (STIG V-230x)
    │   ├── 03_filesystem.yml # mount options, permissions
    │   ├── 04_kernel.yml     # sysctl, kernel modules
    │   ├── 05_logging.yml    # rsyslog, journald, log retention
    │   ├── 06_pam.yml        # PAM password quality, faillock
    │   ├── 07_ssh.yml        # sshd_config hardening
    │   ├── 08_firewall.yml   # firewalld zones & rich rules
    │   ├── 09_selinux.yml    # SELinux mode + booleans
    │   └── 10_packages.yml   # remove banned, install required
    ├── templates/
    │   ├── auditd.conf.j2
    │   ├── sshd_config.j2
    │   └── pwquality.conf.j2
    └── vars/
        └── main.yml          # rule IDs, control mappings
The pattern in tasks/main.yml is just a dispatcher with explicit imports — never use dynamic include_tasks here, because we want a single deterministic execution order under --check:
# roles/kv.compliance_rhel9/tasks/main.yml
- name: Section 01 — Account and password policy
  ansible.builtin.import_tasks: 01_account.yml
  tags: [account, stig, cat2]
- name: Section 02 — Audit subsystem
  ansible.builtin.import_tasks: 02_audit.yml
  tags: [audit, stig, cat1]
- name: Section 03 — Filesystem hardening
  ansible.builtin.import_tasks: 03_filesystem.yml
  tags: [filesystem, stig, cat2]
# ...etc
Tagging convention
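Tag every task with three things so the auditor and the operator can both navigate (tasks that implement a single rule also carry that rule's ID as a fourth tag, as in Module 3):
- The section (account, audit, ssh, …)
- The standard (stig, cis, pci)
- The severity (cat1/cat2/cat3 for STIG, level1/level2 for CIS)
Be careful when combining them. Ansible treats multiple --tags values as a union, so --tags stig,cat1 runs every task tagged stig or cat1, which is the entire STIG set. To apply only the highest-severity STIG controls, run --tags cat1 (cat1 is a STIG-only tag in this convention); when you need a true intersection, pair --tags with --skip-tags. That is exactly the dance you do during break-fix windows when you have 30 minutes and need to fix the bleeding.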
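Ansible selects a task when it carries any of the tags passed to --tags, then drops it if it carries any tag passed to --skip-tags. The sketch below models that selection in plain Python so the effect of a tag expression can be reasoned about before a break-fix window; the task names are hypothetical and the special tags (always, never) are deliberately left out.

```python
def selected(task_tags, tags=(), skip_tags=()):
    """Model of the tag filter: run if ANY --tags value matches,
    unless ANY --skip-tags value matches."""
    wanted = not tags or bool(set(task_tags) & set(tags))
    skipped = bool(set(task_tags) & set(skip_tags))
    return wanted and not skipped

# Hypothetical tasks tagged with section, standard, severity.
TASKS = {
    "sshd_config baseline": ["ssh", "stig", "cat1"],
    "login banner": ["ssh", "stig", "cat3"],
    "cramfs disabled": ["filesystem", "cis", "level1"],
}

def plan(tags=(), skip_tags=()):
    """Names of the tasks that would run for a given tag expression."""
    return [name for name, t in TASKS.items() if selected(t, tags, skip_tags)]

# Union: every task tagged stig OR cat1, so the CAT III banner task runs too.
print(plan(tags=["stig", "cat1"]))
# Intersection by subtraction: STIG tasks minus the lower severities.
print(plan(tags=["stig"], skip_tags=["cat2", "cat3"]))
```

On a real playbook, ansible-playbook --list-tasks with the same tag arguments gives the authoritative answer; the model is only there to make the union semantics visible.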
Defaults file: the togglable surface
Every control must be a variable so operators can override per host or group. Defaults live in defaults/main.yml and are on for compliance by default:
# roles/kv.compliance_rhel9/defaults/main.yml
# --- Section 01: account & password ---
kv_compliance_password_minlen: 15 # STIG V-230367
kv_compliance_password_max_days: 60 # STIG V-230367
kv_compliance_password_remember: 5 # STIG V-230368
kv_compliance_account_lockout_attempts: 3 # STIG V-230333
kv_compliance_account_lockout_unlock: 0 # 0 = require admin
kv_compliance_password_quality_dcredit: -1
kv_compliance_password_quality_ucredit: -1
kv_compliance_password_quality_lcredit: -1
kv_compliance_password_quality_ocredit: -1
kv_compliance_password_quality_minclass: 4
# --- Section 02: audit ---
kv_compliance_auditd_max_log_file_mb: 8 # STIG V-230398
kv_compliance_auditd_disk_full_action: "halt" # CAT 1 says halt; we soften to syslog in non-prod
kv_compliance_auditd_space_left_mb: 100
# --- Section 07: ssh ---
kv_compliance_ssh_permit_root: false # STIG V-230296
kv_compliance_ssh_password_auth: false # CIS 5.2.10
kv_compliance_ssh_max_auth_tries: 3
kv_compliance_ssh_login_grace_time: 60
kv_compliance_ssh_client_alive_interval: 300
kv_compliance_ssh_client_alive_max: 3
kv_compliance_ssh_allowed_users: [] # group_vars overrides
# --- Section 09: SELinux ---
kv_compliance_selinux_state: "enforcing"
kv_compliance_selinux_policy: "targeted"
Two non-obvious points here:
- Every default is named after the standard’s variable convention (in this case ssg’s var_password_pam_minlen → our kv_compliance_password_minlen). This means the auditor can grep your repo for the rule and find your value in seconds.
- Every default has a comment with the rule ID. This is the single highest-leverage habit you can build. You will thank past-you a thousand times.
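Because the rule IDs live in trailing comments, a rule-to-variable index can be generated rather than maintained by hand. This is a minimal sketch that assumes the comment style shown in the defaults file (a trailing "# STIG V-…" or "# CIS x.y.z"); the function name rule_index and the regular expression are illustrative.

```python
import re

# Matches: variable: value   # STIG V-230368   (or "# CIS 5.2.10")
LINE = re.compile(
    r"^(?P<var>\w+):\s*(?P<value>[^#]*?)\s*#\s*(?P<std>STIG|CIS)\s+(?P<rule>[\w.\-]+)"
)

def rule_index(defaults_text):
    """Map 'STD RULE' to the list of variables whose comment cites that rule."""
    index = {}
    for line in defaults_text.splitlines():
        m = LINE.match(line.strip())
        if m:
            index.setdefault(f"{m['std']} {m['rule']}", []).append(m["var"])
    return index

SAMPLE = """
kv_compliance_password_minlen: 15                # STIG V-230367
kv_compliance_password_max_days: 60              # STIG V-230367
kv_compliance_ssh_password_auth: false           # CIS 5.2.10
kv_compliance_auditd_space_left_mb: 100
"""

for rule, variables in rule_index(SAMPLE).items():
    print(rule, variables)
```

Variables without a rule comment, such as the last line of the sample, simply do not appear in the index, which makes the gaps easy to spot.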
Module 3 — A real STIG control, end to end
Let us encode one rule completely so the pattern sticks. We will pick STIG V-230230 — “RHEL 9 must have the AIDE package installed”. AIDE is a file-integrity monitor; it is mandatory for CAT II.
The rule, in plain English
The Advanced Intrusion Detection Environment (AIDE) package must be installed. AIDE provides file integrity monitoring; the absence of AIDE means a finding of CAT II.
The Ansible task
# roles/kv.compliance_rhel9/tasks/02_audit.yml (excerpt)
- name: STIG V-230230 — AIDE package installed
  ansible.builtin.dnf:
    name: aide
    state: present
  tags: [audit, stig, cat2, V-230230]
- name: STIG V-230229 — AIDE database initialised once
  ansible.builtin.command: aide --init
  args:
    creates: /var/lib/aide/aide.db.new.gz
  tags: [audit, stig, cat2, V-230229]
- name: STIG V-230229 — AIDE database moved into place
  ansible.builtin.copy:
    src: /var/lib/aide/aide.db.new.gz
    dest: /var/lib/aide/aide.db.gz
    remote_src: true
    owner: root
    group: root
    mode: "0600"
  tags: [audit, stig, cat2, V-230229]
- name: STIG V-230229 — AIDE timer enabled
  ansible.builtin.systemd_service:
    name: aide-check.timer
    enabled: true
    state: started
  tags: [audit, stig, cat2, V-230229]
Notice the rule ID is in both the name and the tag list. This is non-negotiable. When the scanner reports xccdf_org.ssgproject.content_rule_package_aide_installed failed, you can grep for V-230230 and land on the exact tasks that should fix it.
The OpenSCAP equivalent
For comparison, here is what oscap xccdf generate fix --fix-type ansible emits for the same rule:
- name: 'Install aide Package'
  package:
    name: aide
    state: present
  tags:
    - CCE-83438-7
    - PCI-DSS-Req-11.5
    - NIST-800-53-CM-3
    - DISA-STIG-RHEL-09-651010
    - low_severity
    - low_complexity
    - low_disruption
It looks fine, and it is idempotent, but it is incomplete for your environment: it does not initialise the database, it does not enable the timer, and it does not set the mode or ownership of the database file. That is why you write the role yourself and use SCAP only to scan.
The handler (if required)
For rules that touch a service config, you wire a handler:
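# roles/kv.compliance_rhel9/handlers/main.yml
- name: Reload sshd
  ansible.builtin.systemd_service:
    name: sshd
    state: reloaded
# The name must match the "notify: Reload audit rules" used by the rule-fragment task
- name: Reload audit rules
  ansible.builtin.command: augenrules --load
  notify: Restart auditd
# systemctl restart is refused for auditd on RHEL, so go through the service wrapper
- name: Restart auditd
  ansible.builtin.command: service auditd restart
Note that systemctl restart auditd is refused on RHEL: the unit file sets RefuseManualStop=yes, so the restart has to go through the legacy service wrapper (service auditd restart). augenrules --load compiles the fragments under /etc/audit/rules.d/ and loads them into the kernel; the restart is what picks up changes to auditd.conf. This is one of the single most-broken patterns in copy-pasted STIG roles. Get it right.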
Module 4 — SSH hardening: the single most-violated control family
If you only do one section, do SSH. Auditors look at it first. Here is the canonical template — fully variable-driven, fully tagged, fully traceable to STIG and CIS:
# roles/kv.compliance_rhel9/templates/sshd_config.j2
# Managed by Ansible — do not edit by hand.
# Source: roles/kv.compliance_rhel9/templates/sshd_config.j2
# STIG profile: stig | CIS profile: cis
Port 22
AddressFamily inet
ListenAddress 0.0.0.0
# STIG V-230296 / CIS 5.2.8 — root login forbidden
PermitRootLogin {{ 'no' if not kv_compliance_ssh_permit_root else 'prohibit-password' }}
# STIG V-230282 / CIS 5.2.10
PasswordAuthentication {{ 'yes' if kv_compliance_ssh_password_auth else 'no' }}
PermitEmptyPasswords no
# STIG V-230288
PubkeyAuthentication yes
ChallengeResponseAuthentication no
KbdInteractiveAuthentication no
# STIG V-230285 / CIS 5.2.18
MaxAuthTries {{ kv_compliance_ssh_max_auth_tries }}
LoginGraceTime {{ kv_compliance_ssh_login_grace_time }}
MaxSessions 4
# STIG V-230381 — idle session termination
ClientAliveInterval {{ kv_compliance_ssh_client_alive_interval }}
ClientAliveCountMax {{ kv_compliance_ssh_client_alive_max }}
# STIG V-230332 — banner
Banner /etc/issue.net
# STIG V-230290 — strong KEX/MAC/cipher only (FIPS-aligned)
KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org,diffie-hellman-group16-sha512
MACs hmac-sha2-512,hmac-sha2-256
Ciphers aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes128-ctr
HostKeyAlgorithms ssh-ed25519,rsa-sha2-512,rsa-sha2-256
# STIG V-230289 — protocol 2 only
Protocol 2
UseDNS no
UsePAM yes
X11Forwarding no
PrintMotd no
AcceptEnv LANG LC_*
Subsystem sftp /usr/libexec/openssh/sftp-server -f AUTHPRIV -l INFO
{% if kv_compliance_ssh_allowed_users %}
AllowUsers {{ kv_compliance_ssh_allowed_users | join(' ') }}
{% endif %}
The accompanying task:
# roles/kv.compliance_rhel9/tasks/07_ssh.yml
- name: STIG V-230282..230296 — sshd_config baseline
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -T -f %s   # validate before swap
  notify: Reload sshd
  tags: [ssh, stig, cis, cat1]
- name: STIG V-230290 — host key permissions
  ansible.builtin.file:
    path: "{{ item }}"
    owner: root
    group: ssh_keys
    mode: "0640"
  loop:
    - /etc/ssh/ssh_host_rsa_key
    - /etc/ssh/ssh_host_ed25519_key
    - /etc/ssh/ssh_host_ecdsa_key
  failed_when: false   # not all key types exist on every host
  tags: [ssh, stig, cat2]
- name: STIG V-230332 — login banner
  ansible.builtin.copy:
    dest: /etc/issue.net
    content: |
      You are accessing a U.S. Government (USG) Information System (IS).
      [...full DoD banner content...]
    owner: root
    group: root
    mode: "0644"
  tags: [ssh, stig, cat3]
The validate: parameter is the magic that makes this safe. sshd -T -f %s parses the candidate config; if it fails, template aborts and the host keeps the working config. Without validate:, a typo in the template can lock you out of every host in your fleet at the same time. Always use validate on services you depend on for management access.
Module 5 — auditd rule files
Audit rules are the second-most-violated family. The pattern is to ship rule fragments under /etc/audit/rules.d/ and let augenrules --load compile them.
# roles/kv.compliance_rhel9/tasks/02_audit.yml (continued)
- name: STIG V-230398..230428 — audit rules (drop-in fragments)
  ansible.builtin.copy:
    src: "rules.d/{{ item }}"
    dest: "/etc/audit/rules.d/{{ item }}"
    owner: root
    group: root
    mode: "0640"
  loop:
    - 10-base-config.rules
    - 30-stig.rules        # 200+ rule entries
    - 40-local.rules
    - 99-finalize.rules
  notify: Reload audit rules
  tags: [audit, stig, cat2]
- name: STIG V-230401 — auditd max log file size
  community.general.ini_file:
    path: /etc/audit/auditd.conf
    section: null
    option: max_log_file
    value: "{{ kv_compliance_auditd_max_log_file_mb }}"
    no_extra_spaces: false   # auditd.conf expects "key = value" with spaces around "="
  notify: Restart auditd
  tags: [audit, stig, cat2]
A representative rule fragment (30-stig.rules — drop in files/rules.d/):
## STIG V-230402 — audit time-change syscalls
-a always,exit -F arch=b64 -S adjtimex,settimeofday,clock_settime -k time-change
-a always,exit -F arch=b32 -S adjtimex,settimeofday,clock_settime -k time-change
## STIG V-230410 — audit ownership changes
-a always,exit -F arch=b64 -S chown,fchown,fchownat,lchown -F auid>=1000 -F auid!=-1 -k perm_mod
## STIG V-230412 — audit DAC changes
-a always,exit -F arch=b64 -S chmod,fchmod,fchmodat -F auid>=1000 -F auid!=-1 -k perm_mod
## STIG V-230418 — audit failed unauthorised access
-a always,exit -F arch=b64 -S open,openat,creat,truncate,ftruncate -F exit=-EACCES -F auid>=1000 -F auid!=-1 -k access
-a always,exit -F arch=b64 -S open,openat,creat,truncate,ftruncate -F exit=-EPERM -F auid>=1000 -F auid!=-1 -k access
These follow the SSG content closely: grep -A2 audit_rules_time_change /usr/share/scap-security-guide/ansible/rhel9-playbook-stig.yml shows the upstream equivalents. If the auditor compares yours to ssg, the syscalls and filters should line up; be ready to explain any cosmetic differences such as key names.
Module 6 — Driving OpenSCAP from Ansible
The scanning side closes the loop. The pattern is scan → fail open → publish report → never hide failures:
# roles/kv.compliance_rhel9/tasks/scan.yml — separate role/playbook
- name: Ensure scanner present
  ansible.builtin.dnf:
    name:
      - openscap-scanner
      - scap-security-guide
    state: present
- name: Ensure report directory exists
  ansible.builtin.file:
    path: /var/log/openscap
    state: directory
    owner: root
    group: root
    mode: "0750"
  tags: [scan, never]
- name: Run OpenSCAP scan
  ansible.builtin.command: >
    oscap xccdf eval
    --profile xccdf_org.ssgproject.content_profile_stig
    --results /var/log/openscap/{{ inventory_hostname }}-results.xml
    --report /var/log/openscap/{{ inventory_hostname }}-report.html
    --fetch-remote-resources
    /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
  register: scan
  changed_when: false
  failed_when: scan.rc not in [0, 2]   # 0 = compliant, 2 = findings; anything else is a scanner error
  tags: [scan, never]
- name: Fetch report to control node
  ansible.builtin.fetch:
    src: "/var/log/openscap/{{ inventory_hostname }}-report.html"
    dest: "evidence/{{ ansible_date_time.date }}/"
    flat: false
  tags: [scan, never]
- name: Pull machine-readable XML for diffing
  ansible.builtin.fetch:
    src: "/var/log/openscap/{{ inventory_hostname }}-results.xml"
    dest: "evidence/{{ ansible_date_time.date }}/"
    flat: false
  tags: [scan, never]
Critical detail: oscap exits 0 when the host is fully compliant, 2 when at least one rule fails, and 1 when the scan itself errors. Do not let Ansible interpret 2 as failure: set failed_when: scan.rc not in [0, 2]. The scan is allowed to find issues; that is its job. The play should fail only when the binary itself crashes, which a blanket failed_when: false would hide.
The reports go to evidence/$DATE/ on the control node. Wire a job in AAP to commit those to a separate kv-evidence Git repo, never the main Ansible repo. Auditors get read-only access to that repo and can clone for forensics.
Module 7 — Multi-standard overlay: STIG + CIS + PCI on the same host
Real environments rarely have one standard. A typical regulated SaaS company runs PCI-DSS for the payment subsystem and SOC 2 across the whole estate; a federal contractor runs STIG and FISMA. You handle this with profile variables, not duplicate roles.
# inventory/group_vars/payment_systems.yml
kv_compliance_profile: pci
kv_compliance_password_minlen: 12 # PCI DSS 4.0 req. 8.3.6 minimum
kv_compliance_password_max_days: 90
kv_compliance_account_lockout_attempts: 6
# inventory/group_vars/federal.yml
kv_compliance_profile: stig
kv_compliance_password_minlen: 15
kv_compliance_password_max_days: 60
kv_compliance_account_lockout_attempts: 3
# inventory/group_vars/all.yml — defaults
kv_compliance_profile: cis_level1
The same role applies to all three groups. Variable precedence resolves the difference. Where rules genuinely conflict (PCI 8.2.4 says 90 days max; STIG says 60 days max), pick the stricter and document it — auditors of both standards then accept the same evidence.
The role’s defaults/main.yml should reflect the strictest baseline (STIG), and overrides relax for less-strict standards. This is the right direction: it is much easier to audit-explain “we tightened beyond CIS” than “we loosened below STIG”.
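"Pick the stricter" depends on the control: for a minimum length the larger number is stricter, for a password lifetime or lockout threshold the smaller one is. A small sketch of that resolution, using the PCI and STIG values from the group_vars above; the STRICTER table and the shortened control names are illustrative assumptions, not something the role ships.

```python
# Direction of "stricter" per control (hypothetical table for illustration).
STRICTER = {
    "password_minlen": max,           # longer minimum is stricter
    "password_max_days": min,         # shorter lifetime is stricter
    "account_lockout_attempts": min,  # fewer attempts is stricter
}

def strictest(*profiles):
    """Merge profiles, keeping the stricter value of every known control."""
    merged = {}
    for control, pick in STRICTER.items():
        values = [p[control] for p in profiles if control in p]
        if values:
            merged[control] = pick(values)
    return merged

PCI = {"password_minlen": 12, "password_max_days": 90, "account_lockout_attempts": 6}
STIG = {"password_minlen": 15, "password_max_days": 60, "account_lockout_attempts": 3}

print(strictest(PCI, STIG))
```

For a host that sits in both groups, the merged result is what belongs in its host or group vars, with a comment naming both source requirements.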
Module 8 — CIS for Ubuntu and the Lockdown collection
For Ubuntu, the open-source community ships ansible-lockdown/UBUNTU22-CIS — a maintained role that follows the same shape we just built. You do not need to reinvent it; you fork it, pin it, and override variables.
# requirements.yml
roles:
  - name: kv.cis_ubuntu22
    src: https://github.com/ansible-lockdown/UBUNTU22-CIS
    version: 1.6.1
    scm: git
Then in your playbook:
- hosts: ubuntu_servers
  become: true
  roles:
    - role: kv.cis_ubuntu22
      vars:
        ubtu22cis_level_2: false        # we run Level 1
        ubtu22cis_rule_1_1_1_1: true    # cramfs disabled
        ubtu22cis_rule_5_2_2_2: true    # SSH PermitRoot no
        ubtu22cis_rule_3_5_1_4: false   # ufw disabled in our env (we use security groups)
Two reasons this works in production:
- The role is declarative: every CIS rule is a Boolean ubtu22cis_rule_X_Y_Z toggle, default true. You disable the ones that conflict with your environment by setting false and documenting why in a CHANGELOG.
- The role’s tags mirror CIS section IDs (section1, section3, level1, level2), so you can run partial reapplications.
For RHEL the equivalent is ansible-lockdown/RHEL9-STIG and RHEL9-CIS. Use them. You are not paid to write SSH hardening from scratch every quarter.
Module 9 — Windows compliance with microsoft.security baseline
Windows is the same pattern, executed differently. Microsoft publishes Security Baselines for Windows Server 2022 and Windows 11; DISA publishes a STIG for the same. The baselines arrive as GPOs, but Microsoft also ships them as Ansible roles via the microsoft.security collection (preview at time of writing) and through PowerShell DSC, which Ansible can drive via ansible.windows.win_dsc.
A working pattern:
- hosts: windows_servers
  tasks:
    - name: Apply Windows 2022 STIG via DSC
      ansible.windows.win_dsc:
        resource_name: PowerSTIG
        Version: "1.0.0"
        OsVersion: "2022"
        OsRole: "MS"
        StigVersion: "2.7"
        SkipRule:
          - V-254241   # we accept residual risk; see ticket KV-1294
          - V-254376
      tags: [windows, stig, cat2]
    - name: Local audit policy — overrides
      # win_audit_policy_system ships in the community.windows collection
      community.windows.win_audit_policy_system:
        subcategory: "{{ item.sub }}"
        audit_type: "{{ item.type }}"
      loop:
        - { sub: "Logon", type: "success and failure" }
        - { sub: "Account Lockout", type: "success and failure" }
        - { sub: "Process Creation", type: "success" }
      tags: [windows, audit, cat2]
PowerSTIG is an excellent project that compiles DISA’s XML into DSC resources; combined with Ansible’s win_dsc you get the same closed loop. Skip rules with SkipRule: — never silently — and always link the skip to a ticket number.
For scans, oscap-windows + scap-security-guide-windows give you the same OpenSCAP evidence pack. Run it under WinRM the same way you run RHEL OpenSCAP under SSH.
Module 10 — Wiring the workflow in AAP
The workflow that closes the loop has four jobs:
- compliance-harden — runs the role, only against hosts in the compliance_managed group, on a schedule (weekly, off-hours).
- compliance-scan — runs OpenSCAP and stores reports.
- compliance-evidence-publish — pushes reports to the evidence Git repo and tags the commit with the run ID.
- compliance-drift-alert — diffs today’s scan against yesterday’s; if rule failures increase, page the on-call.
# AAP Workflow definition (rendered):
start
├─> compliance-harden # runs always
│   └─> always ─> compliance-scan
│       └─> always ─> compliance-evidence-publish
│           └─> always ─> compliance-drift-alert
└─> on failure ─> page-oncall + freeze-deploy-flag
A few non-obvious details:
- compliance-scan always runs even if compliance-harden partially failed. You want evidence of the failure state, not silence.
- compliance-drift-alert is what tells you when a non-Ansible change broke compliance. Console-clickers exist. The drift-detector catches them within 24 hours.
- The freeze-deploy flag is a Boolean in AAP that other workflows check. If compliance is failing, your prod-deploy workflow skips itself and pages someone. Compliance protects the change pipeline from itself.
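The comparison inside compliance-drift-alert can be as simple as a set difference over the failed rules in two results files. The sketch below assumes XCCDF 1.2 results as written by oscap xccdf eval --results, where each rule-result element has an idref attribute and a result child; the helper _sample only builds toy input and is not part of any tool.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://checklists.nist.gov/xccdf/1.2"}  # assumption: XCCDF 1.2

def failed_rules(results_xml):
    """Set of rule IDs whose result is 'fail' in one results document."""
    root = ET.fromstring(results_xml)
    failed = set()
    for rr in root.iterfind(".//x:rule-result", NS):
        if rr.findtext("x:result", default="", namespaces=NS) == "fail":
            failed.add(rr.get("idref"))
    return failed

def new_failures(yesterday_xml, today_xml):
    """Rules failing today that were not failing yesterday."""
    return sorted(failed_rules(today_xml) - failed_rules(yesterday_xml))

def _sample(results):
    """Build a toy results document from {rule_id: result}."""
    rows = "".join(
        f'<rule-result idref="{rid}"><result>{res}</result></rule-result>'
        for rid, res in results.items()
    )
    return (
        '<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.2">'
        f'<TestResult id="demo">{rows}</TestResult></Benchmark>'
    )

AIDE = "xccdf_org.ssgproject.content_rule_package_aide_installed"
SELINUX = "xccdf_org.ssgproject.content_rule_selinux_state"
YESTERDAY = _sample({AIDE: "fail", SELINUX: "pass"})
TODAY = _sample({AIDE: "fail", SELINUX: "fail"})

print(new_failures(YESTERDAY, TODAY))
```

Alerting on newly failing rules rather than on the raw failure count avoids paging for the accepted exceptions that fail every day.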
Module 11 — Evidence engineering
Auditors do not want logs. They want answers, with proof. The evidence package per quarterly audit should contain:
evidence/Q2-2026/
├── README.md                  # what's in this pack, who produced it, when
├── policies/
│   ├── stig-V2R3.zip          # the policy document version we encoded
│   └── cis-rhel9-1.0.0.pdf
├── code/
│   └── kv-compliance-rhel9-v3.4.0.tar.gz   # exact tagged version of role applied
├── runs/
│   └── 2026-04-15-harden-run.json          # AAP job result (per host pass/fail)
├── scans/
│   ├── 2026-04-16-rhel9-scan-results-cluster1.html   # report.html per host
│   ├── 2026-04-16-rhel9-scan-results-cluster1.xml
│   └── ...
├── exceptions/
│   ├── kv-1294-V-254241.md    # signed exception with risk acceptance
│   └── kv-1297-V-230502.md
└── attestation.pdf            # signed by CISO + sysadmin lead
This is shipped as a single tarball with a SHA-256 hash recorded in your ticketing system. It is the same shape every quarter. Building a script that produces it from your AAP run history takes one afternoon and saves you weeks of audit prep forever.
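The packaging step itself is small. A minimal sketch, using only the standard library, that tars a pack directory and returns the SHA-256 to record in the ticket; the function name pack_evidence is illustrative, and the demo runs against a throwaway directory rather than real evidence.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def pack_evidence(evidence_dir, out_path):
    """Tar+gzip evidence_dir into out_path and return the SHA-256 hex digest."""
    evidence_dir = Path(evidence_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(evidence_dir, arcname=evidence_dir.name)
    digest = hashlib.sha256()
    with open(out_path, "rb") as fh:
        # Hash in 1 MiB chunks so large packs do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo against a temporary directory shaped like the tree above.
with tempfile.TemporaryDirectory() as tmp:
    pack_dir = Path(tmp) / "Q2-2026"
    (pack_dir / "scans").mkdir(parents=True)
    (pack_dir / "README.md").write_text("evidence pack demo\n")
    tarball = Path(tmp) / "Q2-2026.tar.gz"   # written outside pack_dir on purpose
    sha256 = pack_evidence(pack_dir, tarball)
    with tarfile.open(tarball) as tar:
        members = sorted(tar.getnames())
    print(sha256, members)
```

The digest identifies this particular tarball; gzip embeds a timestamp, so rebuilding the pack later yields a different hash even when the contents are unchanged. Record the hash once, at publication time.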
Module 12 — The exception process (where most teams fail)
Every compliance program fails not on the encoded controls but on exceptions. STIG V-230502 says “remove nfs-utils”; you cannot, because your storage is NFS. CIS 6.1.10 says “remove tcpdump”; you cannot, because you debug network issues with it. PCI 6.4.5 says “every change requires CAB approval”; AAP runs hundreds of changes a day.
You handle this with a structured exception:
# KV-EXCEPTION-1294
- **Rule**: STIG V-230502 — `nfs-utils` package
- **Standard**: DISA STIG RHEL 9 V2R3
- **Decision**: Accept residual risk
- **Reason**: Production storage for `prod-app-cluster` mounts NFS from on-prem NetApp. Removing `nfs-utils` would break business-critical workflows.
- **Compensating controls**:
  - NFSv4 with sec=krb5p (Kerberos integrity + privacy)
  - dedicated VLAN, no internet egress
  - Network monitoring via Suricata IDS, alerting on non-NetApp traffic
  - Quarterly NFS audit — evidence/Q*/scans/nfs-traffic-anomalies.csv
- **Approved by**: J. Smith (CISO), 2026-03-01
- **Reviewed by**: P. Jones (sysadmin lead), 2026-03-01
- **Expiry**: 2026-09-01 (re-review)
- **Linked tasks**: roles/kv.compliance_rhel9/vars/main.yml — kv_compliance_remove_nfs_utils: false
Two things make this work:
- The exception lives next to the code that implements it. When the operator looks at kv_compliance_remove_nfs_utils: false, the next line in the comment is the exception number.
- The exception expires. Six months is the default. At expiry, the exception is automatically re-reviewed; if the compensating controls still hold, it renews. If they do not, the exception is closed and the rule re-asserted.
Auditors will read these. Make them readable. Use the same template every time.
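Because every exception uses the same template, expiry can be checked by a script rather than by memory. A minimal sketch, assuming the "**Expiry**: YYYY-MM-DD" line from the template above; the function names and the 30-day warning window are illustrative choices.

```python
import re
from datetime import date

EXPIRY = re.compile(r"\*\*Expiry\*\*:\s*(\d{4}-\d{2}-\d{2})")

def expiry_of(markdown_text):
    """Return the expiry date of an exception file, or None if it has none."""
    m = EXPIRY.search(markdown_text)
    return date.fromisoformat(m.group(1)) if m else None

def needs_review(markdown_text, today, warn_days=30):
    """True if the exception is expired, expiring soon, or has no expiry at all."""
    expiry = expiry_of(markdown_text)
    if expiry is None:
        return True  # an exception without an expiry is itself a finding
    return (expiry - today).days <= warn_days

SAMPLE = """# KV-EXCEPTION-1294
- **Rule**: STIG V-230502
- **Expiry**: 2026-09-01 (re-review)
"""

print(expiry_of(SAMPLE), needs_review(SAMPLE, date(2026, 8, 15)))
```

Run over the exceptions/ directory in a scheduled job, this turns "the exception expires" from a convention into something that opens a ticket.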
Module 13 — Ten frequently asked questions
Q1. We have 5,000 RHEL hosts. Won’t running oscap on each take hours?
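A1. About 8–12 minutes per host on modern hardware, so wall-clock time is set by concurrency rather than by oscap itself. At forks=200 with no serial throttle, 5,000 hosts is 25 waves of roughly 10 minutes, a little over four hours; finishing in under an hour needs about 1,000 hosts scanning at once, which in AAP means slicing the job across several execution nodes. Avoid a small serial step here: serial: 5 would turn the same scan into 1,000 sequential batches. The other common bottleneck is the --fetch-remote-resources step: pre-mirror the remote content and point oscap at it with --local-files (available in recent OpenSCAP 1.3.x releases), or drop the flag if your profile does not need remote OVAL.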
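The fleet-wide scan time is back-of-envelope arithmetic worth doing before promising a window. The model below assumes every host takes the same time and that hosts run in waves bounded by the effective concurrency (the smaller of forks and any serial batch size); the function name is illustrative.

```python
import math

def fleet_scan_minutes(hosts, minutes_per_host, concurrency):
    """Wall-clock minutes when hosts are scanned in equal-length waves."""
    waves = math.ceil(hosts / concurrency)
    return waves * minutes_per_host

print(fleet_scan_minutes(5000, 10, 200))   # forks=200, one controller
print(fleet_scan_minutes(5000, 10, 1000))  # sliced across execution nodes
print(fleet_scan_minutes(5000, 10, 5))     # serial: 5
```

With 10 minutes per host this gives 250, 50, and 10,000 minutes respectively, which is why concurrency, not scanner speed, decides whether the scan fits in a night.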
Q2. What about regulatory frameworks that aren’t OS-level (PCI 4.0, HIPAA Security Rule, SOC 2)?
A2. They map onto OS, network, and application controls. SOC 2 CC6.1 — “logical access” — is your SSH/PAM hardening + IAM. PCI 6.5 — “secure development” — is your linting + Molecule + image scanning. There is no “SOC 2 oscap profile”; instead, your compliance role implements the OS controls, your CI implements the application controls, and you maintain a control mapping spreadsheet that points each framework requirement at the role tag, CI pipeline step, or runbook that satisfies it.
Q3. The DISA STIG releases a new version (V2R4 → V2R5). What changes?
A3. Maybe 2–10 rules per quarter — added, removed, severity-changed. The pattern: subscribe to the DISA RSS feed, diff the rule IDs and severities in the new XCCDF against the previous release, and produce a delta-PR against your role. Keep your role version-pinned to the STIG version (role v3.4.0 ↔ STIG V2R3), bump only when you’ve absorbed the diff. Never run with a STIG version newer than your role.
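The release-to-release delta can be computed from the two benchmark files directly. A minimal sketch, assuming XCCDF 1.2 content in which each Rule element has id and severity attributes (content published in the XCCDF 1.1 namespace needs the NS value changed); the rule IDs in the sample are placeholders, not real STIG rules.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://checklists.nist.gov/xccdf/1.2"}  # assumption: XCCDF 1.2

def rules(benchmark_xml):
    """Map rule ID to severity for every Rule in a benchmark document."""
    root = ET.fromstring(benchmark_xml)
    return {r.get("id"): r.get("severity") for r in root.iterfind(".//x:Rule", NS)}

def delta(old_xml, new_xml):
    """Rules added, removed, or re-rated between two benchmark releases."""
    old, new = rules(old_xml), rules(new_xml)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "severity_changed": sorted(
            r for r in old.keys() & new.keys() if old[r] != new[r]
        ),
    }

# Placeholder benchmarks; real files nest Rule elements inside Group elements.
OLD = (
    '<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.2">'
    '<Rule id="rule_a" severity="medium"/><Rule id="rule_b" severity="low"/>'
    "</Benchmark>"
)
NEW = (
    '<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.2">'
    '<Rule id="rule_a" severity="high"/><Rule id="rule_c" severity="medium"/>'
    "</Benchmark>"
)

print(delta(OLD, NEW))
```

The three lists map directly onto the work in the delta-PR: new tasks to write, tasks to retire, and severity tags to change.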
Q4. Is --check mode safe to run for compliance scans?
A4. Yes for Ansible-side validation — it tells you what would change. But --check is not equivalent to an OpenSCAP scan. SCAP checks the actual file contents on disk; Ansible --check only knows about what tasks would do. Run both: --check weekly for fast sanity, OpenSCAP nightly for legal evidence.
Q5. We want to use the same role for production and dev. How do we handle “this is just a dev box, don’t break ssh”?
A5. Don’t relax controls based on environment — relax based on audience. Your dev hosts are still subject to compliance if a developer with access to production code logs in. The right answer is: same role, same defaults, but dev hosts allow extra users via kv_compliance_ssh_allowed_users, and dev hosts get a longer client_alive_interval so dev sessions don’t time out during long debug sessions. The compliance posture stays identical; only the operational comfort relaxes.
Q6. What about agent-based compliance tools (Wazuh, Tenable, Qualys)?
A6. Use them for continuous monitoring (real-time file integrity, log analysis), not for quarterly compliance. They live alongside, not instead of, the OpenSCAP + Ansible loop. Your evidence packages quote OpenSCAP; your incident response cites Wazuh.
Q7. Why not just buy a vendor compliance tool?
A7. You can. Vendor tools (Tenable.sc, Qualys Policy Compliance, Rapid7 InsightVM) wrap OpenSCAP + their own checks. They produce nice dashboards. They do not produce code your operators can edit. The combination — vendor tool for dashboards/alerts, Ansible+OpenSCAP for code-of-record — is what mature programs run. Buy the dashboard; never let the dashboard own your remediation.
Q8. How do we handle classified or air-gapped environments?
A8. Mirror SSG content locally (scap-security-guide is a single RPM), build a private collection in your Automation Hub, and have AAP execution nodes inside the air-gap. The pattern is identical; the network is the only thing that changes. We will cover the air-gap topology in the Air-Gapped Ansible Operations lesson.
Q9. The role I write conflicts with the upstream RHEL9-STIG role. Who wins?
A9. You do, but only if you fork. Never run two compliance roles in the same play — the second one will undo the first one’s idempotent state. Fork upstream, audit the diff, pin to a specific tag, and own it like any other dependency.
Q10. What is the single compliance metric that actually matters?
A10. Time-to-recovery from drift. Not pass-rate, not coverage, not number of rules encoded. If a sysadmin disables SELinux at 09:00 and the next compliance run at 21:00 re-enforces it, your TTR is 12 hours. If you catch and remediate within 30 minutes (via EDA), it’s 30 minutes. Auditors care; attackers care; engineers should care. Optimise for that.
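TTR is easy to compute once drift and remediation timestamps are recorded side by side. A small sketch using the SELinux example from the answer above; the event list and function names are illustrative, and where the timestamps come from (scan history, AAP job results) is left as an assumption.

```python
from datetime import datetime

def ttr_minutes(drift_at, remediated_at):
    """Minutes between drift being introduced and desired state being restored."""
    return (remediated_at - drift_at).total_seconds() / 60

def worst_ttr(events):
    """Largest TTR in minutes over (drift_at, remediated_at) pairs."""
    return max(ttr_minutes(d, r) for d, r in events)

drift = datetime(2026, 4, 16, 9, 0)
scheduled_run = datetime(2026, 4, 16, 21, 0)   # next twice-daily run
event_driven = datetime(2026, 4, 16, 9, 30)    # remediation triggered by an event

print(ttr_minutes(drift, scheduled_run))
print(ttr_minutes(drift, event_driven))
```

Tracking the worst case per quarter, rather than the average, is usually the figure that matters in an audit conversation.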
Lab — harden a RHEL 9 lab host end-to-end
# 1. Spin a free RHEL 9 box (Vagrant, AWS Free Tier, or Hetzner/DigitalOcean)
vagrant init generic/rhel9 && vagrant up
# 2. Pull SSG and scanner — pre-flight check
ssh vagrant@10.0.0.10 "sudo dnf -y install scap-security-guide openscap-scanner"
# 3. Run the role
git clone https://github.com/ansible-lockdown/RHEL9-STIG kv.rhel9_stig
ansible-playbook -i inventory.yml site.yml \
-e rhel9stig_disruption_high=false \
--tags cat1,cat2
# 4. Scan
ansible -i inventory.yml all -m ansible.builtin.shell -a \
'oscap xccdf eval --profile xccdf_org.ssgproject.content_profile_stig \
--report /tmp/report.html /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml; \
echo $?'
# 5. Fetch evidence
ansible -i inventory.yml all -m ansible.builtin.fetch -a \
'src=/tmp/report.html dest=evidence/'
Expected outcome: first run reports 80–120 failures (depending on the base image), second run after the role applies reports 5–20 (the legitimate exceptions for your environment). Document each remaining failure as an exception or a follow-up ticket. A green scan is a sign that your exceptions are accurate, not that nothing is wrong.
Glossary
- STIG (Security Technical Implementation Guide) — DoD-published configuration standard, machine-readable as XCCDF/OVAL XML.
- CIS Benchmark — Center for Internet Security configuration standard; commercial sector default.
- XCCDF — Extensible Configuration Checklist Description Format. The XML schema for compliance rules.
- OVAL — Open Vulnerability and Assessment Language. The XML schema for the checks that prove a rule.
- OpenSCAP / oscap — Open-source scanner that consumes XCCDF+OVAL and emits reports.
- SSG (SCAP Security Guide) — The upstream open-source project that ships free SCAP content for RHEL/Fedora/Ubuntu.
- Profile — A named subset of rules within a content pack (e.g. xccdf_org.ssgproject.content_profile_stig).
- CAT I/II/III — STIG severity category (I = high, II = medium, III = low).
- Level 1/2 — CIS profile level (1 = baseline, 2 = stricter, defence in depth).
- Drift — divergence between desired (Ansible-managed) state and observed state (what’s actually on the host).
- Exception — a documented, time-bound, signed-off departure from a rule.
- Evidence pack — the audit deliverable: code version + run logs + scan reports + exceptions.
- TTR — Time To Remediation; how fast you re-converge after drift.
Cert mapping
This lesson supports EX415 (Red Hat Certified Specialist in Security: Linux), EX319 (Security Hardening with Red Hat Identity Management), and the security domains of EX374 (RHCEAA — Ansible Automation Platform).
For DoD environments, the lesson aligns with DISA STIG V2R3 for RHEL 9 (released 2024-04-15). For commercial environments, CIS Benchmark v1.0.0 for RHEL 9 (released 2023-11-30). For Windows, CIS Microsoft Windows Server 2022 v2.0.0 and DISA STIG V1R5.
Real-world adjacent material: SOC 2 CC6 (logical access), PCI-DSS 4.0 sections 2 (configuration), 6 (secure development), 10 (logging), and 11 (vulnerability management); HIPAA Security Rule §164.308 (administrative safeguards) and §164.312 (technical safeguards).
Next steps
The next lesson is Disaster Recovery Automation with Ansible — how to encode RPO/RTO targets, dual-site replication, and failover runbooks as code. Compliance hardens hosts; DR keeps them alive. Together they are what separates a regulated production stack from a science project.