Ansible Lesson 33 of 42

Ansible for Security Compliance, In Depth: STIG, CIS Benchmarks, OpenSCAP & Policy-as-Code

Compliance work is the part of operations that breaks otherwise good engineers. The frameworks themselves are not the problem; the problem is that they are written for humans, in PDFs, with thousands of pages of conditional rules, and the auditor expects you to prove every rule on every host on a quarterly cadence. That is impossible with checklists. It is straightforward with Ansible, OpenSCAP, and a little discipline.

This lesson is the specialist-tier playbook for operationalising security compliance: how to translate DISA STIG content, the CIS Benchmarks, and PCI/HIPAA/SOC 2 control families into Ansible roles that harden, scan, remediate, and re-scan automatically — and how to produce the evidence packages your auditors actually accept.

We will cover RHEL 9 (the canonical STIG target), Ubuntu 22.04 LTS (the canonical CIS target), and Windows Server 2022 (which has both DISA STIG content and a Microsoft baseline). We will not cover the dozen vendor “compliance scanners” that pretend Ansible does not exist; we will use the open-source toolchain (openscap, ansible-hardening, ansible-lockdown) that the regulators themselves consume.

Position in the curriculum. This lesson assumes you are comfortable with everything from Tier 1 through Tier 4 — roles, dynamic inventory, AAP workflows, vault, no_log discipline, and at least one cloud collection. Compliance work touches every layer, so we will reference earlier lessons frequently.


What “compliance” really means in the Ansible context

A common mistake is to treat compliance as a one-time hardening pass. It is not. Compliance is a control loop, and Ansible is the actuator. The loop has four moving parts:

  1. Policy — the authoritative document. STIG, CIS, PCI-DSS, HIPAA Security Rule, SOC 2 CC6.x. You do not edit these; you encode them.
  2. Configuration — the desired state of every host, derived from the policy. This lives in your Ansible roles and host/group vars.
  3. Drift detection — the scan. Run OpenSCAP (oscap) or vendor tooling against each host on a schedule and compare actual to desired.
  4. Remediation — the playbook that reasserts desired state when drift is found. This is just ansible-playbook with the right roles loaded; the same code that originally hardened the host re-hardens it.

The whole loop is the deliverable. Auditors do not ask “is this host compliant today?” They ask “show me how you know it is compliant every day, who can break it, and how fast you re-converge.” The answer is your AAP workflow, your scan reports, your RBAC, and your remediation history — not a PDF.


Module 1 — The standards landscape

Before any code, you need a working mental model of the documents you are encoding.

Standard | Maintainer | Format | Coverage | Best for
DISA STIG | US Department of Defense | XCCDF + OVAL XML, plus .zip of supporting docs | RHEL 7/8/9, Ubuntu 20/22, Win 2019/2022, AIX, Solaris, networking gear | Federal/DoD, FedRAMP High, regulated contractors
CIS Benchmarks | Center for Internet Security | PDF + XCCDF (for members) | Hundreds of products: every major OS, K8s, AWS/Azure/GCP, M365, browsers | Commercial sector, ISO 27001, PCI-DSS
ANSSI BP-028 | French ANSSI | PDF + XCCDF | RHEL/Debian | EU public sector
Microsoft Security Baseline | Microsoft | GPO + XML + scripts | Windows Server / Client / M365 | Pure Microsoft estates
OpenSCAP / SCAP Security Guide (ssg) | upstream open source | XCCDF + OVAL + ansible playbook generator | RHEL/CentOS/Fedora/Ubuntu/Debian | Anyone; this is the engine that runs the others

You will use two layers of tooling no matter which standard you pick: a scanner that evaluates each host against the profile, and Ansible roles that enforce the configuration.

OpenSCAP is the single piece that ties the world together. RHEL ships it in the base repos (dnf install scap-security-guide openscap-scanner). Ubuntu ships it as ssg-base, ssg-debderived, and libopenscap8. Windows uses OVAL directly through PowerShell DSC or oscap-windows.

The canonical artefact pair

Every compliance project produces two artefacts, and the auditor wants both:

  1. The remediation playbook — Ansible code that brings the host into compliance. You either write it, or generate it from the SCAP profile with oscap xccdf generate fix --fix-type ansible.
  2. The scan report — the HTML/XML output of oscap xccdf eval showing every rule and pass/fail. This is the evidence.

These two artefacts must reference the same profile ID. If they don’t, the auditor will reject the package. The profile ID is the key — for example xccdf_org.ssgproject.content_profile_stig is the RHEL 9 STIG profile shipped by scap-security-guide.
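
One way to keep both artefacts on the same profile is to define the profile ID once and feed it to both commands. The following is a minimal sketch, not a finished playbook: the host group rhel9, the variable names kv_compliance_scap_profile and kv_compliance_scap_datastream, and the /tmp output paths are all assumptions made for illustration.

```yaml
# Sketch only: group name, variable names and /tmp paths are hypothetical.
- name: Generate and scan from a single profile ID
  hosts: rhel9
  become: true
  vars:
    kv_compliance_scap_profile: xccdf_org.ssgproject.content_profile_stig
    kv_compliance_scap_datastream: /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
  tasks:
    - name: Generate the remediation playbook for the profile
      ansible.builtin.command:
        argv:
          - oscap
          - xccdf
          - generate
          - fix
          - --fix-type
          - ansible
          - --profile
          - "{{ kv_compliance_scap_profile }}"
          - --output
          - /tmp/remediation.yml
          - "{{ kv_compliance_scap_datastream }}"
      changed_when: true

    - name: Scan against the same profile
      ansible.builtin.command:
        argv:
          - oscap
          - xccdf
          - eval
          - --profile
          - "{{ kv_compliance_scap_profile }}"
          - --results
          - /tmp/results.xml
          - --report
          - /tmp/report.html
          - "{{ kv_compliance_scap_datastream }}"
      register: scan
      changed_when: false
      # 0 = fully compliant, 2 = findings; anything else is a scanner error
      failed_when: scan.rc not in [0, 2]
```

Because both tasks read the same variable, a profile change is a one-line diff that the auditor can see in the commit history.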

What Ansible owns vs. what OpenSCAP owns

Keep this division of labour clear or you will fight your tooling: Ansible owns the desired state and the remediation, and OpenSCAP owns the scan and the report.

Why hand-curate instead of oscap-generate? Because compliance rules conflict with operational reality. STIG V-230229 says “audit daemon must drop logs to disk every 50 events”; your Splunk forwarder needs audispd running. STIG V-230502 says “remove nfs-utils”; your shared storage runs over NFS. The generated playbook will happily delete nfs-utils and break production. You write the role; OpenSCAP only reports.
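
In a hand-written role, a destructive control like this sits behind a toggle so that an approved exception is a variable override rather than a code change. A minimal sketch, assuming the toggle is called kv_compliance_remove_nfs_utils and that the task lives in 10_packages.yml (the file placement and the default of true are illustrative choices):

```yaml
# roles/kv.compliance_rhel9/tasks/10_packages.yml (illustrative excerpt)
- name: Remove nfs-utils unless an exception is on file
  ansible.builtin.dnf:
    name: nfs-utils
    state: absent
  # Defaults to removal; hosts with an approved exception set the toggle to false.
  when: kv_compliance_remove_nfs_utils | default(true) | bool
  tags: [packages, stig]

# Hypothetical override for the NFS-backed hosts, e.g. in their group_vars file:
#   kv_compliance_remove_nfs_utils: false   # see the signed exception record
```

The scan will still report the rule as failed on those hosts, which is the intended behaviour: the exception explains the finding, it does not hide it.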


Module 2 — Building a hardening role: the structural pattern

This is the canonical layout I use for every compliance role I have shipped in the last seven years. It scales from a single host to ten thousand and from one standard to a dozen overlapping ones.

roles/
└── kv.compliance_rhel9/
    ├── defaults/
    │   └── main.yml          # all togglable controls, on by default
    ├── handlers/
    │   └── main.yml          # service reloads, sysctl reapply
    ├── meta/
    │   └── main.yml          # min ansible_version, dependencies
    ├── tasks/
    │   ├── main.yml          # dispatcher — imports per-section files
    │   ├── 01_account.yml    # account & password policy (STIG V-230x)
    │   ├── 02_audit.yml      # auditd & audit rules (STIG V-230x)
    │   ├── 03_filesystem.yml # mount options, permissions
    │   ├── 04_kernel.yml     # sysctl, kernel modules
    │   ├── 05_logging.yml    # rsyslog, journald, log retention
    │   ├── 06_pam.yml        # PAM password quality, faillock
    │   ├── 07_ssh.yml        # sshd_config hardening
    │   ├── 08_firewall.yml   # firewalld zones & rich rules
    │   ├── 09_selinux.yml    # SELinux mode + booleans
    │   └── 10_packages.yml   # remove banned, install required
    ├── templates/
    │   ├── auditd.conf.j2
    │   ├── sshd_config.j2
    │   └── pwquality.conf.j2
    └── vars/
        └── main.yml          # rule IDs, control mappings

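The pattern in tasks/main.yml is just a dispatcher with explicit imports. Never use dynamic include_tasks here: import_tasks is resolved at parse time, so the tags on each import are inherited by every task in the imported file, and --list-tasks and --check see the complete, fixed task order: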

# roles/kv.compliance_rhel9/tasks/main.yml
- name: Section 01  Account and password policy
  ansible.builtin.import_tasks: 01_account.yml
  tags: [account, stig, cat2]

- name: Section 02  Audit subsystem
  ansible.builtin.import_tasks: 02_audit.yml
  tags: [audit, stig, cat1]

- name: Section 03  Filesystem hardening
  ansible.builtin.import_tasks: 03_filesystem.yml
  tags: [filesystem, stig, cat2]

# ...etc

Tagging convention

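Tag every task with three things, so the auditor and the operator can both navigate: the control family (account, audit, ssh), the standard (stig, cis), and the severity category (cat1, cat2, cat3). Tasks that implement a single rule also carry the rule ID as a tag.

Note that --tags selects the union of the listed tags, so --tags stig,cat1 runs every task tagged stig or cat1. To apply only the highest-severity controls, run --tags cat1, adding --skip-tags cis if you want to leave CIS-only tasks out. That is exactly the dance you do during break-fix windows when you have 30 minutes and need to fix the bleeding.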
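
Tag selection is easy to get wrong, so it is worth trying on a throwaway playbook before a real break-fix window. The sketch below uses three debug tasks with made-up names; the file name demo.yml is an assumption.

```yaml
# demo.yml: three placeholder tasks carrying the family / standard / severity tags
- name: Tag selection demo
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Task A (audit, STIG, CAT I)
      ansible.builtin.debug:
        msg: "A"
      tags: [audit, stig, cat1]

    - name: Task B (account, STIG, CAT II)
      ansible.builtin.debug:
        msg: "B"
      tags: [account, stig, cat2]

    - name: Task C (ssh, CIS, CAT I)
      ansible.builtin.debug:
        msg: "C"
      tags: [ssh, cis, cat1]

# --tags is a union, --skip-tags wins over --tags:
#   ansible-playbook demo.yml --tags stig,cat1             runs A, B and C
#   ansible-playbook demo.yml --tags cat1 --skip-tags cis  runs A only
```

Running with --list-tasks instead of executing shows the same selection without touching any host.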

Defaults file: the togglable surface

Every control must be a variable so operators can override per host or group. Defaults live in defaults/main.yml and are on for compliance by default:

# roles/kv.compliance_rhel9/defaults/main.yml

# --- Section 01: account & password ---
kv_compliance_password_minlen: 15           # STIG V-230367
kv_compliance_password_max_days: 60         # STIG V-230367
kv_compliance_password_remember: 5          # STIG V-230368
kv_compliance_account_lockout_attempts: 3   # STIG V-230333
kv_compliance_account_lockout_unlock: 0     # 0 = require admin
kv_compliance_password_quality_dcredit: -1
kv_compliance_password_quality_ucredit: -1
kv_compliance_password_quality_lcredit: -1
kv_compliance_password_quality_ocredit: -1
kv_compliance_password_quality_minclass: 4

# --- Section 02: audit ---
kv_compliance_auditd_max_log_file_mb: 8     # STIG V-230398
kv_compliance_auditd_disk_full_action: "halt"  # CAT 1 says halt; we soften to syslog in non-prod
kv_compliance_auditd_space_left_mb: 100

# --- Section 07: ssh ---
kv_compliance_ssh_permit_root: false         # STIG V-230296
kv_compliance_ssh_password_auth: false       # CIS 5.2.10
kv_compliance_ssh_max_auth_tries: 3
kv_compliance_ssh_login_grace_time: 60
kv_compliance_ssh_client_alive_interval: 300
kv_compliance_ssh_client_alive_max: 3
kv_compliance_ssh_allowed_users: []          # group_vars overrides

# --- Section 09: SELinux ---
kv_compliance_selinux_state: "enforcing"
kv_compliance_selinux_policy: "targeted"

Two non-obvious points here:

  1. Every default is named after the standard’s variable convention (in this case ssg’s var_password_pam_minlen → our kv_compliance_password_minlen). This means the auditor can grep your repo for the rule and find your value in seconds.
  2. Every default has a comment with the rule ID. This is the single highest-leverage habit you can build. You will thank past-you a thousand times.
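
To show how the defaults are consumed, here is a minimal sketch of a task that writes the password-quality values into /etc/security/pwquality.conf. It uses lineinfile in a loop rather than the pwquality.conf.j2 template listed in the role layout, purely to keep the example short; the file placement in 06_pam.yml and the tag choice are assumptions.

```yaml
# roles/kv.compliance_rhel9/tasks/06_pam.yml (illustrative excerpt)
- name: Password quality settings in pwquality.conf
  ansible.builtin.lineinfile:
    path: /etc/security/pwquality.conf
    # Matches the setting whether it is active or still commented out
    regexp: '^\s*#?\s*{{ item.key }}\s*='
    line: "{{ item.key }} = {{ item.value }}"
    owner: root
    group: root
    mode: "0644"
  loop:
    - { key: minlen, value: "{{ kv_compliance_password_minlen }}" }
    - { key: dcredit, value: "{{ kv_compliance_password_quality_dcredit }}" }
    - { key: ucredit, value: "{{ kv_compliance_password_quality_ucredit }}" }
    - { key: lcredit, value: "{{ kv_compliance_password_quality_lcredit }}" }
    - { key: ocredit, value: "{{ kv_compliance_password_quality_ocredit }}" }
    - { key: minclass, value: "{{ kv_compliance_password_quality_minclass }}" }
  tags: [pam, stig, cat2]
```

Every value on disk traces back to one named default, so an override in group_vars changes the host without anyone editing the task.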

Module 3 — A real STIG control, end to end

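Let us encode one rule completely so the pattern sticks. We will pick STIG V-230230, “RHEL 9 must have the AIDE package installed”. AIDE is a file-integrity monitor, and its absence is a CAT II finding. One caution on identifiers: the V-230xxx numbers used in this lesson fall in the RHEL 8 STIG numbering range, while the SSG-generated task for the same rule is tagged RHEL-09-651010, so confirm every ID against the STIG release you are actually encoding.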

The rule, in plain English

The Advanced Intrusion Detection Environment (AIDE) package must be installed. AIDE provides file integrity monitoring; a host without it is a CAT II finding.

The Ansible task

# roles/kv.compliance_rhel9/tasks/02_audit.yml (excerpt)

- name: STIG V-230230  AIDE package installed
  ansible.builtin.dnf:
    name: aide
    state: present
  tags: [audit, stig, cat2, V-230230]

- name: STIG V-230229  AIDE database initialised once
  ansible.builtin.command: aide --init
  args:
    creates: /var/lib/aide/aide.db.new.gz
  tags: [audit, stig, cat2, V-230229]

- name: STIG V-230229  AIDE database moved into place
  ansible.builtin.copy:
    src: /var/lib/aide/aide.db.new.gz
    dest: /var/lib/aide/aide.db.gz
    remote_src: true
    owner: root
    group: root
    mode: "0600"
  tags: [audit, stig, cat2, V-230229]

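# Assumes an aide-check.timer unit is present on the host; some images only
# schedule AIDE through cron, so verify the unit exists before relying on this.
- name: STIG V-230229  AIDE timer enabled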
  ansible.builtin.systemd_service:
    name: aide-check.timer
    enabled: true
    state: started
  tags: [audit, stig, cat2, V-230229]

Notice the rule ID is in both the name and the tag list. This is non-negotiable. When the scanner reports xccdf_org.ssgproject.content_rule_package_aide_installed failed, you can grep for V-230230 and land on the exact tasks that should fix it.

The OpenSCAP equivalent

For comparison, here is what oscap xccdf generate fix --fix-type ansible emits for the same rule:

- name: 'Install aide Package'
  package:
    name: aide
    state: present
  tags:
    - CCE-83438-7
    - PCI-DSS-Req-11.5
    - NIST-800-53-CM-3
    - DISA-STIG-RHEL-09-651010
    - low_severity
    - low_complexity
    - low_disruption

It looks fine, but it is not complete for your environment: it does not initialise the database, it does not enable the timer, and it does not set the mode or owner of the database file. That is why you write the role yourself and use SCAP only to scan.

The handler (if required)

For rules that touch a service config, you wire a handler:

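# roles/kv.compliance_rhel9/handlers/main.yml
- name: Reload sshd
  ansible.builtin.systemd_service:
    name: sshd
    state: reloaded

# The name must match the notify: string used in 02_audit.yml.
- name: Reload audit rules
  ansible.builtin.command: augenrules --load
  changed_when: true
  notify: Restart auditd

# RHEL's auditd unit refuses a manual systemctl stop/restart,
# so the restart goes through the service wrapper instead.
- name: Restart auditd
  ansible.builtin.command: service auditd restart
  changed_when: true

Note that augenrules --load compiles the fragments in /etc/audit/rules.d/ and loads them into the kernel, while changes to auditd.conf need a daemon restart. On RHEL the auditd unit refuses a manual systemctl restart, so the restart has to go through the service command. This is one of the single most-broken patterns in copy-pasted STIG roles. Get it right.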


Module 4 — SSH hardening: the single most-violated control family

If you only do one section, do SSH. Auditors look at it first. Here is the canonical template — fully variable-driven, fully tagged, fully traceable to STIG and CIS:

# roles/kv.compliance_rhel9/templates/sshd_config.j2
# Managed by Ansible — do not edit by hand.
# Source: roles/kv.compliance_rhel9/templates/sshd_config.j2
# STIG profile: stig | CIS profile: cis

Port 22
AddressFamily inet
ListenAddress 0.0.0.0

# STIG V-230296 / CIS 5.2.8 — root login forbidden
PermitRootLogin {{ 'no' if not kv_compliance_ssh_permit_root else 'prohibit-password' }}

# STIG V-230282 / CIS 5.2.10
PasswordAuthentication {{ 'yes' if kv_compliance_ssh_password_auth else 'no' }}
PermitEmptyPasswords no

# STIG V-230288
PubkeyAuthentication yes
ChallengeResponseAuthentication no
KbdInteractiveAuthentication no

# STIG V-230285 / CIS 5.2.18
MaxAuthTries {{ kv_compliance_ssh_max_auth_tries }}
LoginGraceTime {{ kv_compliance_ssh_login_grace_time }}
MaxSessions 4

# STIG V-230381 — idle session termination
ClientAliveInterval {{ kv_compliance_ssh_client_alive_interval }}
ClientAliveCountMax {{ kv_compliance_ssh_client_alive_max }}

# STIG V-230332 — banner
Banner /etc/issue.net

# STIG V-230290: strong KEX/MAC/cipher only. curve25519 and ed25519 are not
# FIPS-approved, so remove them on hosts that run in FIPS mode.
KexAlgorithms     curve25519-sha256,curve25519-sha256@libssh.org,diffie-hellman-group16-sha512
MACs              hmac-sha2-512,hmac-sha2-256
Ciphers           aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes128-ctr
HostKeyAlgorithms ssh-ed25519,rsa-sha2-512,rsa-sha2-256

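# STIG V-230289: protocol 2 only. Current OpenSSH speaks only protocol 2 and
# treats the Protocol directive as deprecated, so no setting is needed.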

UseDNS no
UsePAM yes
X11Forwarding no
PrintMotd no
AcceptEnv LANG LC_*
Subsystem sftp /usr/libexec/openssh/sftp-server -f AUTHPRIV -l INFO

{% if kv_compliance_ssh_allowed_users %}
AllowUsers {{ kv_compliance_ssh_allowed_users | join(' ') }}
{% endif %}

The accompanying task:

# roles/kv.compliance_rhel9/tasks/07_ssh.yml

- name: STIG V-230282..230296  sshd_config baseline
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -T -f %s   # validate before swap
  notify: Reload sshd
  tags: [ssh, stig, cis, cat1]

- name: STIG V-230290  host key permissions
  ansible.builtin.file:
    path: "{{ item }}"
    owner: root
    group: ssh_keys
    mode: "0640"
  loop:
    - /etc/ssh/ssh_host_rsa_key
    - /etc/ssh/ssh_host_ed25519_key
    - /etc/ssh/ssh_host_ecdsa_key
  failed_when: false  # not all key types exist on every host
  tags: [ssh, stig, cat2]

- name: STIG V-230332  login banner
  ansible.builtin.copy:
    dest: /etc/issue.net
    content: |
      You are accessing a U.S. Government (USG) Information System (IS).
      [...full DoD banner content...]
    owner: root
    group: root
    mode: "0644"
  tags: [ssh, stig, cat3]

The validate: parameter is the magic that makes this safe. sshd -T -f %s parses the candidate config; if it fails, template aborts and the host keeps the working config. Without validate:, a typo in the template can lock you out of every host in your fleet at the same time. Always use validate on services you depend on for management access.


Module 5 — auditd rule files

Audit rules are the second-most-violated family. The pattern is to ship rule fragments under /etc/audit/rules.d/ and let augenrules --load compile them.

# roles/kv.compliance_rhel9/tasks/02_audit.yml (continued)

- name: STIG V-230398..230428  audit rules (drop-in fragments)
  ansible.builtin.copy:
    src: "rules.d/{{ item }}"
    dest: "/etc/audit/rules.d/{{ item }}"
    owner: root
    group: root
    mode: "0640"
  loop:
    - 10-base-config.rules
    - 30-stig.rules        # 200+ rule entries
    - 40-local.rules
    - 99-finalize.rules
  notify: Reload audit rules
  tags: [audit, stig, cat2]

- name: STIG V-230401  auditd max log file size
  community.general.ini_file:
    path: /etc/audit/auditd.conf
    section: null
    option: max_log_file
    value: "{{ kv_compliance_auditd_max_log_file_mb }}"
    no_extra_spaces: false   # keep the "key = value" spacing auditd.conf ships with
  notify: Restart auditd
  tags: [audit, stig, cat2]

A representative rule fragment (30-stig.rules — drop in files/rules.d/):

## STIG V-230402 — audit time-change syscalls
-a always,exit -F arch=b64 -S adjtimex,settimeofday,clock_settime -k time-change
-a always,exit -F arch=b32 -S adjtimex,settimeofday,clock_settime -k time-change

## STIG V-230410 — audit ownership changes
-a always,exit -F arch=b64 -S chown,fchown,fchownat,lchown -F auid>=1000 -F auid!=-1 -k perm_mod

## STIG V-230412 — audit DAC changes
-a always,exit -F arch=b64 -S chmod,fchmod,fchmodat -F auid>=1000 -F auid!=-1 -k perm_mod

## STIG V-230418 — audit failed unauthorised access
-a always,exit -F arch=b64 -S open,openat,creat,truncate,ftruncate -F exit=-EACCES -F auid>=1000 -F auid!=-1 -k access
-a always,exit -F arch=b64 -S open,openat,creat,truncate,ftruncate -F exit=-EPERM  -F auid>=1000 -F auid!=-1 -k access

These should match the SSG content line for line: grep -A2 audit_rules_time_change /usr/share/scap-security-guide/ansible/rhel9-playbook-stig.yml will show you the same rules. If the auditor compares yours to ssg, they should match exactly.


Module 6 — Driving OpenSCAP from Ansible

The scanning side closes the loop. The pattern is scan → fail open → publish report → never hide failures:

# roles/kv.compliance_rhel9/tasks/scan.yml — separate role/playbook

- name: Ensure scanner present
  ansible.builtin.dnf:
    name:
      - openscap-scanner
      - scap-security-guide
    state: present

- name: Ensure report directory exists
  ansible.builtin.file:
    path: /var/log/openscap
    state: directory
    owner: root
    group: root
    mode: "0750"
  tags: [scan, never]

- name: Run OpenSCAP scan
  ansible.builtin.command: >
    oscap xccdf eval
    --profile xccdf_org.ssgproject.content_profile_stig
    --results /var/log/openscap/{{ inventory_hostname }}-results.xml
    --report  /var/log/openscap/{{ inventory_hostname }}-report.html
    --fetch-remote-resources
    /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
  register: scan
  changed_when: false
  failed_when: scan.rc not in [0, 2]   # 2 = findings (expected); 1 = scanner error
  tags: [scan, never]

- name: Fetch report to control node
  ansible.builtin.fetch:
    src: "/var/log/openscap/{{ inventory_hostname }}-report.html"
    dest: "evidence/{{ ansible_date_time.date }}/"
    flat: false
  tags: [scan, never]

- name: Pull machine-readable XML for diffing
  ansible.builtin.fetch:
    src: "/var/log/openscap/{{ inventory_hostname }}-results.xml"
    dest: "evidence/{{ ansible_date_time.date }}/"
    flat: false
  tags: [scan, never]

Critical detail: oscap exits 2 when it finds rule failures, 0 when fully compliant, and 1 when the scanner itself errors. Do not let Ansible treat 2 as failure: set failed_when: scan.rc not in [0, 2]. The scan is allowed to find issues; that is its job. The play fails only when the binary itself errors out.

The reports go to evidence/$DATE/ on the control node. Wire a job in AAP to commit those to a separate kv-evidence Git repo, never the main Ansible repo. Auditors get read-only access to that repo and can clone for forensics.
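
The publish step can be sketched as a small local play. This is an illustration under stated assumptions: kv_evidence_repo is a hypothetical path to a local clone of the evidence repository that already contains the fetched reports, and the commit message format is made up.

```yaml
# Sketch only: kv_evidence_repo and the commit message are hypothetical.
- name: Publish scan evidence to the evidence repository
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kv_evidence_repo: /srv/kv-evidence
  tasks:
    - name: Stage new and changed reports
      ansible.builtin.command:
        cmd: git add --all
        chdir: "{{ kv_evidence_repo }}"
      changed_when: false

    - name: Check whether anything is staged
      ansible.builtin.command:
        cmd: git diff --cached --quiet
        chdir: "{{ kv_evidence_repo }}"
      register: kv_staged
      changed_when: false
      # git exits 0 when nothing is staged and 1 when there are staged changes
      failed_when: kv_staged.rc not in [0, 1]

    - name: Commit the evidence
      ansible.builtin.command:
        argv:
          - git
          - commit
          - -m
          - "Scan evidence {{ now(utc=true).strftime('%Y-%m-%d') }}"
        chdir: "{{ kv_evidence_repo }}"
      when: kv_staged.rc == 1

    - name: Push to the remote
      ansible.builtin.command:
        cmd: git push
        chdir: "{{ kv_evidence_repo }}"
      when: kv_staged.rc == 1
```

The staged-changes check keeps the job green on days when the scan output is unchanged, instead of failing on an empty commit.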


Module 7 — Multi-standard overlay: STIG + CIS + PCI on the same host

Real environments rarely have one standard. A typical regulated SaaS company runs PCI-DSS for the payment subsystem and SOC 2 across the whole estate; a federal contractor runs STIG and FISMA. You handle this with profile variables, not duplicate roles.

# inventory/group_vars/payment_systems.yml
kv_compliance_profile: pci
kv_compliance_password_minlen: 12       # PCI DSS v4.0 req. 8.3.6 minimum
kv_compliance_password_max_days: 90
kv_compliance_account_lockout_attempts: 6

# inventory/group_vars/federal.yml
kv_compliance_profile: stig
kv_compliance_password_minlen: 15
kv_compliance_password_max_days: 60
kv_compliance_account_lockout_attempts: 3

# inventory/group_vars/all.yml — defaults
kv_compliance_profile: cis_level1

The same role applies to all three groups. Variable precedence resolves the difference. Where rules genuinely conflict (PCI DSS v4.0 requirement 8.3.9 says 90 days max; STIG says 60 days max), pick the stricter and document it; auditors of both standards then accept the same evidence.

The role’s defaults/main.yml should reflect the strictest baseline (STIG), and overrides relax for less-strict standards. This is the right direction: it is much easier to audit-explain “we tightened beyond CIS” than “we loosened below STIG”.
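
A guard task at the top of the role can stop an override from relaxing a control further than the loosest standard you have agreed to accept. A minimal sketch, assuming hypothetical floor variables whose values mirror the PCI group_vars shown above:

```yaml
# Sketch only: the kv_compliance_floor_* names are hypothetical.
- name: Guard against overrides that loosen below the agreed floor
  ansible.builtin.assert:
    that:
      - kv_compliance_password_minlen | int >= kv_compliance_floor_password_minlen | int
      - kv_compliance_password_max_days | int <= kv_compliance_floor_password_max_days | int
      - kv_compliance_account_lockout_attempts | int <= kv_compliance_floor_lockout_attempts | int
    fail_msg: "An override on {{ inventory_hostname }} is looser than the agreed floor"
    quiet: true
  vars:
    kv_compliance_floor_password_minlen: 12
    kv_compliance_floor_password_max_days: 90
    kv_compliance_floor_lockout_attempts: 6
  tags: [always]
```

The always tag makes the check run even during a partial --tags run, so a bad override is caught before any control is applied.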


Module 8 — CIS for Ubuntu and the Lockdown collection

For Ubuntu, the open-source community ships ansible-lockdown/UBUNTU22-CIS — a maintained role that follows the same shape we just built. You do not need to reinvent it; you fork it, pin it, and override variables.

# requirements.yml
roles:
  - name: kv.cis_ubuntu22
    src: https://github.com/ansible-lockdown/UBUNTU22-CIS
    version: 1.6.1
    scm: git

Then in your playbook:

- hosts: ubuntu_servers
  become: true
  roles:
    - role: kv.cis_ubuntu22
      vars:
        ubtu22cis_level_2: false           # we run Level 1
        ubtu22cis_rule_1_1_1_1: true       # cramfs disabled
        ubtu22cis_rule_5_2_2_2: true       # SSH PermitRoot no
        ubtu22cis_rule_3_5_1_4: false      # ufw disabled in our env (we use security groups)

Two reasons this works in production:

  1. The role is declarative: every CIS rule is a Boolean ubtu22cis_rule_X_Y_Z toggle, default true. You disable the ones that conflict with your environment by setting false and documenting why in a CHANGELOG.
  2. The role’s tags mirror CIS section IDs (section1, section3, level1, level2), so you can run partial reapplications.

For RHEL the equivalent is ansible-lockdown/RHEL9-STIG and RHEL9-CIS. Use them. You are not paid to write SSH hardening from scratch every quarter.


Module 9 — Windows compliance with microsoft.security baseline

Windows is the same pattern, executed differently. Microsoft publishes Security Baselines for Windows Server 2022 and Windows 11; DISA publishes a STIG for the same. The baselines arrive as GPOs, but Microsoft also ships them as Ansible roles via the microsoft.security collection (preview at time of writing) and through PowerShell DSC, which Ansible can drive via ansible.windows.win_dsc.

A working pattern:

- hosts: windows_servers
  tasks:
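    # Illustrative only: check the DSC resource name that your PowerSTIG
    # version exposes, and that win_dsc can invoke it, before relying on this.
    - name: Apply Windows 2022 STIG via DSC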
      ansible.windows.win_dsc:
        resource_name: PowerSTIG
        Version: "1.0.0"
        OsVersion: "2022"
        OsRole: "MS"
        StigVersion: "2.7"
        SkipRule:
          - V-254241   # we accept residual risk; see ticket KV-1294
          - V-254376
      tags: [windows, stig, cat2]

    # win_audit_policy_system lives in community.windows;
    # audit_type takes a list of success / failure / none.
    - name: Local audit policy overrides
      community.windows.win_audit_policy_system:
        subcategory: "{{ item.sub }}"
        audit_type: "{{ item.type }}"
      loop:
        - { sub: "Logon", type: [success, failure] }
        - { sub: "Account Lockout", type: [success, failure] }
        - { sub: "Process Creation", type: [success] }
      tags: [windows, audit, cat2]

PowerSTIG is an excellent project that compiles DISA’s XML into DSC resources; combined with Ansible’s win_dsc you get the same closed loop. Skip rules with SkipRule: — never silently — and always link the skip to a ticket number.

For scans, oscap-windows + scap-security-guide-windows give you the same OpenSCAP evidence pack. Run it under WinRM the same way you run RHEL OpenSCAP under SSH.


Module 10 — Wiring the workflow in AAP

The workflow that closes the loop has four jobs:

  1. compliance-harden: runs the role, only against hosts in the compliance_managed group, on a schedule (weekly, off-hours).
  2. compliance-scan: runs OpenSCAP and stores reports.
  3. compliance-evidence-publish: pushes reports to the evidence Git repo and tags the commit with the run ID.
  4. compliance-drift-alert: diffs today’s scan against yesterday’s; if rule failures increase, page the on-call.

# AAP Workflow definition (rendered):

start
 └─> compliance-harden           # runs always
      ├─> on success ─> compliance-scan
      │                  └─> always ─> compliance-evidence-publish
      │                                 └─> always ─> compliance-drift-alert
      └─> on failure ─> page-oncall + freeze-deploy-flag
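
The drift-alert job can be as simple as counting failed rules in two results files and failing when the count goes up, so that the AAP failure notification does the paging. This is a rough sketch: the two file paths are placeholders, and it assumes the results XML marks each failed rule with a literal <result>fail</result> element, which is worth confirming against your own scan output.

```yaml
# Sketch only: paths are placeholders supplied by the workflow in practice.
- name: Compare failed-rule counts between two scans
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kv_results_today: /srv/kv-evidence/today/host1-results.xml
    kv_results_yesterday: /srv/kv-evidence/yesterday/host1-results.xml
    kv_fail_pattern: '<result>fail</result>'
  tasks:
    - name: Count failed rules in each results file
      ansible.builtin.set_fact:
        kv_fail_today: >-
          {{ lookup('ansible.builtin.file', kv_results_today)
             | regex_findall(kv_fail_pattern) | length }}
        kv_fail_yesterday: >-
          {{ lookup('ansible.builtin.file', kv_results_yesterday)
             | regex_findall(kv_fail_pattern) | length }}

    - name: Fail the job when the number of failed rules increased
      ansible.builtin.fail:
        msg: "Failed rules rose from {{ kv_fail_yesterday }} to {{ kv_fail_today }}"
      when: kv_fail_today | int > kv_fail_yesterday | int
```

A count comparison misses the case where one rule is fixed and another breaks on the same day; diffing the sets of failed rule IDs is the natural next step once the simple version is in place.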

A few non-obvious details:


Module 11 — Evidence engineering

Auditors do not want logs. They want answers, with proof. The evidence package per quarterly audit should contain:

evidence/Q2-2026/
├── README.md                       # what's in this pack, who produced it, when
├── policies/
│   ├── stig-V2R3.zip              # the policy document version we encoded
│   └── cis-rhel9-1.0.0.pdf
├── code/
│   └── kv-compliance-rhel9-v3.4.0.tar.gz   # exact tagged version of role applied
├── runs/
│   └── 2026-04-15-harden-run.json # AAP job result (per host pass/fail)
├── scans/
│   ├── 2026-04-16-rhel9-scan-results-cluster1.html  # report.html per host
│   ├── 2026-04-16-rhel9-scan-results-cluster1.xml
│   └── ...
├── exceptions/
│   ├── kv-1294-V-254241.md        # signed exception with risk acceptance
│   └── kv-1297-V-230502.md
└── attestation.pdf                # signed by CISO + sysadmin lead

This is shipped as a single tarball with a SHA-256 hash recorded in your ticketing system. It is the same shape every quarter. Building a script that produces it from your AAP run history takes one afternoon and saves you weeks of audit prep forever.
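
The packaging step can be sketched with two modules: one to build the tarball and one to compute the hash. The quarter name and the relative evidence/ paths below are assumptions taken from the layout above; adjust them to wherever your evidence tree lives.

```yaml
# Sketch only: kv_evidence_quarter and the relative paths are illustrative.
- name: Build the quarterly evidence tarball
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kv_evidence_quarter: Q2-2026
  tasks:
    - name: Archive the evidence directory
      community.general.archive:
        path: "evidence/{{ kv_evidence_quarter }}"
        dest: "evidence/{{ kv_evidence_quarter }}.tar.gz"
        format: gz

    - name: Compute the SHA-256 of the tarball
      ansible.builtin.stat:
        path: "evidence/{{ kv_evidence_quarter }}.tar.gz"
        checksum_algorithm: sha256
      register: kv_evidence_archive

    - name: Show the hash to record in the ticket
      ansible.builtin.debug:
        msg: "{{ kv_evidence_archive.stat.checksum }}"
```

Recording the hash outside the evidence repository is what lets an auditor confirm that the tarball they received is the one you produced.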


Module 12 — The exception process (where most teams fail)

Every compliance program fails not on the encoded controls but on exceptions. STIG V-230502 says “remove nfs-utils”; you cannot, because your storage is NFS. CIS 6.1.10 says “remove tcpdump”; you cannot, because you debug network issues with it. PCI 6.4.5 says “every change requires CAB approval”; AAP runs hundreds of changes a day.

You handle this with a structured exception:

# KV-EXCEPTION-1297

- **Rule**: STIG V-230502 — `nfs-utils` package
- **Standard**: DISA STIG RHEL 9 V2R3
- **Decision**: Accept residual risk
- **Reason**: Production storage for `prod-app-cluster` mounts NFS from on-prem NetApp. Removing `nfs-utils` would break business-critical workflows.
- **Compensating controls**:
  - NFSv4 with sec=krb5p (Kerberos integrity + privacy)
  - dedicated VLAN, no internet egress
  - Network monitoring via Suricata IDS, alerting on non-NetApp traffic
  - Quarterly NFS audit — evidence/Q*/scans/nfs-traffic-anomalies.csv
- **Approved by**: J. Smith (CISO), 2026-03-01
- **Reviewed by**: P. Jones (sysadmin lead), 2026-03-01
- **Expiry**: 2026-09-01 (re-review)
- **Linked tasks**: roles/kv.compliance_rhel9/vars/main.yml — kv_compliance_remove_nfs_utils: false

Two things make this work:

  1. The exception lives next to the code that implements it. When the operator looks at kv_compliance_remove_nfs_utils: false, the next line in the comment is the exception number.
  2. The exception expires. Six months is the default. At expiry, the exception is automatically re-reviewed; if the compensating controls still hold, it renews. If they do not, the exception is closed and the rule re-asserted.

Auditors will read these. Make them readable. Use the same template every time.
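
Expiry is easier to enforce when the exceptions are also listed as data that a scheduled job can check. A minimal sketch, assuming a hypothetical kv_compliance_exceptions list kept alongside the role; the single entry mirrors the nfs-utils exception used in this lesson.

```yaml
# Sketch only: kv_compliance_exceptions is a hypothetical registry variable.
- name: Fail when an exception has passed its expiry date
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kv_compliance_exceptions:
      - id: kv-1297
        rule: V-230502
        expiry: "2026-09-01"   # quoted so YAML keeps it a string
  tasks:
    - name: Check every exception against today's date
      ansible.builtin.assert:
        that:
          # ISO dates compare correctly as plain strings
          - item.expiry >= now(utc=true).strftime('%Y-%m-%d')
        fail_msg: "Exception {{ item.id }} for {{ item.rule }} expired on {{ item.expiry }}"
        quiet: true
      loop: "{{ kv_compliance_exceptions }}"
      loop_control:
        label: "{{ item.id }}"
```

Run on a schedule, a failure here is the trigger for the re-review rather than a calendar reminder someone has to remember.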


Module 13 — Ten frequently asked questions

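Q1. We have 5,000 RHEL hosts. Won’t running oscap on each take hours? A1. About 8–12 minutes per host on modern hardware, so concurrency decides the wall-clock time. With forks=200 on a single job, 5,000 hosts is 25 waves of roughly 10 minutes each, a little over four hours. Finishing in under an hour needs roughly 850 or more scans in flight at once, which in AAP means slicing the job across several execution nodes. Avoid a small serial value for scans: serial: 5 would cap concurrency at five hosts. The per-host bottleneck is usually the --fetch-remote-resources step; pre-mirror the remote content locally and either drop that flag or, on OpenSCAP releases that support it, point oscap at the mirror with --local-files.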

Q2. What about regulatory frameworks that aren’t OS-level (PCI 4.0, HIPAA Security Rule, SOC 2)? A2. They map onto OS, network, and application controls. SOC 2 CC6.1 — “logical access” — is your SSH/PAM hardening + IAM. PCI 6.5 — “secure development” — is your linting + Molecule + image scanning. There is no “SOC 2 oscap profile”; instead, your compliance role implements the OS controls, your CI implements the application controls, and you maintain a control mapping spreadsheet that points each framework requirement at the role tag, CI pipeline step, or runbook that satisfies it.

Q3. DISA releases a new STIG version (V2R4 → V2R5). What changes? A3. Maybe 2–10 rules per quarter: added, removed, or severity-changed. The pattern: subscribe to the DISA RSS feed, diff the new XCCDF against the previous one (for example by comparing the rule lists extracted from each file), and produce a delta-PR against your role. Keep your role version pinned to the STIG version (role v3.4.0 maps to STIG V2R3) and bump only when you have absorbed the diff. Never scan against a STIG version newer than the one your role encodes.

Q4. Is --check mode safe to run for compliance scans? A4. Yes for Ansible-side validation — it tells you what would change. But --check is not equivalent to an OpenSCAP scan. SCAP checks the actual file contents on disk; Ansible --check only knows about what tasks would do. Run both: --check weekly for fast sanity, OpenSCAP nightly for legal evidence.

Q5. We want to use the same role for production and dev. How do we handle “this is just a dev box, don’t break ssh”? A5. Don’t relax controls based on environment — relax based on audience. Your dev hosts are still subject to compliance if a developer with access to production code logs in. The right answer is: same role, same defaults, but dev hosts allow extra users via kv_compliance_ssh_allowed_users, and dev hosts get a longer client_alive_interval so dev sessions don’t time out during long debug sessions. The compliance posture stays identical; only the operational comfort relaxes.

Q6. What about agent-based compliance tools (Wazuh, Tenable, Qualys)? A6. Use them for continuous monitoring (real-time file integrity, log analysis), not for quarterly compliance. They live alongside, not instead of, the OpenSCAP + Ansible loop. Your evidence packages quote OpenSCAP; your incident response cites Wazuh.

Q7. Why not just buy a vendor compliance tool? A7. You can. Vendor tools (Tenable.sc, Qualys Policy Compliance, Rapid7 InsightVM) wrap OpenSCAP + their own checks. They produce nice dashboards. They do not produce code your operators can edit. The combination — vendor tool for dashboards/alerts, Ansible+OpenSCAP for code-of-record — is what mature programs run. Buy the dashboard; never let the dashboard own your remediation.

Q8. How do we handle classified or air-gapped environments? A8. Mirror SSG content locally (scap-security-guide is a single RPM), build a private collection in your Automation Hub, and have AAP execution nodes inside the air-gap. The pattern is identical; the network is the only thing that changes. We will cover the air-gap topology in the Air-Gapped Ansible Operations lesson.

Q9. The role I write conflicts with the upstream RHEL9-STIG role. Who wins? A9. You do, but only if you fork. Never run two compliance roles in the same play: each one overwrites the settings the other just applied, so every run reports changes and the final state depends on role order. Fork upstream, audit the diff, pin to a specific tag, and own it like any other dependency.

Q10. What is the single compliance metric that actually matters? A10. Time-to-recovery from drift. Not pass-rate, not coverage, not number of rules encoded. If a sysadmin disables SELinux at 09:00 and the next compliance run at 21:00 re-enforces it, your TTR is 12 hours. If you catch and remediate within 30 minutes (via EDA), it’s 30 minutes. Auditors care; attackers care; engineers should care. Optimise for that.


Lab — harden a RHEL 9 lab host end-to-end

# 1. Spin a free RHEL 9 box (Vagrant, AWS Free Tier, or Hetzner/DigitalOcean)
vagrant init generic/rhel9 && vagrant up

# 2. Pull SSG and scanner — pre-flight check
ssh vagrant@10.0.0.10 "sudo dnf -y install scap-security-guide openscap-scanner"

# 3. Run the role
git clone https://github.com/ansible-lockdown/RHEL9-STIG kv.rhel9_stig
# JSON form keeps false a real boolean; key=value would pass the string "false"
ansible-playbook -i inventory.yml site.yml \
  -e '{"rhel9stig_disruption_high": false}' \
  --tags cat1,cat2

# 4. Scan
ansible -i inventory.yml all -m ansible.builtin.shell -a \
  'oscap xccdf eval --profile xccdf_org.ssgproject.content_profile_stig \
     --report /tmp/report.html /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml; \
   echo $?'

# 5. Fetch evidence
ansible -i inventory.yml all -m ansible.builtin.fetch -a \
  'src=/tmp/report.html dest=evidence/'

Expected outcome: first run reports 80–120 failures (depending on the base image), second run after the role applies reports 5–20 (the legitimate exceptions for your environment). Document each remaining failure as an exception or a follow-up ticket. A green scan is a sign that your exceptions are accurate, not that nothing is wrong.


Cert mapping

This lesson supports EX415 (Red Hat Certified Specialist in Security: Linux), EX319 (Security Hardening with Red Hat Identity Management), and the security domains of EX374 (RHCEAA — Ansible Automation Platform).

For DoD environments, the lesson aligns with DISA STIG V2R3 for RHEL 9 (released 2024-04-15). For commercial environments, CIS Benchmark v1.0.0 for RHEL 9 (released 2023-11-30). For Windows, CIS Microsoft Windows Server 2022 v2.0.0 and DISA STIG V1R5.

Real-world adjacent material: SOC 2 CC6 (logical access), PCI-DSS 4.0 sections 2 (configuration), 6 (secure development), 10 (logging), and 11 (vulnerability management); HIPAA Security Rule §164.308 (administrative safeguards) and §164.312 (technical safeguards).


Next steps

The next lesson is Disaster Recovery Automation with Ansible — how to encode RPO/RTO targets, dual-site replication, and failover runbooks as code. Compliance hardens hosts; DR keeps them alive. Together they are what separates a regulated production stack from a science project.
