Ansible Lesson 39 of 42

Ansible × ITSM & ChatOps, In Depth: ServiceNow CMDB Inventory, Change-Gated Job Templates, Event-Driven Approvals & Slack/Teams Bidirectional Flows

This is one of the lessons that, if you implement it well, fundamentally changes how your organisation perceives “automation.” Up to this point in the series, your playbooks have been triggered by humans on a CLI, by Git pushes, or by AAP schedules. That is fine for sandbox and pre-prod. In production at a regulated enterprise (banks, insurers, healthcare, telecom, utilities, anything subject to SOX, SOC 2, ISO 27001, HIPAA or PCI-DSS) there is a hard organisational rule that automation must obey:

No production change happens without an approved Change ticket.

And a softer but equally important rule:

No production change happens silently. Operators see it in the same channel where they see everything else — usually Slack or Teams.

These two rules turn ITSM and ChatOps from “nice integrations” into the control plane of your automation. ServiceNow (or BMC Helix, or Jira Service Management) becomes the authority on what is allowed to run, against what, when, and by whom. Slack/Teams becomes the human surface of the automation: the place where engineers approve, query state, and trigger safe operations without leaving the conversation.

This lesson is the deep-dive into that wiring. We will cover the four patterns that, together, define a mature ITSM + ChatOps integration:

  1. ServiceNow CMDB as a dynamic inventory — the CMDB becomes Ansible’s source of truth for hosts, applications, business services, and ownership.
  2. Change-ticket-as-prerequisite (CHG-gate) — AAP job templates refuse to run unless an approved CHG ticket exists, is in the right state, has the right CIs attached, and is inside its scheduled window.
  3. Event-Driven Ansible (EDA) rulebooks — incidents in ServiceNow trigger remediation playbooks; results are written back as work notes and the incident is auto-resolved when remediation succeeds.
  4. Slack/Teams ChatOps with real approvals — engineers can run safe ops directly from chat (@kv-bot reboot prod-app-04), and the bot proxies through ServiceNow approval rules so the audit trail still leads back to the change record.

By the end of this lesson you should be able to look at a regulated enterprise’s compliance auditor, point at the chain of evidence from “Slack message → ServiceNow CHG → AAP job → host change → ServiceNow work note → resolved CHG,” and have them sign off without a follow-up question. That is the bar.


1. Why ITSM integration is non-negotiable in regulated enterprises

There is a recurring pattern in mid-sized engineering orgs: the platform team builds beautiful Ansible automation, demos it, and then production teams refuse to adopt it. The reason given is usually “we don’t trust automation in prod.” The real reason, almost every time, is:

Production teams are personally accountable to auditors. They cannot allow a change to land in production unless they can point at a CHG ticket that authorised it.

If your automation cannot produce that ticket-shaped audit artefact, it does not get adopted. Period.

The four classes of ITSM evidence auditors look for, in order of importance:

Evidence | What auditors want to see | How automation must produce it
Authorisation | A CHG ticket in Scheduled or Implement state, approved by named approver(s), referencing the affected CIs | AAP refuses to run unless a valid CHG number is supplied and validated
Execution window | Change occurred between start_date and end_date of the CHG | AAP refuses to run outside the window
Affected CIs | The CIs the playbook actually touched match the CIs listed on the CHG | AAP enforces inventory ⊆ CHG.affected_cis
Closure | Work notes describing what was done; CHG transitioned to Review/Closed with success/failure evidence | Playbook writes signed work notes and updates CHG state automatically

A common mistake is to treat ITSM as a notification target — “we’ll just email ServiceNow when a job runs.” That gives auditors no enforcement, no link between change and execution, and no automatic closure. It will fail your first ITGC audit.

The pattern in this lesson treats ServiceNow as a gate, not a notification target. Without a valid, approved, in-window CHG, the playbook does not run. The job’s first task is servicenow.itsm.change_request_info, and the job fails closed if the lookup returns anything other than an approved, scheduled, CI-matched CHG.


2. servicenow.itsm collection: the connector

Red Hat’s officially-supported collection is servicenow.itsm. This lesson uses its change request, incident, problem, attachment, and generic api modules, plus the now inventory plugin and an EDA event source.

Authentication supports both basic auth (username + password) and OAuth2 (preferred for production). For AAP, you create a custom credential type that maps to the collection’s environment variables:

# inputs schema for AAP custom credential type "ServiceNow OAuth"
fields:
  - id: instance_host
    type: string
    label: ServiceNow instance hostname
  - id: client_id
    type: string
    label: OAuth client ID
  - id: client_secret
    type: string
    label: OAuth client secret
    secret: true
  - id: username
    type: string
    label: Service account username
  - id: password
    type: string
    label: Service account password
    secret: true
required:
  - instance_host
  - client_id
  - client_secret
  - username
  - password

# injectors
env:
  SN_HOST: '{{ instance_host }}'
  SN_CLIENT_ID: '{{ client_id }}'
  SN_CLIENT_SECRET: '{{ client_secret }}'
  SN_USERNAME: '{{ username }}'
  SN_PASSWORD: '{{ password }}'

Now any AAP job template with this credential injected gets ServiceNow access via the standard servicenow.itsm env var contract — no inline secrets, no vars_prompt.
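A quick way to confirm the injected variables work is a one-task connectivity play. This is a minimal sketch: it assumes the job template carries the credential above, so no instance block is set and the module falls back to the SN_* environment variables. CHG0000001 is a placeholder number, not a real record.

```yaml
---
- name: Prove ServiceNow connectivity from AAP
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Look up a known change request
      servicenow.itsm.change_request_info:
        number: CHG0000001   # placeholder; use a CHG that exists in your instance
      register: chg_probe

    - name: Show how many records came back
      ansible.builtin.debug:
        msg: "Found {{ chg_probe.records | length }} record(s)."
```

If the lookup fails with an authentication error, the credential mapping is the first thing to check, before any playbook logic.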

The service account itself needs specific roles in ServiceNow:


3. CMDB as dynamic inventory

The first major pattern is making the CMDB authoritative for inventory. This sounds simple but has far-reaching implications: if it works, you stop maintaining hand-edited inventory files, and the relationship between “what we think we run” and “what Ansible runs against” becomes always-correct-by-construction.

The collection ships an inventory plugin: servicenow.itsm.now. A minimal config:

# inventory/servicenow.yml
---
plugin: servicenow.itsm.now

# pull all server CIs that are operational
table: cmdb_ci_server
sysparm_query: "operational_status=1^install_status=1"

# fetch every column referenced below; the u_* names are custom fields assumed to exist
columns:
  - name
  - ip_address
  - os
  - sys_id
  - owned_by
  - support_group
  - location
  - u_environment
  - u_business_application
  - u_pci_scope

# build groups from CI columns (the plugin returns display values as plain strings)
groups:
  linux: "os.lower() is search('linux|rhel|ubuntu|debian|centos|rocky|alma|sles')"
  windows: "os.lower() is search('windows|win')"
  prod: "support_group is search('Production')"
  pci_scope: "u_pci_scope == 'true'"

# build group hierarchy from business_application
keyed_groups:
  - key: u_business_application | lower | replace(' ', '_')
    prefix: app
  - key: u_environment | lower
    prefix: env
  - key: location | lower | replace(' ', '_')
    prefix: site

# variables to attach to each host
compose:
  ansible_host: ip_address
  ansible_user: "'ansible-svc' if os.lower() is search('linux') else 'svc-ansible'"
  cmdb_sys_id: sys_id
  cmdb_owner: owned_by
  cmdb_environment: u_environment
  cmdb_business_app: u_business_application
  cmdb_pci_scope: u_pci_scope

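What you get for free: OS groups (linux, windows), prod and pci_scope groups, keyed groups per business application, environment, and site, and cmdb_* host variables that later plays can match against a CHG.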

The crucial discipline: CMDB must be the source of truth for hostnames and IPs. If your CMDB is inaccurate, this pattern amplifies the inaccuracy into automation. Most organisations need a 6-12 month CMDB hygiene project before this pattern becomes safe. The “discovery → reconcile → remediate” pattern (running ServiceNow Discovery alongside gather_facts and reconciling differences) is its own multi-week project.

A pragmatic compromise: start with CMDB as inventory for non-production environments where the cost of inaccuracy is low, fix CMDB through the feedback loop, and graduate to prod only when you have a clean reconciliation report.
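The feedback loop can start very small. The play below is a sketch of one reconciliation check: it compares the IP address the CMDB supplied (exposed as ansible_host by the inventory config) with the address the host itself reports. The group name env_dev is an assumption about what the keyed_groups rule produces for a non-production environment in your CMDB.

```yaml
---
- name: Reconcile CMDB-sourced inventory against live facts
  hosts: env_dev          # hypothetical keyed group for a non-production environment
  gather_facts: true
  tasks:
    - name: Flag hosts whose CMDB IP differs from the address the host reports
      ansible.builtin.debug:
        msg: >-
          CMDB drift on {{ inventory_hostname }}:
          CMDB says {{ ansible_host }},
          host reports {{ ansible_facts.default_ipv4.address | default('unknown') }}.
      when: ansible_facts.default_ipv4.address | default('') != ansible_host
```

Hosts that turn out to be unreachable during this run are a drift signal too, since the CMDB claims they are operational.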

3.1 Caching to survive ServiceNow rate limits

ServiceNow’s REST API is not designed for bursty inventory queries. With more than ~5,000 CIs and frequent AAP job runs, you will hit rate limits or timeouts. Configure aggressive inventory caching:

# ansible.cfg (INI syntax; these keys live in the [inventory] section)
[inventory]
cache = true
cache_plugin = jsonfile
cache_timeout = 1800
cache_connection = /var/cache/ansible/inventory
cache_prefix = snow_

For AAP, the inventory source has an “update on launch” toggle. Disable it for fast-running playbooks; use a scheduled inventory sync (every 15-30 minutes) instead. A stale-by-15-minutes inventory is acceptable; an inventory sync that takes 4 minutes before every job run is not.


4. The CHG-gate pattern

This is the most important pattern in this lesson. The rule is:

Every production job template’s first play, before gather_facts, validates the CHG ticket. If the validation fails, the job fails. There is no override flag.

Job templates expose a survey field change_request_number (string, required, regex ^CHG\d{7,}$). The first play looks like this:

---
- name: Pre-flight CHG validation
  hosts: localhost
  gather_facts: false
  connection: local
  tasks:

    - name: Look up the change request
      servicenow.itsm.change_request_info:
        number: "{{ change_request_number }}"
      register: chg_lookup
      no_log: false  # the CHG metadata itself is not secret

    - name: Fail-closed if CHG not found
      ansible.builtin.fail:
        msg: "CHG {{ change_request_number }} does not exist."
      when: chg_lookup.records | length == 0

    - name: Capture the CHG record
      ansible.builtin.set_fact:
        chg: "{{ chg_lookup.records[0] }}"

    - name: Fail-closed if CHG state is not Scheduled or Implement
      ansible.builtin.fail:
        msg: >
          CHG {{ chg.number }} is in state '{{ chg.state }}'.
          Required state: 'scheduled' or 'implement'.
          Current approver state: {{ chg.approval }}.
      when: chg.state not in ['scheduled', 'implement']

    - name: Fail-closed if CHG is not approved
      ansible.builtin.fail:
        msg: "CHG {{ chg.number }} is not approved (approval={{ chg.approval }})."
      when: chg.approval != 'approved'

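    - name: Fail-closed if outside scheduled window
      vars:
        # gather_facts is off, so ansible_date_time is undefined; use now() instead.
        # Assumes ServiceNow returns 'YYYY-MM-DD HH:MM:SS' in UTC for this account.
        now_utc: "{{ now(utc=true).strftime('%Y-%m-%d %H:%M:%S') }}"
      ansible.builtin.fail:
        msg: >
          CHG {{ chg.number }} window is
          {{ chg.start_date }} → {{ chg.end_date }}.
          Current time {{ now_utc }} UTC is outside the window.
      when: now_utc < chg.start_date or now_utc > chg.end_date

    - name: Look up the CIs attached to the CHG (Affected CIs related list)
      servicenow.itsm.api:
        resource: task_ci
        action: get
        query_params:
          sysparm_query: "task={{ chg.sys_id }}"
          sysparm_fields: ci_item
          sysparm_exclude_reference_link: "true"
      register: chg_cis

    - name: Collect the sys_ids of the affected CIs
      ansible.builtin.set_fact:
        # the api module returns its result under `record`
        chg_ci_sys_ids: "{{ chg_cis.record | default([]) | map(attribute='ci_item') | list }}"

    - name: Fail-closed if any inventory host is not in CHG.affected_cis
      ansible.builtin.fail:
        msg: >
          Host {{ item }} is not listed in CHG {{ chg.number }} affected CIs.
          CHG covers CI sys_ids: {{ chg_ci_sys_ids | join(', ') }}.
      when: hostvars[item].cmdb_sys_id not in chg_ci_sys_ids
      loop: "{{ groups['target_hosts'] }}"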

    - name: Transition CHG to Implement state
      servicenow.itsm.change_request:
        number: "{{ chg.number }}"
        state: implement
        other:
          # fields without a dedicated module parameter are passed through `other`
          work_notes: >
            AAP job {{ tower_job_id }} (template '{{ tower_job_template_name }}')
            starting at {{ now(utc=true).strftime('%Y-%m-%dT%H:%M:%SZ') }}.
            Triggered by {{ tower_user_name }}.
      when: chg.state == 'scheduled'
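As defence in depth, the survey's format rule can also be asserted inside the playbook, so the check does not depend on how the job was launched. This is a minimal sketch, intended to run before the ServiceNow lookup. The character class [0-9] stands in for \d to sidestep YAML and Jinja escaping questions.

```yaml
---
- name: Pre-flight input validation
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Reject malformed CHG numbers before calling ServiceNow
      ansible.builtin.assert:
        that:
          - change_request_number is defined
          - "change_request_number is match('^CHG[0-9]{7,}$')"
        fail_msg: "change_request_number must look like CHG0012345."
        quiet: true
```

Validating the shape first also keeps arbitrary strings out of the query sent to ServiceNow.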

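What this gives you: a job that fails closed unless the CHG exists, is approved, is in Scheduled or Implement, is inside its window, and covers every target host, plus a work note on the CHG recording which AAP job started and who triggered it.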

The post-play, run after the main playbook completes, closes the loop:

- name: Post-flight CHG closure
  hosts: localhost
  gather_facts: false
  connection: local
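  vars:
    # ansible_failed_task only exists inside a rescue section, so it cannot signal
    # failure here. This assumes the main play's rescue sets a fact named
    # change_failed (an illustrative name) on every host that fails.
    job_succeeded: >-
      {{ groups['target_hosts'] | map('extract', hostvars)
         | selectattr('change_failed', 'defined') | list | length == 0 }}
  tasks:
    - name: Render evidence bundle path
      ansible.builtin.set_fact:
        evidence_path: "/var/lib/awx/evidence/{{ tower_job_id }}.tar.gz"

    - name: Attach evidence bundle to CHG
      servicenow.itsm.attachment_upload:
        table_name: change_request
        table_sys_id: "{{ chg.sys_id }}"
        attachments:
          - path: "{{ evidence_path }}"
      when: evidence_path is file

    - name: Write closure work note
      servicenow.itsm.change_request:
        number: "{{ chg.number }}"
        close_code: "{{ 'successful' if job_succeeded | bool else 'unsuccessful' }}"
        close_notes: "Automated closure by AAP job {{ tower_job_id }}."
        state: "{{ 'review' if job_succeeded | bool else 'implement' }}"
        other:
          # gather_facts is off, so the timestamp comes from now() rather than ansible_date_time
          work_notes: |
            AAP job {{ tower_job_id }} completed at {{ now(utc=true).strftime('%Y-%m-%dT%H:%M:%SZ') }}.
            Status: {{ 'SUCCESS' if job_succeeded | bool else 'FAILED' }}.
            Hosts changed: {{ groups['target_hosts'] | length }}.
            Evidence bundle: attached.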

A failed job stays in Implement state — it does not auto-close as failed. That is deliberate: a failure means a human has to investigate and decide what comes next. Automatic closure of failed changes hides incidents.
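For the post-flight play to report failure at all, the main play has to survive its own errors: ansible_failed_task is only defined inside a rescue section, and a play that fails on every host stops the playbook before a later play can run. One way to handle this, sketched below, is to wrap the change tasks in block/rescue and record the failure as a fact. The fact names change_failed and change_failed_task, and the command path, are illustrative assumptions.

```yaml
---
- name: Main change play (sketch)
  hosts: target_hosts
  gather_facts: true
  tasks:
    - name: Apply the change, recording failure instead of aborting the playbook
      block:
        - name: Placeholder for the real change tasks
          ansible.builtin.command: /usr/local/bin/apply-change   # hypothetical command
      rescue:
        - name: Remember that this host failed and where
          ansible.builtin.set_fact:
            change_failed: true
            change_failed_task: "{{ ansible_failed_task.name }}"
```

The post-flight play can then derive success from hostvars, for example by checking that no host in target_hosts has change_failed set. Because a rescued host counts as recovered, the post-flight play should end with an explicit fail task when any host recorded a failure, so the AAP job itself still shows as failed.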

4.1 Standard changes get a streamlined path

Not every change needs a 5-day CAB approval cycle. ServiceNow has a concept of “standard changes” — pre-approved templates for low-risk, repeatable operations (e.g., “rotate TLS certificate,” “patch low-risk Linux kernel CVE”). The collection supports creating CHGs from a standard change template:

- name: Create standard CHG for cert rotation
  servicenow.itsm.change_request:
    type: standard
    template: "Standard - TLS Certificate Rotation"
    short_description: "Rotate TLS cert for {{ inventory_hostname }}"
    assignment_group: "Platform Engineering"
    state: scheduled
    other:
      cmdb_ci: "{{ cmdb_sys_id }}"
      # 30-minute window starting now, in ServiceNow's 'YYYY-MM-DD HH:MM:SS' format (UTC assumed)
      start_date: "{{ '%Y-%m-%d %H:%M:%S' | strftime(now().timestamp(), utc=true) }}"
      end_date: "{{ '%Y-%m-%d %H:%M:%S' | strftime(now().timestamp() + 30 * 60, utc=true) }}"
  register: created_chg

You give engineers a self-service “rotate cert” button in Slack; the bot creates the standard CHG, which is pre-approved by virtue of its template, and runs the job. The audit trail still exists, the CAB does not have to meet, and the cycle time drops from days to seconds. This is how mature orgs scale automation without breaking governance.


5. Event-Driven Ansible: incident → remediation → closure loop

The pattern so far is human-initiated, ITSM-gated. The complement is event-initiated, ITSM-recorded: an incident appears in ServiceNow (from monitoring, from a user ticket, from anywhere), EDA detects it, runs a remediation playbook, and writes the result back as a work note.

Event-Driven Ansible uses rulebooks — declarative YAML mapping sources (event producers) to conditions (rules) to actions (run a playbook, post to webhook, etc.).

The servicenow.itsm collection ships an EDA source plugin that polls ServiceNow’s Table API for new or updated records. A minimal rulebook:

# rulebooks/servicenow-incidents.yml
---
- name: ServiceNow incident remediation
  hosts: all
  sources:
    - servicenow.itsm.records:
        instance:
          host: "{{ SN_HOST }}"
          username: "{{ SN_USERNAME }}"
          password: "{{ SN_PASSWORD }}"
        table: incident
        query: "active=true^state=1^assignment_group.nameLIKEPlatform"
        interval: 30

  rules:
    - name: "Disk full: run cleanup"
      condition: |
        event.short_description is search("disk.*full|filesystem.*full", ignorecase=true)
        and event.priority in [1, 2, 3]
      action:
        run_job_template:
          name: "INC: Disk cleanup"
          organization: Default
          job_args:
            extra_vars:
              incident_number: "{{ event.number }}"
              target_host: "{{ event.cmdb_ci.display_value }}"

    - name: "Service down: restart and verify"
      condition: |
        event.short_description is search("service.*down|process.*not running", ignorecase=true)
        and event.priority in [1, 2]
      action:
        run_job_template:
          name: "INC: Service restart"
          organization: Default
          job_args:
            extra_vars:
              incident_number: "{{ event.number }}"
              target_host: "{{ event.cmdb_ci.display_value }}"
              service_name: "{{ event.short_description | regex_search('service\\s+(\\S+)', '\\1') | first }}"

    - name: "Unknown high-priority incident: page on-call"
      condition: |
        event.priority in [1, 2]
        and event.assignment_group.display_value == "Platform"
      action:
        post_event:
          event:
            type: pagerduty_trigger
            incident_number: "{{ event.number }}"
            severity: "{{ event.priority }}"
            description: "{{ event.short_description }}"

Activating this rulebook in EDA means: every 30 seconds, EDA polls ServiceNow for new, active incidents assigned to Platform groups; incidents matching a known shape trigger the right remediation job; unmatched P1/P2 incidents page on-call.
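With a polling source it is worth guarding against the same incident launching the same job twice, for example if a record is updated and emitted again while remediation is still running. ansible-rulebook's throttle option is one way to do that. The sketch below reuses the source from the rulebook above; the 10-minute window is an arbitrary assumption to tune against your own job durations.

```yaml
---
- name: ServiceNow incident remediation (throttled)
  hosts: all
  sources:
    - servicenow.itsm.records:
        instance:
          host: "{{ SN_HOST }}"
          username: "{{ SN_USERNAME }}"
          password: "{{ SN_PASSWORD }}"
        table: incident
        query: "active=true^state=1^assignment_group.nameLIKEPlatform"
        interval: 30

  rules:
    - name: Disk full, at most one cleanup per incident per window
      condition: event.short_description is search("disk.*full", ignorecase=true)
      throttle:
        once_within: 10 minutes        # assumed window, not a recommendation
        group_by_attributes:
          - event.number               # throttle per incident, not globally
      action:
        run_job_template:
          name: "INC: Disk cleanup"
          organization: Default
          job_args:
            extra_vars:
              incident_number: "{{ event.number }}"
```

Grouping by event.number means two different incidents are still handled independently; only repeats of the same incident inside the window are suppressed.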

The remediation playbook itself follows a strict contract:

---
- name: Remediate disk full incident
  hosts: "{{ target_host }}"
  gather_facts: true
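  # incident_number and target_host arrive as extra_vars from the rulebook; redefining
  # incident_number in terms of itself here would cause a recursive template error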
  tasks:

    - name: Acknowledge incident
      servicenow.itsm.incident:
        number: "{{ incident_number }}"
        state: in_progress
        other:
          # work_notes has no dedicated module parameter, so it goes through `other`
          work_notes: >
            AAP {{ tower_job_id }} starting auto-remediation at {{ ansible_date_time.iso8601 }}.
      delegate_to: localhost
      run_once: true

    - name: Find candidate paths to clean
      ansible.builtin.find:
        paths:
          - /var/log
          - /tmp
          - /var/cache
        age: 7d
        size: 100m
      register: cleanup_candidates

    - name: Compress old logs
      ansible.builtin.archive:
        path: "{{ item.path }}"
        dest: "{{ item.path }}.gz"
        format: gz
        remove: true
      loop: "{{ cleanup_candidates.files | selectattr('path', 'match', '.*\\.log$') | list }}"
      register: compressed

    - name: Re-check disk usage
      ansible.builtin.command: df -BG /
      register: df_after
      changed_when: false

    - name: Parse the used percentage for /
      ansible.builtin.set_fact:
        # a pattern such as "[0-7][0-9]%" also matches inside "100%", so compare numbers instead
        disk_used_pct: "{{ df_after.stdout | regex_search('(\\d+)%', '\\1') | first | int }}"

    - name: Resolve incident
      servicenow.itsm.incident:
        number: "{{ incident_number }}"
        state: resolved
        close_code: "Solved (Permanently)"
        close_notes: |
          Auto-remediated by AAP job {{ tower_job_id }}.
          Compressed {{ compressed.results | length }} log files.
          Disk usage after cleanup:
          {{ df_after.stdout }}
      delegate_to: localhost
      run_once: true
      when: disk_used_pct | int < 80

    - name: Escalate if still full
      servicenow.itsm.incident:
        number: "{{ incident_number }}"
        state: in_progress
        urgency: high
        other:
          work_notes: |
            Auto-remediation insufficient. Disk still {{ disk_used_pct }}% full.
            Escalating to on-call.
      delegate_to: localhost
      run_once: true
      when: disk_used_pct | int >= 80

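Key discipline points: acknowledge the incident before touching the host, make every ServiceNow call from localhost with run_once, verify the fix before resolving, and escalate with evidence when remediation falls short.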

This pattern collapses MTTR for known-shape incidents from 20-40 minutes (page → ack → triage → fix → resolve) to 30-90 seconds. For an organisation with ~50 such incidents per week, that’s a real and measurable reduction in toil.

5.1 Closing the loop with problem records

Repeated incidents on the same CI within a window indicate a problem (in ITIL terms), not just incidents. A nice elaboration:

- name: "Problem detection: count incidents on this CI in last 30 days"
  servicenow.itsm.api:
    resource: incident
    action: get
    query_params:
      # keep the encoded query on one line; a folded scalar would insert a space after each ^
      sysparm_query: "cmdb_ci={{ cmdb_sys_id }}^opened_at>=javascript:gs.daysAgoStart(30)^short_descriptionLIKEdisk full"
  register: same_incidents
  delegate_to: localhost

- name: Open problem record if >3 incidents on same CI
  servicenow.itsm.problem:
    short_description: "Recurring disk full on {{ inventory_hostname }}"
    description: |
      {{ same_incidents.record | length }} disk-full incidents on this host in last 30 days.
      Auto-remediation working but treating symptom only.
      Likely cause: insufficient log rotation policy or runaway logging.
    impact: medium
    urgency: medium
    other:
      cmdb_ci: "{{ cmdb_sys_id }}"
  # the api module returns its result under `record`
  when: same_incidents.record | length > 3
  delegate_to: localhost
  run_once: true

Now the bot is not just fixing symptoms but flagging chronic root causes. Auditors love this. “We automated remediation but never investigated the underlying problem” is one of the classic anti-patterns auditors look for, and this addresses it directly.


6. ChatOps: Slack & Teams as the human surface

The fourth pillar is making automation visible and approachable in chat. In a mature setup, an engineer types @kv-bot reboot prod-app-04 in #platform-ops and the bot:

  1. Recognises this is a production action
  2. Looks up prod-app-04 in the CMDB to find the responsible team
  3. Creates a standard CHG ticket
  4. Posts an interactive message in Slack/Teams: “🚨 Production reboot requested by @vinod for prod-app-04. Approve?”
  5. Routes the approval prompt to the on-call from the responsible team
  6. Once approved (in chat), runs the AAP job
  7. Streams progress back to the original thread
  8. Closes the CHG with the result

The Slack bot is itself an Ansible-driven service. The path:

Slack slash command / mention
    → Slack Events API webhook
        → AAP webhook receiver (or EDA webhook source)
            → AAP job template "ChatOps router"
                → Creates CHG, posts approval message, waits for response
                    → On approve: runs target job template
                    → On deny: posts denial reason

EDA’s ansible.eda.webhook source plugin is the entry point:

# rulebooks/chatops.yml
---
- name: ChatOps router
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000
        token: "{{ CHATOPS_WEBHOOK_TOKEN }}"

  rules:
    - name: Reboot command
      condition: |
        event.payload.command == "reboot"
        and event.payload.target is defined
        and event.payload.user_id is defined
      action:
        run_job_template:
          name: "ChatOps: Reboot"
          organization: Default
          job_args:
            extra_vars:
              chat_user: "{{ event.payload.user_id }}"
              chat_channel: "{{ event.payload.channel_id }}"
              chat_thread: "{{ event.payload.thread_ts }}"
              target_host: "{{ event.payload.target }}"

    - name: Status command (read-only, no CHG)
      condition: event.payload.command == "status"
      action:
        run_job_template:
          name: "ChatOps: Status read-only"
          organization: Default
          job_args:
            extra_vars:
              chat_channel: "{{ event.payload.channel_id }}"
              chat_thread: "{{ event.payload.thread_ts }}"
              target_host: "{{ event.payload.target }}"

The “ChatOps: Reboot” job template runs a playbook that:

---
- name: ChatOps reboot orchestrator
  hosts: localhost
  gather_facts: false
  tasks:

    - name: Verify target exists in CMDB
      servicenow.itsm.api:
        resource: cmdb_ci_server
        action: get
        query_params:
          sysparm_query: "name={{ target_host }}"
      register: ci_lookup

    - name: Fail if target unknown
      ansible.builtin.fail:
        msg: "Host '{{ target_host }}' not found in CMDB."
      # the api module returns its result under `record`
      when: ci_lookup.record | length == 0

    - name: Capture CI metadata
      ansible.builtin.set_fact:
        ci: "{{ ci_lookup.record[0] }}"

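    - name: Tell the requester they are not authorised for this CI's environment
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          channel: "{{ chat_channel }}"
          thread_ts: "{{ chat_thread }}"
          text: >
            ❌ <@{{ chat_user }}> is not authorised to reboot
            {{ target_host }} ({{ ci.u_environment.display_value }}).
            Please ask {{ ci.support_group.display_value }} to file a CHG.
      when:
        - ci.u_environment.display_value == 'production'
        - chat_user not in approved_chatops_users

    - name: Stop here if the requester is not authorised
      ansible.builtin.fail:
        msg: "{{ chat_user }} is not authorised to reboot {{ target_host }}."
      when:
        - ci.u_environment.display_value == 'production'
        - chat_user not in approved_chatops_users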

    - name: Create standard CHG
      servicenow.itsm.change_request:
        type: standard
        template: "Standard - Server Reboot"
        short_description: "ChatOps reboot {{ target_host }}"
        requested_by: "{{ chat_user_email }}"
        assignment_group: "{{ ci.support_group.display_value }}"
        state: scheduled
        other:
          cmdb_ci: "{{ ci.sys_id }}"
          # 15-minute window starting now; gather_facts is off, so use now() (UTC assumed)
          start_date: "{{ '%Y-%m-%d %H:%M:%S' | strftime(now().timestamp(), utc=true) }}"
          end_date: "{{ '%Y-%m-%d %H:%M:%S' | strftime(now().timestamp() + 15 * 60, utc=true) }}"
      register: chg

    - name: Post Slack message with approve/deny buttons
      community.general.slack:
        token: "{{ slack_bot_token }}"
        channel: "{{ chat_channel }}"
        thread_id: "{{ chat_thread }}"
        attachments:
          - text: >
              <@{{ chat_user }}> requested reboot of *{{ target_host }}*.
              CHG {{ chg.record.number }} created. Approve?
            color: warning
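            actions:
              # Note: a url button opens the link in the approver's browser (a GET), whereas
              # the AAP launch endpoint expects an authenticated POST. Treat these URLs as
              # placeholders for whatever service receives the approval click.
              - type: button
                text: Approve
                url: "https://aap.example.com/api/v2/job_templates/42/launch/?chg={{ chg.record.number }}&approve=true"
                style: primary
              - type: button
                text: Deny
                url: "https://aap.example.com/api/v2/job_templates/42/launch/?chg={{ chg.record.number }}&approve=false"
                style: danger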
      when: ci.u_environment.display_value == 'production'

    - name: Auto-approve and reboot for non-prod
      ansible.builtin.uri:
        url: "https://aap.example.com/api/v2/job_templates/43/launch/"
        method: POST
        body_format: json
        body:
          extra_vars:
            change_request_number: "{{ chg.record.number }}"
            target_host: "{{ target_host }}"
            chat_thread: "{{ chat_thread }}"
            chat_channel: "{{ chat_channel }}"
        headers:
          Authorization: "Bearer {{ aap_oauth_token }}"
        status_code: 201  # a successful launch returns 201 Created, not the default 200
      when: ci.u_environment.display_value != 'production'

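What this gives operators: a CMDB check on the target, an authorisation check on the requester, a standard CHG created on their behalf, an approval prompt in the original thread for production, and an immediate gated launch for non-production.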

The Teams equivalent uses adaptive cards with Action.Submit buttons that POST to the AAP webhook receiver. The pattern is identical; only the rendering primitive changes.

6.1 Read-only commands deserve their own pattern

Commands like @kv-bot status prod-app-04 should never create a CHG, never require approval, and should run as fast as possible. These are queries, not changes. The “ChatOps: Status read-only” job template uses a credential with read-only access to hosts and posts:

prod-app-04 (Linux RHEL 9.4, prod, payments_api)
  Uptime: 47 days
  Load: 0.32 / 0.41 / 0.38
  Memory: 14.2 GB / 32 GB used
  Disk /: 67%
  Last patched: 2026-05-14
  CHG history (30d): 4 changes, last CHG0098765 (2026-06-19)

This single message replaces five separate ServiceNow tab clicks. Engineers will thank you.
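A read-only status play can be little more than fact gathering plus one Slack post. The sketch below covers only the host-side lines of the message above (the patch date and CHG history would need extra lookups). It assumes the same slack_bot_token, chat_channel, chat_thread, and target_host variables used by the reboot playbook, and a Linux target that reports these facts.

```yaml
---
- name: ChatOps status (read-only)
  hosts: "{{ target_host }}"
  gather_facts: true
  tasks:
    - name: Post a status summary back to the originating thread
      community.general.slack:
        token: "{{ slack_bot_token }}"
        channel: "{{ chat_channel }}"
        thread_id: "{{ chat_thread }}"
        msg: |
          {{ inventory_hostname }} ({{ ansible_facts.distribution }} {{ ansible_facts.distribution_version }})
            Uptime: {{ (ansible_facts.uptime_seconds / 86400) | int }} days
            Load (1m): {{ ansible_facts.loadavg['1m'] | default('n/a') }}
            Memory: {{ ansible_facts.memtotal_mb - ansible_facts.memfree_mb }} MB / {{ ansible_facts.memtotal_mb }} MB used
      delegate_to: localhost   # the Slack call runs on the controller, facts come from the target
```

Nothing in this play changes the host, which is what justifies skipping the CHG gate for it.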


7. Failure modes and how to handle them

A few failure modes that will happen in production. Plan for them now, not at 4am.

Failure | Symptom | Mitigation
ServiceNow API down | All jobs fail at CHG validation | Fail-closed is correct here. Have a documented break-glass: a separate AAP credential that bypasses CHG for a 4-hour incident window, requires SecOps approval, and writes an INC retroactively
ServiceNow API rate-limited | Random job failures with HTTP 429 | Configure retry-with-backoff on all servicenow.itsm.* tasks: until: result is succeeded; retries: 5; delay: 30
CMDB inventory drift | Hosts missing from inventory | Schedule daily “CMDB hygiene” reports comparing AAP inventory against actual host responses; alert when drift > 5%
EDA rulebook crashes | Incidents pile up unhandled | Run two EDA replicas behind a load balancer; alert if rulebook activation status != “running” for > 5 min
Slack bot deleted from channel | ChatOps approvals silently lost | Bot must respond to its own @channel reload command and post weekly “I’m alive” health checks
Standard change template misconfigured | Bot creates CHGs that auto-fail | Lock standard change templates behind code review in ServiceNow Update Sets, and validate them in a UAT instance before promotion
ServiceNow OAuth token expired | All jobs fail with 401 | AAP credential injector should fetch fresh tokens via the client_credentials grant; rotate every 24h
Approver out of office | Production CHGs sit blocked | ServiceNow CAB rules should fall back to a backup approver group; document this in the runbook
Bot posts message but AAP webhook is down | Approval click → silent failure | Webhook receivers must respond within 3s with an ACK; the actual job runs async, with a Slack thread update on completion
Engineer types wrong CHG number | Job correctly fails, but engineer doesn’t know why | Slack bot’s failure messages must include a clickable ServiceNow link to the CHG state
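The retry mitigation for rate limiting uses Ansible's standard until/retries/delay task keywords. A minimal sketch on the CHG lookup, using the values from the table:

```yaml
---
- name: ServiceNow calls with retry
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Look up the change request, retrying on transient API errors
      servicenow.itsm.change_request_info:
        number: "{{ change_request_number }}"
      register: result
      until: result is succeeded   # retry while the task itself reports failure
      retries: 5
      delay: 30                    # seconds between attempts (a fixed delay, not exponential)
```

This retries every kind of failure, including genuine ones such as bad credentials, so a misconfigured job takes a few minutes longer to fail. That is usually an acceptable trade for surviving short bursts of HTTP 429.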

Two non-obvious lessons from running this in production:

Lesson 1 — the “approval fatigue” trap. If you make every change require Slack approval, on-calls start clicking ✅ without reading. The fix: tier your operations. Read-only → no approval. Standard non-prod changes → no approval, just notification. Standard prod changes → approval but with a 2-line summary in the message. Non-standard prod changes → approval + link to the change record + 30-second cooling-off period before the button works. This last one prevents accidental clicks.

Lesson 2 — never let the bot become the bottleneck. Your Slack bot will go down at the worst possible time. There must always be a manual escape hatch: an AAP UI URL that any authorised engineer can open and run the job from. Teams that build “Slack-only” automation get held hostage by their bot. Make Slack a convenience layer, not a single point of failure.


8. Evidence trail and audit-readiness

The end-to-end evidence chain for any production change should look like this when an auditor asks:

Slack message #platform-ops 2026-06-22T14:03:11Z
  → @vinod typed "@kv-bot patch prod-db-01"
    → AAP webhook received (request_id: r-7a82b4)
      → ChatOps router job 84291 (template "ChatOps: Patch")
        → CMDB lookup confirmed prod-db-01 (sys_id: abc123)
        → CHG0102847 created (standard change, template "Patch Linux")
        → Slack approve/deny posted in thread, message_ts: 1718978591.0034
          → Approver @lina clicked Approve at 14:04:22Z
            → AAP job 84292 launched (template "Patch Linux Standard")
              → Pre-flight CHG validation: PASSED
              → Inventory: prod-db-01 (single-host)
              → Tasks executed: 47, changed: 12
              → Post-flight: evidence bundle uploaded to s3://kv-evidence/2026/06/22/job-84292.tar.gz
                → CHG0102847 transitioned to Review
                  → CHG0102847 closed-successful at 14:11:44Z
                    → Slack thread updated: "✅ Done in 7m 22s"

Every step has a timestamp, an actor, and a system of record. That is the chain auditors want to see, and once the wiring is in place it is produced automatically for every change. Quarterly audit prep collapses from a week of evidence-gathering to a 30-minute query.

A useful nightly compliance report:

- name: "Nightly compliance report: CHG-to-job linkage"
  hosts: localhost
  gather_facts: false
  vars:
    # 24 hours ago in ISO 8601 (UTC); gather_facts is off, so use now() rather than ansible_date_time
    since: "{{ '%Y-%m-%dT%H:%M:%SZ' | strftime(now().timestamp() - 24 * 3600, utc=true) }}"
  tasks:
    - name: Get all AAP jobs from last 24h (first page only; follow `next` for more)
      ansible.builtin.uri:
        url: "https://aap.example.com/api/v2/jobs/?finished__gte={{ since }}"
        headers:
          Authorization: "Bearer {{ aap_oauth_token }}"
      register: aap_jobs

    - name: Find jobs that ran without a CHG number
      ansible.builtin.set_fact:
        # the template name lives under summary_fields; job_template itself is a numeric id
        non_compliant_jobs: >-
          {{ aap_jobs.json.results
             | rejectattr('extra_vars', 'search', 'change_request_number')
             | rejectattr('summary_fields.job_template.name', 'in', read_only_templates)
             | list }}

    - name: Open INC for non-compliant runs
      servicenow.itsm.incident:
        short_description: "Non-compliant AAP job: {{ item.name }}"
        description: "Job {{ item.id }} ran without a CHG reference. Investigate."
        impact: medium
        urgency: medium
        other:
          category: governance
      loop: "{{ non_compliant_jobs }}"

This loop catches automation that escaped the gate. In a healthy environment, the report runs nightly and finds zero offenders for months at a time. The day it finds one, you get a real signal.


9. The minimum viable maturity ladder

If you’re starting from scratch, this is the order I recommend:

  1. Week 1-2: Stand up the servicenow.itsm collection in AAP, configure the OAuth credential, run a manual change_request_info against an existing CHG. Prove connectivity.
  2. Week 3-4: Build the CHG-gate pre-flight playbook. Apply it to one low-risk job template. Run a real change through it.
  3. Month 2: Add the post-flight closure block. Now your one job template is fully gated and self-closing.
  4. Month 3: Roll the gate out to all production-touching job templates. Resist exceptions. Track the percentage of prod jobs that go through the gate; it should be 100% within a quarter.
  5. Month 4: Stand up CMDB-as-inventory for non-production. Prove it works. Fix CMDB hygiene problems as they surface.
  6. Month 5-6: Graduate CMDB-as-inventory to prod once hygiene metrics are clean.
  7. Month 7-8: Build the first EDA remediation rulebook. Pick the simplest, highest-volume incident shape (disk full is the canonical choice). Measure MTTR before and after.
  8. Month 9-10: Roll out ChatOps for read-only commands. No approvals, no CHGs needed — instant value, near-zero risk.
  9. Month 11-12: Roll out ChatOps for standard changes. Now you have full bidirectional integration.

Trying to do all of this in one quarter is a known failure mode. The teams that succeed do it incrementally, with each step proving value before the next is started.


10. Where this fits in the broader Tier 5 picture

The compliance lesson (D1) gave you the what — STIG, CIS, OpenSCAP, signed evidence. The DR lesson (D2) gave you the when-it-all-goes-wrong response. This lesson gives you the day-to-day governance fabric — the wiring that ensures every routine change is authorised, observed, and recorded.

The remaining specialist lessons fill in the rest of the operational picture: backup automation (D8), database migrations (D9), and the observability capstone (D10) that ties metrics, logs, traces, and AAP events into a single Grafana view of “is automation healthy?”

When ITSM, ChatOps, compliance, DR, and observability are all in place, you have built what regulators call a demonstrably-controlled automation environment — one where every change is authorised, observed, recorded, reversible, and reviewable. That is the destination of this whole course. ITSM integration is the connective tissue that makes the other pieces auditable, and ChatOps is the human-shaped surface that keeps engineers actually using the system rather than working around it.
