Ansible Lesson 38 of 42

Ansible for Edge & IoT Fleet Management, In Depth: Pull-Mode, Signed Manifests, Constrained Devices & Intermittent Networks

Datacentre Ansible assumes comfortable conditions: every host reachable on a stable network, fast SSH, predictable hardware, plentiful CPU and disk. Edge Ansible assumes the opposite. Devices are scattered across thousands of locations, behind NAT and shared 4G modems, with 1GB of RAM and 8GB of flash, online for minutes a day. The “fleet” is 5,000 retail kiosks, 25,000 wind-turbine controllers, 100,000 vehicle telematics units, or 1.5 million water-meter gateways. Push-mode Ansible (opening SSH from a control plane to every host on a schedule) is the wrong shape.

This lesson is the specialist guide to inverting the model: pull-based agents, signed manifests, fleet operators built on Kubernetes-at-the-edge (k3s, MicroK8s, KubeEdge), and the realistic operational patterns for constrained, intermittently-connected hardware. We will use ansible-pull, the Red Hat Device Edge / Image Mode for RHEL stack, and the ostree/bootc image-based update model where appropriate.

We will be opinionated about scale: at a hundred edge hosts, AAP push-mode still works. At a thousand, mesh execution nodes start to creak. At ten thousand and beyond, you must move to pull-mode and image-based updates; if you do not, you will brick devices in the field. The patterns here scale from one to one million.

Position in the curriculum. Tier 1–4 fluency required, plus the Tier 5 air-gapped lesson — many edge fleets are also air-gapped (industrial OT, aviation, defence). The compliance lesson informs the signing model.


What “edge” means and why it changes the rules

“Edge” covers a wide range; the relevant attributes for Ansible are:

Together these break every assumption datacentre Ansible makes:

Datacentre assumption                     | Edge reality
------------------------------------------|----------------------------------------------------------
Control plane connects out to host (push) | Host must connect to control plane (pull)
SSH always reachable                      | Device may be online for 4 minutes per day
Failed task = retry                       | Failed task on a 4G link = wait until tomorrow
dnf install foo or apt install foo        | No network during the install window
Rollback = revert config                  | Rollback = revert the entire OS image (atomic)
Privilege escalation via sudo             | Device root-of-trust signed firmware; no sudo at all
Inventory in CMDB updated by operators    | Inventory is a serial number on a sticker, scaling weekly

The mental shift is from “playbooks executed on hosts” to “hosts that pull and apply signed manifests.”


Three operational patterns at the edge

There are three patterns at scale; pick deliberately.

1. ansible-pull — the simplest pull-mode. Each device runs ansible-pull from cron / systemd timer, fetches a Git repo, runs the playbook locally. Works up to about 5k devices with discipline.

2. Image-based (ostree/bootc) — every change ships as a new bootable OS image. The device reboots into the new image (atomic) or rolls back (atomic) on failure. This is the Red Hat Device Edge and Fedora bootc / RHEL Image Mode model. Scales to millions; what big telcos and automakers actually use.

3. Kubernetes-at-the-edge — k3s/MicroK8s on each device, with a fleet operator (Rancher Fleet, Argo CD edge mode, EdgeX Foundry) that pulls and reconciles workloads. Most appropriate when the workload is itself container-based (ML models, data pipelines, video analytics).

In real fleets, you usually combine: image-based for the OS and platform layer (Pattern 2), Kubernetes for the application layer (Pattern 3), and ansible-pull for one-shot operational tasks (Pattern 1) when needed.


Pattern 1: ansible-pull for small to medium fleets

ansible-pull is the bare-minimum pull-mode runner. It clones a Git repo, runs a playbook locally with --connection=local, and exits. Configure on each device:

# /etc/systemd/system/ansible-pull.service
[Unit]
Description=ansible-pull
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/bin/ansible-pull \
  -U https://gitea.kv.local/edge/edge-fleet.git \
  -i localhost, \
  -C main \
  -d /var/lib/ansible-pull/repo \
  --vault-password-file /etc/ansible/vault.pwd \
  --verify-commit \
  edge-pull.yml

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/ansible-pull.timer
[Unit]
Description=ansible-pull every 30 minutes (with jitter)

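[Timer]
OnBootSec=2min
OnUnitActiveSec=30min
RandomizedDelaySec=10min
# Persistent= only affects OnCalendar= timers, so it is omitted here;
# OnBootSec= already triggers a run shortly after every boot.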

[Install]
WantedBy=timers.target

Three details that matter:

The repo is a normal Ansible project with one entry-point playbook (edge-pull.yml) using --connection=local. Tasks gated by host facts (when: ansible_facts.hostname.startswith('kiosk-')) let you carve out groups without separate inventories.

For small fleets up to ~5,000 devices, this is enough. The Git server (Gitea, GitLab, or the AAP-bundled Hub for collections) is the only “infrastructure.” Git servers fire webhooks on push, not on fetch, so to track which devices have pulled which commit, have each device report its applied commit to a metrics endpoint at the end of every run; that is your fleet status dashboard.
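A minimal sketch of the aggregation behind such a dashboard, assuming each device check-in arrives as a (device_id, commit, timestamp) record. The record shape and the function name are hypothetical; they are not part of ansible-pull or of any Git server.

```python
from collections import Counter


def summarise_checkins(checkins):
    """Return {commit: device_count}, counting only each device's newest report.

    `checkins` is an iterable of (device_id, commit, unix_timestamp) tuples;
    this shape is an assumption made for the sketch.
    """
    latest = {}
    for device_id, commit, ts in checkins:
        # Keep only the most recent report per device.
        if device_id not in latest or ts > latest[device_id][1]:
            latest[device_id] = (commit, ts)
    return dict(Counter(commit for commit, _ in latest.values()))


reports = [
    ("kiosk-001", "a1b2c3d", 100),
    ("kiosk-002", "a1b2c3d", 105),
    ("kiosk-001", "e4f5a6b", 200),  # kiosk-001 later applied a newer commit
]
summary = summarise_checkins(reports)
print(summary)
```

Counting only the newest report per device is what lets the dashboard answer "how many devices are on which commit right now" rather than "how many pulls happened".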


Signed manifests and rollback for ansible-pull

ansible-pull itself doesn’t roll back; if the playbook breaks, the device is broken. Engineer rollback explicitly:

# edge-pull.yml — checkpointed apply
---
- hosts: localhost
  connection: local
  tasks:

    - name: Read currently applied commit hash
      ansible.builtin.slurp:
        src: /var/lib/ansible-pull/applied_commit
      register: applied
      ignore_errors: true

    - name: Compute pending commit hash
      ansible.builtin.command:
        cmd: git -C /var/lib/ansible-pull/repo rev-parse HEAD
      register: pending
      changed_when: false

    - name: Snapshot before applying (btrfs)
      ansible.builtin.command:
        cmd: btrfs subvolume snapshot / /.snapshots/pre-{{ pending.stdout[:7] }}
      when: applied.content | default('') | b64decode | trim != pending.stdout
      ignore_errors: true   # not all devices have btrfs

    - name: Run the actual configuration role
      ansible.builtin.import_role:
        name: kiosk_configure

    - name: Smoke test
      ansible.builtin.import_role:
        name: kiosk_smoke

    - name: Persist applied commit (only if smoke test passed)
      ansible.builtin.copy:
        dest: /var/lib/ansible-pull/applied_commit
        content: "{{ pending.stdout }}\n"

    - name: Trim old snapshots to keep only last 3
      ansible.builtin.shell: |
        ls -1t /.snapshots | tail -n +4 | xargs -r -I {} btrfs subvolume delete /.snapshots/{}
      ignore_errors: true

Rollback is then a separate playbook triggered by a watchdog: if the device cannot reach the Git server for 24 hours and the smoke test is failing, the watchdog rolls back to the previous snapshot. This is the “deadman switch” pattern; it has saved more fleets than any other single mechanism.
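The watchdog's decision rule is small enough to write down. This is a sketch of the logic only, assuming the watchdog already knows when the Git server was last reached and whether the latest smoke test passed; `should_roll_back` and its parameters are hypothetical names, and the actual snapshot restore is left out.

```python
from datetime import datetime, timedelta


def should_roll_back(last_git_contact, smoke_test_ok, now,
                     max_offline=timedelta(hours=24)):
    """Deadman rule: roll back only if the device has been cut off from the
    Git server for at least `max_offline` AND its smoke test is failing."""
    offline_too_long = (now - last_git_contact) >= max_offline
    return offline_too_long and not smoke_test_ok


now = datetime(2024, 1, 2, 12, 0)
print(should_roll_back(datetime(2024, 1, 1, 11, 0), smoke_test_ok=False, now=now))
```

Requiring both conditions matters: a healthy device that is merely offline should keep running, and a failing device that is still online can receive a fix through the normal pull.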


Pattern 2: image-based updates with ostree / bootc

For fleets above ~5k devices, the right primitive is the bootable OS image. Instead of mutating the running OS (running dnf install on a Pi in the field), you publish a new immutable image; the device boots into it and rolls back on failure. Image-based A/B updates of this kind are the norm in large automotive and aerospace OTA stacks.

The Red Hat way is RHEL Image Mode (bootc), which uses OSTree as the on-device store and OCI container images as the build/distribution format. bootc is a small native runtime that knows how to switch between two deployments (current and pending) atomically.

# Containerfile — your edge OS image
FROM registry.redhat.io/rhel9/rhel-bootc:9.4

RUN dnf install -y \
      podman \
      systemd-container \
      ansible-core \
      python3-cryptography \
      kiosk-app && \
    dnf clean all

# Bake the kiosk app config into the image
COPY etc/kiosk/ /etc/kiosk/

# Enable services
RUN systemctl enable kiosk-app.service

# Bake the ansible pull config too
COPY etc/systemd/system/ansible-pull.service /etc/systemd/system/
COPY etc/systemd/system/ansible-pull.timer /etc/systemd/system/
RUN systemctl enable ansible-pull.timer

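Build, push, sign (cosign attaches its signature to an image that already exists in the registry, so the push comes first):

podman build -t registry.kv.local/kiosk-os:1.4.2 -f Containerfile .
podman push registry.kv.local/kiosk-os:1.4.2
cosign sign --key /etc/cosign-edge.key registry.kv.local/kiosk-os:1.4.2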

On each device:

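bootc switch registry.kv.local/kiosk-os:1.4.2   # track this image; stages it for the next boot
bootc upgrade --check     # show whether a newer image is available
bootc upgrade             # fetch and stage the new deployment (--apply would also reboot)
systemctl reboot          # boots into new deployment

# if anything is broken:
bootc rollback            # queue the previous deployment for the next boot
systemctl reboot          # boots into previous deployment, atomic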

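The state machine: each device has two deployments on disk (current and pending). On reboot, the bootloader boots the pending one. If it fails to boot 3 times (or fails health checks within N minutes), the device falls back to the previous one. bootc does not do this on its own: automatic fallback depends on bootloader boot counting plus a health-check framework such as greenboot, while bootc rollback remains the manual path. With that in place, a bad image is very hard to turn into a bricked device, because the previous deployment stays intact on disk until the new one has proven itself. This is the property that makes the model viable at million-device scale.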
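The fallback rule (try the pending deployment a limited number of times, then return to the previous one) can be modelled in a few lines. This simulates the decision logic only; it is not bootc or GRUB code, and `boot_with_fallback` and `max_attempts` are hypothetical names chosen for the sketch.

```python
def boot_with_fallback(attempt_results, max_attempts=3):
    """Simulate boot counting for a pending deployment.

    `attempt_results` yields one boolean per boot attempt of the pending
    deployment (True = booted and passed health checks). Returns a tuple of
    (deployment that ends up running, attempts spent on the pending one).
    """
    attempts = 0
    for ok in attempt_results:
        if attempts >= max_attempts:
            break
        attempts += 1
        if ok:
            return "pending", attempts
    # Boot budget exhausted (or no successful attempt): keep the old deployment.
    return "previous", attempts


print(boot_with_fallback([False, False, True]))
print(boot_with_fallback([False, False, False, True]))
```

The second call falls back even though a fourth attempt would have succeeded, which is the point of a fixed boot budget: the device stops gambling on the new image and returns to a known-good one.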

Ansible’s role here is driving the image build, not running on each device:

# build-edge-image.yml — runs on the build host
- hosts: build_host
  tasks:
    - name: Render Containerfile from template
      ansible.builtin.template:
        src: Containerfile.j2
        dest: /tmp/build/Containerfile

    - name: Build image
      containers.podman.podman_image:
        name: "registry.kv.local/kiosk-os"
        tag: "{{ release }}"
        path: /tmp/build
        push: true
        push_args:
          dest: "registry.kv.local/kiosk-os:{{ release }}"

    - name: Sign with cosign
      ansible.builtin.command:
        cmd: cosign sign --key /etc/cosign-edge.key registry.kv.local/kiosk-os:{{ release }}

    - name: Update fleet rollout config
      ansible.builtin.uri:
        url: https://fleet.kv.local/api/rollouts
        method: POST
        body_format: json
        body:
          target: kiosks
          image: "registry.kv.local/kiosk-os:{{ release }}"
          wave: canary
          percentage: 1

Then a fleet operator (Rancher Fleet, Argo CD edge, custom controller) watches the rollout config; each device picks up its target image at its next check-in and runs bootc switch locally.


Pattern 3: Kubernetes-at-the-edge with fleet operators

For workloads that are themselves container-based — ML inference, video analytics, data pipelines — running k3s on each device and using a fleet operator is often the cleanest pattern.

k3s is a 60MB Kubernetes distribution that runs comfortably on 1GB-RAM ARM64 boxes. Each device runs a single-node k3s cluster (or a 3-node mini-cluster across nearby devices); workloads are pods deployed via a fleet operator.

The Ansible role is to:

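# install k3s during image build (Pattern 2 + Pattern 3)
- name: Install k3s
  ansible.builtin.shell: |
    curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.30.5+k3s1 sh -s - \
      --token={{ vault_k3s_token }} \
      --server=https://fleet-control.kv.local:6443
  args:
    creates: /usr/local/bin/k3s   # skip the install if k3s is already present
  no_log: true                    # keep the join token out of logs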

The fleet manifest is GitOps:

# fleet/kiosk-app/fleet.yaml
namespace: kiosk
helm:
  chart: oci://registry.kv.local/charts/kiosk-app
  version: 1.4.2

targets:
  - name: canary
    clusterSelector:
      matchLabels:
        wave: canary
  - name: prod
    clusterSelector:
      matchLabels:
        wave: prod

Devices labelled wave: canary get the new chart first; once metrics confirm health, you bump wave: prod to the same version. The label change is tracked in Git; the rollback is git revert.

Ansible orchestrates the rollout (e.g., flipping labels in waves) but does not need to reach each device directly. The control plane publishes desired state; the device pulls it in.
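Label-based targeting is easy to reason about once you see the matching rule spelled out. The sketch below is a simplified model of matchLabels-style equality matching, not the fleet operator's own code; treating the first matching target as the winner is an assumption of this sketch, and the function names are hypothetical.

```python
def matches(match_labels, device_labels):
    """True when every selector key is present on the device with the same value."""
    return all(device_labels.get(key) == value for key, value in match_labels.items())


def pick_target(targets, device_labels):
    """Return the name of the first target whose selector matches, else None."""
    for target in targets:
        if matches(target["matchLabels"], device_labels):
            return target["name"]
    return None


targets = [
    {"name": "canary", "matchLabels": {"wave": "canary"}},
    {"name": "prod", "matchLabels": {"wave": "prod"}},
]
print(pick_target(targets, {"wave": "canary", "class": "kiosk"}))
print(pick_target(targets, {"wave": "lab"}))
```

Extra labels on the device (class, region, arch) do not prevent a match; only the keys named in the selector are compared, which is what lets one label flip move a device between waves.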


Connectivity reality: NAT, MQTT, and one-way trust

Most edge devices live behind carrier NAT (4G/5G/cable), which means no inbound connectivity from the control plane. Three device-initiated channels remain viable: HTTPS pull, MQTT, and WebSocket.

Each has trade-offs. HTTPS pull is the simplest but has high latency (your worst-case response time is one poll interval). MQTT is real-time but adds a broker dependency. WebSocket gives RPC semantics but is heavyweight for very small devices.

Pick one and standardise. Mixing transports per device class makes the control plane brittle; uniform transport with class-specific topics/labels makes it manageable.
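For the HTTPS pull option, the device-side poll loop needs two properties on an intermittent link: jitter, so that thousands of devices do not poll in the same second, and backoff, so that a dead link is not hammered. A minimal sketch follows; the doubling-on-failure policy and every name in it are assumptions for illustration, not something the transports above prescribe.

```python
import random


def next_poll_delay(base_interval, consecutive_failures, max_interval, rng=random):
    """Seconds to wait before the next poll.

    The ceiling doubles with each consecutive failure, capped at
    `max_interval`; the actual delay is drawn uniformly from the upper half
    of that ceiling so that devices spread out instead of polling in lockstep.
    """
    ceiling = min(max_interval, base_interval * (2 ** consecutive_failures))
    return rng.uniform(ceiling / 2, ceiling)


rng = random.Random(42)  # seeded only so the demo is repeatable
for failures in range(4):
    print(failures, round(next_poll_delay(1800, failures, 21600, rng=rng)))
```

With a 30-minute base interval and a 6-hour cap, a device that keeps failing settles into polling every 3 to 6 hours rather than retrying on every tick.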

For the AAP-aware reader: AAP automation mesh nodes are not viable as edge agents. They are too heavy, too tightly coupled to the controller, and require always-on network. Edge devices need the patterns above; AAP can still be the operator-facing UI that triggers fleet rollouts via webhooks.


Identity and trust at the edge

Each device must be uniquely identifiable and authenticate cryptographically. The chain:

  1. Hardware root of trust — TPM 2.0 module on the board, with a manufacturer-issued Endorsement Key (EK).
  2. Device certificate — issued at provisioning by an internal CA, bound to the TPM’s EK or AIK. Stored in TPM-protected NVRAM.
  3. Workload secrets — short-lived tokens issued by the control plane after device cert validation. Rotated automatically.

The Ansible workflow:

# device-provision.yml — runs on the build host or first-boot enrolment
- name: Generate device certificate via internal CA
  community.crypto.x509_certificate:
    path: "/var/lib/devices/{{ device_serial }}.crt"
    privatekey_path: "/var/lib/devices/{{ device_serial }}.key"
    csr_path: "/var/lib/devices/{{ device_serial }}.csr"
    provider: ownca
    ownca_path: /etc/pki/edge-ca/ca.crt
    ownca_privatekey_path: /etc/pki/edge-ca/ca.key
    ownca_privatekey_passphrase: "{{ vault_ca_passphrase }}"
    ownca_not_after: "+1095d"   # 3 years
  no_log: true

- name: Bake cert into device image at first boot
  ansible.builtin.copy:
    src: "/var/lib/devices/{{ device_serial }}.crt"
    dest: "/etc/pki/device/cert.pem"
  delegate_to: "{{ device_address }}"
  no_log: true

Workload secrets (S3 keys, MQTT credentials) are not stored on the device; the device authenticates with its certificate and the control plane returns short-lived credentials each session. This pattern is what makes a stolen-device scenario manageable: revoke the cert, rotate workload creds, the lost device is locked out within minutes.
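The short-lived credential idea can be sketched with the standard library. Note the substitution: a shared-secret HMAC stands in here for whatever token format your control plane really issues, and device certificate validation is assumed to have happened already. All names below are hypothetical.

```python
import hashlib
import hmac
import time


def issue_token(device_id, secret, ttl_seconds, now=None):
    """Issue 'device_id:expiry:signature' after the device cert was validated."""
    now = int(time.time()) if now is None else int(now)
    payload = f"{device_id}:{now + ttl_seconds}"
    signature = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"


def verify_token(token, secret, revoked_devices, now=None):
    """Accept only well-formed, correctly signed, unexpired, unrevoked tokens."""
    now = int(time.time()) if now is None else int(now)
    try:
        device_id, expires, signature = token.rsplit(":", 2)
        expires = int(expires)
    except ValueError:
        return False
    expected = hmac.new(secret, f"{device_id}:{expires}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    return now < expires and device_id not in revoked_devices


secret = b"control-plane-demo-secret"  # demo value only
token = issue_token("kiosk-001", secret, ttl_seconds=900, now=1_000)
print(verify_token(token, secret, revoked_devices=set(), now=1_500))
print(verify_token(token, secret, revoked_devices={"kiosk-001"}, now=1_500))
```

The stolen-device case is the second call: once the device ID is on the revocation list the token is rejected immediately, and even without revocation it would stop working when the 15-minute lifetime runs out.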


OTA security considerations

Any OTA update path is also an attack path. Defenses:

Auditors love the signed-everything story. Auditors hate the “we trust whatever Git pushes” story. Make the right choice early.


Constrained-device pragmatics

A 1GB-RAM device cannot run an Ansible execution environment full of Python deps. Strategies:

For genuinely tiny devices (microcontrollers under 256KB), Ansible is the wrong tool — you ship pre-built firmware over OTA frameworks like Mender, RAUC, or Zephyr’s MCUboot. Ansible lives one layer up: provisioning the gateway that talks to those firmware updaters.


Fleet operator: state machine and dashboards

The fleet operator (whether Rancher Fleet, a custom controller, or an AAP workflow with EDA) tracks each device’s state machine:

[unenrolled] -- enrol --> [pending] -- ack --> [active]
                                                  |
                          rollout (canary) -------|
                                                  v
                          [pending-update] -- apply ok --> [active@new]
                                            -- apply fail --> [active@old]   (rollback)
                                                  |
                          retire ----------------|--> [retired]
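The diagram can be read as a transition table. Here is one way to encode it, as a sketch: the event names (`enrol`, `ack`, `rollout`, `apply_ok`, `apply_fail`, `retire`) are taken from the arrows, while the `Device` class, the decision to track [active@new] and [active@old] through a `version` field, and the reading that `retire` leaves from the active state are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    state: str = "unenrolled"
    version: Optional[str] = None  # image version currently running
    target: Optional[str] = None   # version being rolled out, if any

    def handle(self, event: str, version: Optional[str] = None) -> str:
        key = (self.state, event)
        if key == ("unenrolled", "enrol"):
            self.state = "pending"
        elif key == ("pending", "ack"):
            self.state, self.version = "active", version
        elif key == ("active", "rollout"):
            self.state, self.target = "pending-update", version
        elif key == ("pending-update", "apply_ok"):
            # [active@new]: the rollout target becomes the running version.
            self.state, self.version, self.target = "active", self.target, None
        elif key == ("pending-update", "apply_fail"):
            # [active@old]: rollback, the running version is unchanged.
            self.state, self.target = "active", None
        elif key == ("active", "retire"):
            self.state = "retired"
        else:
            raise ValueError(f"illegal transition: {event!r} from {self.state!r}")
        return self.state


kiosk = Device()
for event, version in [("enrol", None), ("ack", "1.4.1"),
                       ("rollout", "1.4.2"), ("apply_fail", None)]:
    kiosk.handle(event, version)
print(kiosk)
```

Rejecting illegal transitions loudly is deliberate: a device that reports `apply_ok` without a preceding rollout is a signal worth investigating, not a state to absorb silently.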

Each transition emits a metric (Prometheus counter) and an event (Kafka, MQTT, or simple HTTP webhook). The dashboard answers:

Wave management is critical at scale: 1% canary for 24 hours, 5% wave for 48 hours, 25% wave for 72 hours, then full rollout. Bake this into the fleet operator config; do not let release managers override it manually.
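One way to bake the waves into config is to derive each device's wave from a stable hash of its serial number, so membership never depends on a human picking devices. The sketch below assumes the percentages are cumulative (the 5% wave includes the 1% canary) and that each soak period starts when its wave starts; the wave names and function names are hypothetical.

```python
import hashlib

# (wave name, cumulative % of fleet, soak hours before the next wave starts)
WAVES = [("canary", 1, 24), ("wave-5", 5, 48), ("wave-25", 25, 72), ("full", 100, 0)]


def bucket(serial: str) -> int:
    """Map a serial number to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(serial.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def wave_for(serial: str) -> str:
    """Return the earliest wave whose cumulative percentage covers this device."""
    b = bucket(serial)
    for name, cumulative_pct, _soak in WAVES:
        if b < cumulative_pct:
            return name
    raise AssertionError("unreachable: the last wave covers 100%")


def wave_start_hours() -> dict:
    """Hours after rollout start at which each wave opens."""
    starts, elapsed = {}, 0
    for name, _pct, soak in WAVES:
        starts[name] = elapsed
        elapsed += soak
    return starts


print(wave_start_hours())
print(wave_for("kiosk-00042"))
```

Under these assumptions a full rollout takes 144 hours (24 + 48 + 72) to reach the last device, which is a useful number to state up front when someone asks to skip a wave.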


Anti-patterns that destroy edge fleets


Frequently asked questions

1. When is ansible-pull enough vs when do I need image-based updates? Up to ~5k stable devices with reliable network and a config-only change set: ansible-pull is enough. Beyond that, or any case where you need to update kernel / glibc / firmware atomically: move to bootc/ostree image-based.

2. Can I run AAP at the edge? No. AAP is a datacentre product — controller, hub, automation mesh assume always-on network and ample resources. Use AAP as the back-of-house operator UI for the fleet operator (trigger rollouts, run reports), not as a runtime on each device.

3. How do I pin Ansible content for edge? Build the EE once on the build host with all collection versions pinned and signed. Bake the EE into the device image (Pattern 2). Devices never resolve Galaxy, never pip install, never reach out for content at runtime.

4. What about bandwidth costs? A bootc upgrade typically transfers only the OCI layers that changed. For a 200MB app change on a 4GB OS image, the compressed transfer can be in the region of 50MB, though the exact figure depends on how the image is layered. With waves and randomised jitter, fleet bandwidth at peak is manageable. Always cap concurrent rollouts per gateway, and prefer cellular off-peak windows for non-urgent updates.

5. How do I handle 4G outages during update? The bootc upgrade process is resumable: the OCI fetcher caches partial layers. If connectivity drops mid-fetch, the next attempt resumes. Atomic apply happens only after full fetch + verification.

6. What about OT / industrial devices that cannot run Ansible at all? You don’t run Ansible on them. You run Ansible on the gateway that aggregates their data (the “edge gateway”, which is a Linux box). The gateway does protocol translation (Modbus/OPC-UA → MQTT) and applies updates to the OT devices via vendor-specific tooling. Ansible owns the gateway; vendor tooling owns the deepest leaf.

7. Are there security implications to using cosign / sigstore at the edge? Yes — the device must trust the cosign public keys, and the OCSP/CRL infrastructure must be reachable for revocation checks. Bake the public key bundle into the image; for revocation, use short-lived certs (renew every 24h) and reject expired without OCSP lookup.

8. How do I handle a fleet of mixed device types and OSes? Tag the inventory by capability: arch=arm64, ram=2g, network=4g, os=rhel9, class=kiosk. Roles use when guards on tags. Image builds produce per-class images. Fleet operator targets devices by tag, not by serial number — a serial-number-by-serial-number rollout does not scale.

9. What’s the right way to monitor an edge fleet? Each device emits a small metrics blob to Prometheus pushgateway / OTel collector / MQTT topic on every check-in. The control plane aggregates by version, wave, region. Anomaly detection (a device that hasn’t checked in for 7 days) triggers the maintenance ticket flow. Avoid scraping every device; that defeats the pull-mode pattern.

10. What’s the single most underrated edge practice? The deadman watchdog rollback. Every edge device should have a small program that monitors “have I successfully completed health checks recently?” and triggers a bootc rollback (or git checkout previous) if the answer is no. Without this, a regression deployed during your 4-minute connectivity window can leave a device unrecoverable until physical service. With it, the worst case is a slightly-out-of-date device.
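The anomaly check from question 9 (a device that has not checked in for 7 days) is a one-liner once the control plane keeps a last-seen timestamp per device. A minimal sketch, assuming that map already exists; the function name and data shape are hypothetical.

```python
from datetime import datetime, timedelta


def stale_devices(last_seen, now, max_silence=timedelta(days=7)):
    """Return the sorted IDs of devices silent for longer than `max_silence`.

    `last_seen` maps device_id -> datetime of the most recent check-in.
    """
    return sorted(device for device, seen in last_seen.items()
                  if now - seen > max_silence)


now = datetime(2024, 3, 15, 9, 0)
last_seen = {
    "meter-gw-001": datetime(2024, 3, 15, 8, 55),
    "meter-gw-002": datetime(2024, 3, 1, 14, 0),   # silent for two weeks
    "meter-gw-003": datetime(2024, 3, 9, 9, 0),    # six days: still within limit
}
print(stale_devices(last_seen, now))
```

Because the check runs over data the devices pushed, it keeps the control plane from ever having to scrape a device.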


Hands-on lab — ansible-pull with signed commits

This lab simulates a small edge fleet with a single device pulling a signed playbook from a local Git server.

Prerequisites: Linux box (or Pi), git, gpg, ansible-core ≥ 2.16.

mkdir -p edge-lab/{repo,device}
cd edge-lab

# 1. Set up the signing key first, so the very first commit can be signed
#    (gpg prompts for a passphrase; an empty one is acceptable for this lab)
gpg --quick-gen-key 'edge-signer@kv.local' rsa4096 sign 1y

# 2. Local git repo with a tiny playbook
git init repo
git -C repo config user.name 'Edge Signer'
git -C repo config user.email 'edge-signer@kv.local'
git -C repo config user.signingkey 'edge-signer@kv.local'
cat > repo/edge-pull.yml << 'EOF'
- hosts: localhost
  connection: local
  tasks:
    - ansible.builtin.debug:
        msg: "Edge pull at {{ ansible_date_time.iso8601 }}, commit {{ lookup('env','COMMIT') | default('unknown', true) }}"
    - ansible.builtin.copy:
        dest: /tmp/edge-status
        content: |
          last-pull: {{ ansible_date_time.iso8601 }}
          host: {{ ansible_facts.hostname }}
EOF
( cd repo && git add edge-pull.yml && git commit -S -m "v1: tiny edge play" )

# 3. The device side: configure ansible-pull
cat > device/pull.sh << 'EOF'
#!/bin/bash
set -e
# Commit checked out before this pull ("unset" on the first run);
# exported so the playbook's env lookup can read it
export COMMIT=$(git -C /tmp/edge-repo rev-parse HEAD 2>/dev/null || echo unset)
# ${PWD} is resolved at run time, so run this script from the edge-lab directory
exec ansible-pull \
  -U "${PWD}/repo" \
  -i localhost, \
  -d /tmp/edge-repo \
  --verify-commit \
  edge-pull.yml
EOF
chmod +x device/pull.sh

# 4. Run it
device/pull.sh
cat /tmp/edge-status

# 5. Add an unsigned commit and watch --verify-commit reject it
( cd repo && git commit --allow-empty --no-gpg-sign -m "unsigned" )
device/pull.sh   # should fail because tip is unsigned

The lab proves the signing chain end-to-end: the device pulls only commits signed by a trusted key. Extend it: add a systemd timer in device/, add a snapshot/rollback step in the playbook, add a watchdog that rolls back if /tmp/edge-status is older than N minutes.
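As a starting point for the watchdog extension, the freshness check on /tmp/edge-status can be as small as this. It is a sketch of the check only: the function name is hypothetical, a missing file is treated as stale by assumption, and what the watchdog does next (restore a snapshot, check out the previous commit) is left to you.

```python
import os
import time


def status_is_stale(path, max_age_minutes, now=None):
    """True if `path` is missing or was last modified more than
    `max_age_minutes` ago."""
    now = time.time() if now is None else now
    try:
        modified = os.path.getmtime(path)
    except OSError:
        return True  # no status file at all counts as stale
    return (now - modified) > max_age_minutes * 60


if status_is_stale("/tmp/edge-status", max_age_minutes=45):
    print("edge-status is stale or missing: trigger rollback here")
else:
    print("edge-status is fresh")
```

Pick the threshold with the timer in mind: with a 30-minute pull interval plus up to 10 minutes of jitter, anything under 40 minutes would raise false alarms.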


Glossary


Certification mapping


Next steps

You now have an opinionated architecture for managing real edge fleets at any scale. The remaining specialist lessons cover:

If you only take one habit from this lesson: never mutate a running edge OS. Every change ships as a new signed image; every rollback is automatic and atomic. Once that property holds, every other edge problem becomes solvable.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
