Shell Lesson 33 of 42

Shell Bootstrap & cloud-init: Scripts That Run Before Any Package Manager, Network, or User Exists — POSIX-Strict Provisioning From First Boot

Why Bootstrap Is Its Own Discipline

You’ve written hundreds of shell scripts. Then you’re handed: “write the script that turns a freshly-booted Linux VM into a member of our cluster.” You write the same code you usually write (curl -fsSL https://api.example.com/..., jq -r '.token', apt-get install -y nginx) and 30% of the boots fail with errors you’ve never seen.

Bootstrap scripts run in a hostile environment because the system isn’t done assembling itself yet. The shell exists, PID 1 is running, but most of the userland tools you depend on either don’t exist or aren’t ready.

The disciplines that matter:

| Discipline | Why |
|---|---|
| POSIX-strict, busybox-compatible | Your script may run under busybox ash (Alpine) or dash, where bash extensions such as [[ ]], <<<, and arrays are missing or incomplete |
| Wait-for-X loops with timeouts | Network, DNS, package manager locks, services: any of them may not be ready |
| Detect-then-act distro detection | apt-get on Ubuntu, dnf on RHEL, apk on Alpine; there is no universal package CLI |
| Idempotent from zero | The cloud may re-run user-data; running again must not break anything |
| No external dependencies on first run | Don’t curl https://... for code; embed it inline or fetch it from instance metadata |
| Fail loud and recoverable | A failed bootstrap should leave clear logs and not produce a half-configured host |

This lesson is the cross-distro pattern set: cloud-init anatomy, the metadata-service contract, network-up detection, busybox-safe shell, and a copy-pasteable bootstrap template.

cloud-init: The Bootstrap Framework You Already Have

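cloud-init is the de facto bootstrap framework on AWS, Azure, GCP, OpenStack, and bare metal. When a VM boots, cloud-init reads “user-data” supplied by the platform and acts on it. User-data can be a shell script, a cloud-config YAML document, or a multi-part archive that combines both; each form is covered below.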

cloud-init runtime stages

                   BOOT
                    │
                    ▼
        ┌──────────────────────────┐
        │  cloud-init local        │  before networking
        │  (datasource, hostname)  │
        └────────────┬─────────────┘
                     │
                     ▼
        ┌──────────────────────────┐
        │  cloud-init init         │  network is up
        │  (resize disks, ssh keys)│
        └────────────┬─────────────┘
                     │
                     ▼
        ┌──────────────────────────┐
        │  cloud-init config       │  modules: write_files, runcmd, etc.
        │  (apt sources, packages) │
        └────────────┬─────────────┘
                     │
                     ▼
        ┌──────────────────────────┐
        │  cloud-init final        │  user_data runs here (shell scripts)
        │  (runcmd, scripts, etc.) │
        └────────────┬─────────────┘
                     │
                     ▼
                  READY

User-data scripts run during the final stage — after networking is configured, after package sources are set up, but before the system is fully “ready” for users. You’re root, you have network, you have a stable hostname, but other services may still be starting.

A minimal user-data shell script

#!/bin/sh
# cloud-init user-data: bootstrap-v1
# Runs ONCE on first boot. Output goes to /var/log/bootstrap.log (see exec below).

set -eu  # no -o pipefail: not every /bin/sh supports it (e.g. dash)
exec >> /var/log/bootstrap.log 2>&1
echo "[$(date -u +%FT%TZ)] bootstrap starting"

# ... your bootstrap work ...

echo "[$(date -u +%FT%TZ)] bootstrap complete"

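Notes on the script:

  1. #!/bin/sh keeps the script runnable under dash and busybox ash, not just bash.

  2. set -eu aborts on the first failing command or the first reference to an unset variable.

  3. exec >> /var/log/bootstrap.log 2>&1 sends all later output to that file, so look there rather than in /var/log/cloud-init-output.log.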

cloud-config: declarative bootstrap

For straightforward cases, cloud-config YAML is more reliable than shell scripts:

#cloud-config
hostname: web-001
fqdn: web-001.prod.internal
manage_etc_hosts: true

users:
  - name: deploy
    groups: sudo
    shell: /bin/bash
    sudo: 'ALL=(ALL) NOPASSWD:ALL'
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... deploy@example

write_files:
  - path: /etc/myapp/config.json
    permissions: '0640'
    owner: 'root:myapp'
    content: |
      {"port": 8080, "log_level": "info"}

package_update: true
package_upgrade: false   # don't auto-upgrade in production; pin versions
packages:
  - curl
  - jq
  - chrony

runcmd:
  - systemctl enable --now chrony
  - /opt/myapp/bin/post-install.sh

cloud-config is declarative and idempotent by design. Use it for the static parts (users, packages, files); reserve shell scripts for dynamic logic that cloud-config can’t express.

Multi-part user-data

For complex bootstraps, combine cloud-config and shell:

Content-Type: multipart/mixed; boundary="===PART==="
MIME-Version: 1.0

--===PART===
Content-Type: text/cloud-config

#cloud-config
packages: [jq, curl]

--===PART===
Content-Type: text/x-shellscript

#!/bin/sh
set -eu
echo "shell stage"
# ... your provisioning ...

--===PART===--

cloud-init applies the cloud-config part first (installing jq and curl), then runs your shell script with those tools available. You can generate the archive with cloud-init devel make-mime:

cloud-init devel make-mime \
  -a packages.cfg:cloud-config \
  -a bootstrap.sh:x-shellscript \
  > combined.mime
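Where cloud-init is not installed on the build machine, the same archive can be assembled by hand, since the format is only the headers and boundaries shown earlier. The following is a minimal sketch, not cloud-init's own implementation: make_mime_archive is a hypothetical helper name, it reuses the file:type argument convention from make-mime, and it handles plain text parts only (no encoding, no attachments).

```shell
#!/bin/sh
# Sketch: hand-rolled multi-part user-data.
# make_mime_archive is a hypothetical helper, not part of cloud-init.
make_mime_archive() {
  boundary='===PART==='
  printf 'Content-Type: multipart/mixed; boundary="%s"\n' "$boundary"
  printf 'MIME-Version: 1.0\n\n'
  for spec in "$@"; do
    file=${spec%%:*}    # text before the first colon
    subtype=${spec#*:}  # text after the first colon
    printf '%s\n' "--$boundary"
    printf 'Content-Type: text/%s\n\n' "$subtype"
    cat "$file"
    printf '\n'
  done
  printf '%s\n' "--$boundary--"   # closing boundary
}

# Usage:
#   make_mime_archive packages.cfg:cloud-config bootstrap.sh:x-shellscript \
#     > combined.mime
```

Prefer the real make-mime tool when it is available; this sketch is only meant to show how little structure the archive needs.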

The Metadata Service: Bootstrap-Time Configuration

Each cloud platform exposes an HTTP metadata service at a well-known address that VMs can query for instance-specific data: hostname, IP, region, IAM credentials, user-supplied tags, and arbitrary user-data.

| Cloud | Endpoint | Required header or token |
|---|---|---|
| AWS | http://169.254.169.254/latest/meta-data/ | IMDSv2 requires a PUT to obtain a session token |
| Azure | http://169.254.169.254/metadata/instance?api-version=2021-02-01 | Header: Metadata: true |
| GCP | http://metadata.google.internal/computeMetadata/v1/ | Header: Metadata-Flavor: Google |

AWS IMDSv2 (the modern, secured version)

# Get a session token (valid 6 hours).
TOKEN=$(curl -fsS -X PUT 'http://169.254.169.254/latest/api/token' \
  -H 'X-aws-ec2-metadata-token-ttl-seconds: 21600')

# Use it.
INSTANCE_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/region)
ROLE=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/)

# Get IAM credentials for the role.
CREDS=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE")
# CREDS is JSON: { AccessKeyId, SecretAccessKey, Token, Expiration, ... }

IMDSv1 (no token) is being phased out. Always use IMDSv2 in new bootstrap scripts.
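The IMDSv2 calls above repeat the same header on every request. A small wrapper keeps that in one place. This is a sketch under assumptions: imds_init and imds_get are hypothetical helper names, and IMDS_BASE is a variable introduced here only so the endpoint can be overridden.

```shell
#!/bin/sh
# Sketch: fetch one IMDSv2 token, then reuse it for every metadata read.
# imds_init, imds_get and IMDS_BASE are hypothetical names.
IMDS_BASE=${IMDS_BASE:-http://169.254.169.254}
IMDS_TOKEN=

imds_init() {
  # Request a session token (valid 6 hours) and keep it in IMDS_TOKEN.
  IMDS_TOKEN=$(curl -fsS --max-time 2 -X PUT "$IMDS_BASE/latest/api/token" \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 21600') || return 1
  [ -n "$IMDS_TOKEN" ]
}

imds_get() {
  # $1 is a path below /latest/meta-data/, e.g. "instance-id".
  curl -fsS --max-time 2 -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" \
    "$IMDS_BASE/latest/meta-data/$1"
}

# Usage on an EC2 instance:
#   imds_init || exit 1
#   INSTANCE_ID=$(imds_get instance-id)
#   REGION=$(imds_get placement/region)
```

Call imds_init directly rather than inside $(...), otherwise the token is set in a subshell and lost.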

Azure metadata

INSTANCE=$(curl -fsS -H 'Metadata: true' \
  'http://169.254.169.254/metadata/instance?api-version=2021-02-01')

# Parse with jq if available; otherwise sed.
VM_NAME=$(echo "$INSTANCE" | jq -r '.compute.name')
REGION=$(echo "$INSTANCE" | jq -r '.compute.location')

For managed-identity tokens:

TOKEN=$(curl -fsS -H 'Metadata: true' \
  'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/' \
  | jq -r '.access_token')

GCP metadata

INSTANCE=$(curl -fsS -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true')

ZONE=$(curl -fsS -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal/computeMetadata/v1/instance/zone' \
  | awk -F/ '{print $NF}')

# Custom metadata key:
DEPLOY_ENV=$(curl -fsS -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal/computeMetadata/v1/instance/attributes/deploy-env')

# IAM identity token (for OIDC auth):
TOKEN=$(curl -fsS -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=https://my-service')

The metadata-service-as-secret-store pattern

Bootstrap scripts often need configuration that varies per-instance: which database to connect to, which API endpoint, what role to assume. Don’t bake these into AMIs; pass them via metadata:

# AWS launch template sets user-data including custom data:
# {"db_endpoint": "prod-db.us-west-2.internal", "feature_flags": ["x", "y"]}

USER_DATA=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/user-data)

# Or instance tags via API (requires IAM):
TAGS=$(aws ec2 describe-tags \
  --filters "Name=resource-id,Values=$INSTANCE_ID" \
  --query 'Tags[].[Key,Value]' --output text)

For secrets that shouldn’t be in user-data (because user-data is sometimes logged), fetch from AWS Secrets Manager / Azure Key Vault / GCP Secret Manager during bootstrap, using IAM that the metadata service makes available.
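For the AWS case, the fetch might look like the sketch below. It assumes the AWS CLI is already installed and that the instance role is allowed to read the secret. fetch_secret_to_file is a hypothetical helper, and the secret name and destination path in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: fetch one secret with the AWS CLI and store it readable by root only.
# fetch_secret_to_file is a hypothetical helper name.
fetch_secret_to_file() {
  secret_id=$1
  dest=$2
  region=$3
  tmp="$dest.tmp.$$"
  # The subshell keeps the restrictive umask from leaking into the caller.
  if (
    umask 077
    aws secretsmanager get-secret-value --region "$region" \
      --secret-id "$secret_id" --query SecretString --output text > "$tmp"
  ); then
    mv "$tmp" "$dest"   # rename so readers never see a partial file
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage (placeholder names):
#   fetch_secret_to_file prod/myapp/db-password /etc/myapp/db-password "$REGION"
```

Writing to a file instead of a shell variable keeps the value out of logs if the script is ever run with set -x.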

Wait-For-X Patterns: Don’t Race the System

Bootstrap is racy. Things that “should be there” might not be yet. The discipline: never assume; wait with a timeout.

Wait for network connectivity

wait_for_network() {
  timeout=${1:-60}
  i=0
  while [ "$i" -lt "$timeout" ]; do
    # POSIX: do not use [[ ]] or arrays.
    if getent hosts deb.debian.org >/dev/null 2>&1 || \
       getent hosts amazon.com >/dev/null 2>&1 || \
       getent hosts google.com >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "wait_for_network: timed out after ${timeout}s" >&2
  return 1
}

wait_for_network 60 || exit 1

Why three hosts? Any single name can fail to resolve for reasons unrelated to your instance (a provider outage, a filtered zone). If at least one of three names served by independent DNS zones resolves, DNS is generally working.

ping is the wrong primitive: ICMP is often blocked. DNS resolution followed by a real connection attempt is more reliable; note that wait_for_network as written checks only the DNS half.
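The wait_for_network function above proves only that DNS answers. A connection check can be layered on top of it. This is a sketch, assuming either curl or a busybox-style wget is present on the image; can_reach is a hypothetical helper name, and the URL should be whichever endpoint your bootstrap needs next.

```shell
#!/bin/sh
# Sketch: confirm that a connection actually succeeds, not just DNS.
# can_reach is a hypothetical helper name.
can_reach() {
  url=$1
  if command -v curl >/dev/null 2>&1; then
    # No -f here: any HTTP response, even a 404, proves connectivity.
    curl -sS -o /dev/null --connect-timeout 3 --max-time 5 "$url" 2>/dev/null
  elif command -v wget >/dev/null 2>&1; then
    wget -q -T 5 -O /dev/null "$url" 2>/dev/null
  else
    return 2   # no HTTP client available yet
  fi
}

# Usage:
#   wait_for_network 60 || exit 1
#   can_reach https://deb.debian.org/ || exit 1
```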

Wait for package-manager lock to release

wait_for_apt() {
  timeout=${1:-300}
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if ! pgrep -x apt-get >/dev/null 2>&1 && ! pgrep -x dpkg >/dev/null 2>&1; then
      # No apt process running.
      return 0
    fi
    sleep 2
    i=$((i + 2))
  done
  return 1
}

wait_for_apt 300 || { echo "apt-get is busy; aborting"; exit 1; }
apt-get update

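Your script is rarely the only apt user at first boot: cloud-init installs packages itself, and on Ubuntu the apt-daily and unattended-upgrades jobs can start in the background. If your script runs apt while one of them holds the dpkg lock, it fails. Wait for the lock.

For dnf/yum, check pgrep -x dnf and pgrep -x yum. For apk, check pgrep -x apk.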
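Polling for process names is a heuristic: it can miss other lock holders, and the lock can be taken again between the check and your own apt call. Newer apt releases can wait on the lock themselves through the DPkg::Lock::Timeout option (to the best of my knowledge available from apt 1.9.11, so roughly Ubuntu 20.04 and Debian 11 onward; verify against your base image). A minimal sketch, where apt_get_locked is a hypothetical wrapper and APT_GET is a variable introduced here only so the command can be swapped out:

```shell
#!/bin/sh
# Sketch: let apt-get itself wait up to 300s for the dpkg lock.
# apt_get_locked and APT_GET are hypothetical names.
APT_GET=${APT_GET:-apt-get}

apt_get_locked() {
  DEBIAN_FRONTEND=noninteractive "$APT_GET" \
    -o DPkg::Lock::Timeout=300 "$@"
}

# Usage:
#   apt_get_locked update
#   apt_get_locked install -y jq
```

On images with an older apt, keep the polling loop as the fallback.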

Wait for systemd to finish booting

wait_for_systemd_running() {
  timeout=${1:-120}
  i=0
  while [ "$i" -lt "$timeout" ]; do
    state=$(systemctl is-system-running 2>/dev/null || true)
    case "$state" in
      running|degraded) return 0 ;;
    esac
    sleep 2
    i=$((i + 2))
  done
  return 1
}

wait_for_systemd_running

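systemctl is-system-running prints one of: initializing, starting, running, degraded, maintenance, stopping, offline, or unknown.

degraded is acceptable for “system is functional”; it means at least one unit failed, which is often a non-critical service. One caveat: a script launched by cloud-init’s final stage is itself part of the boot, so the state may stay at starting until the script exits. In that context, treat a timeout as expected rather than fatal.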

Wait for a specific service

wait_for_service() {
  service=$1
  timeout=${2:-60}
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if systemctl is-active --quiet "$service" 2>/dev/null; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

wait_for_service docker 30 || exit 1
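The four functions above share one shape: poll a command, sleep, give up after a deadline. That shape can be factored into a single helper. A minimal sketch, where wait_for is a hypothetical name and the variables are prefixed to avoid clobbering the caller's own timeout or i:

```shell
#!/bin/sh
# Sketch: generic poll-with-timeout. wait_for is a hypothetical helper:
#   wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for() {
  wf_timeout=$1
  wf_interval=$2
  shift 2
  wf_elapsed=0
  while [ "$wf_elapsed" -lt "$wf_timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$wf_interval"
    wf_elapsed=$((wf_elapsed + wf_interval))
  done
  echo "wait_for: timed out after ${wf_timeout}s: $*" >&2
  return 1
}

# Usage:
#   wait_for 60 1 getent hosts deb.debian.org || exit 1
#   wait_for 30 1 systemctl is-active --quiet docker || exit 1
```

The elapsed counter only tracks sleep time, so a slow probe command makes the real deadline longer than the nominal one.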

Distro Detection: One Detection, Three Code Paths

detect_distro() {
  if [ -r /etc/os-release ]; then
    # /etc/os-release is the modern standard.
    . /etc/os-release   # sets ID, ID_LIKE, VERSION_ID, etc. as shell variables
    case "${ID:-}" in
      ubuntu|debian) DISTRO=debian; PKG=apt-get ;;
      rhel|centos|fedora|rocky|almalinux|amzn) DISTRO=rhel; PKG=dnf ;;
      alpine) DISTRO=alpine; PKG=apk ;;
      arch) DISTRO=arch; PKG=pacman ;;
      *)
        # Fall back to ID_LIKE for derivatives. It is optional in os-release,
        # so default it to empty to stay safe under set -u.
        case "${ID_LIKE:-}" in
          *debian*) DISTRO=debian; PKG=apt-get ;;
          *rhel*|*fedora*) DISTRO=rhel; PKG=dnf ;;
          *) DISTRO=unknown; PKG="" ;;
        esac
        ;;
    esac
  else
    DISTRO=unknown; PKG=""
  fi

  # dnf may not exist on older RHEL/CentOS 7; fall back to yum.
  if [ "$PKG" = "dnf" ] && ! command -v dnf >/dev/null 2>&1; then
    PKG=yum
  fi

  export DISTRO PKG
}

detect_distro
echo "Detected: $DISTRO using $PKG"

/etc/os-release is supported on every modern Linux; it’s the canonical source. Older systems had /etc/redhat-release, /etc/alpine-release, etc. — fall back to those if needed.
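A fallback for hosts without /etc/os-release could look like the sketch below. detect_legacy_distro is a hypothetical helper, /etc/debian_version is my addition to the files named above, and the optional root prefix exists only so the function can be pointed at a test directory.

```shell
#!/bin/sh
# Sketch: guess the distro family when /etc/os-release is missing.
# detect_legacy_distro is a hypothetical helper name.
detect_legacy_distro() {
  root=${1:-}   # optional path prefix, empty on a real host
  if [ -r "$root/etc/alpine-release" ]; then
    echo alpine
  elif [ -r "$root/etc/redhat-release" ]; then
    echo rhel
  elif [ -r "$root/etc/debian_version" ]; then
    echo debian
  else
    echo unknown
  fi
}

# Usage:
#   [ -r /etc/os-release ] || DISTRO=$(detect_legacy_distro)
```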

Cross-distro install function

install_pkg() {
  pkg=$1
  case "$DISTRO" in
    debian)
      DEBIAN_FRONTEND=noninteractive apt-get install -y "$pkg"
      ;;
    rhel)
      "$PKG" install -y "$pkg"
      ;;
    alpine)
      apk add --no-cache "$pkg"
      ;;
    arch)
      pacman -Sy --noconfirm "$pkg"
      ;;
    *)
      echo "install_pkg: unknown distro $DISTRO" >&2
      return 1
      ;;
  esac
}

install_pkg jq
install_pkg curl

Note the DEBIAN_FRONTEND=noninteractive on the apt path: it suppresses interactive prompts (for example, grub configuration questions) that would otherwise hang the bootstrap.
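install_pkg calls the package manager every time, which slows re-runs and adds one more thing that can fail. A guard that installs only when the command is missing keeps repeat runs cheap. A minimal sketch: ensure_cmd is a hypothetical helper, and it relies on the install_pkg function defined above.

```shell
#!/bin/sh
# Sketch: install a package only when the command it provides is missing.
# ensure_cmd is a hypothetical helper; install_pkg is the function defined
# earlier in this lesson and must be loaded before ensure_cmd is called.
ensure_cmd() {
  cmd=$1
  pkg=${2:-$1}   # package name defaults to the command name
  if command -v "$cmd" >/dev/null 2>&1; then
    return 0     # already present, nothing to do
  fi
  install_pkg "$pkg"
}

# Usage:
#   ensure_cmd jq
#   ensure_cmd gpg gnupg   # command name and package name differ
```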

A Cross-Distro Bootstrap Template

#!/bin/sh
# bootstrap.sh — first-boot provisioning, POSIX-strict.
# Usable as cloud-init user-data on AWS, Azure, GCP, bare metal.

set -eu
exec >> /var/log/bootstrap.log 2>&1

log() { echo "[$(date -u +%FT%TZ)] [bootstrap] $*"; }

# ─── 0. Idempotency guard ─────────────────────────────────────────────────
MARKER=/var/lib/bootstrap/done.v1
if [ -f "$MARKER" ]; then
  log "already bootstrapped at $(cat "$MARKER"); skipping"
  exit 0
fi

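# ─── 1. Wait for system to be ready ───────────────────────────────────────
log "waiting for systemd to settle..."
# Skipped where systemctl is absent (e.g. Alpine with OpenRC). Under cloud-init
# the state may stay "starting" until this script exits, so a timeout here is
# logged and tolerated rather than treated as fatal.
if command -v systemctl >/dev/null 2>&1; then
  i=0
  while [ "$i" -lt 60 ]; do
    state=$(systemctl is-system-running 2>/dev/null || true)
    case "$state" in
      running|degraded) break ;;
    esac
    sleep 2
    i=$((i + 2))
  done
  [ "$i" -lt 60 ] || log "systemd not settled after 60s; continuing"
fi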

log "waiting for network..."
i=0
while [ "$i" -lt 60 ]; do
  if getent hosts deb.debian.org >/dev/null 2>&1 \
     || getent hosts amazon.com >/dev/null 2>&1 \
     || getent hosts google.com >/dev/null 2>&1; then
    break
  fi
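  sleep 1
  i=$((i + 1))
done
# Unlike the systemd wait, no network is fatal: every later step needs it.
[ "$i" -lt 60 ] || { log "network not ready after 60s; aborting"; exit 1; }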

# ─── 2. Distro detection ──────────────────────────────────────────────────
. /etc/os-release
case "$ID" in
  ubuntu|debian) DISTRO=debian ;;
  rhel|centos|fedora|rocky|almalinux|amzn) DISTRO=rhel ;;
  alpine) DISTRO=alpine ;;
  *) log "unknown distro: $ID"; exit 1 ;;
esac
log "detected: $DISTRO ($PRETTY_NAME)"

# ─── 3. Wait for package manager lock ─────────────────────────────────────
log "waiting for package manager lock..."
i=0
while [ "$i" -lt 300 ]; do
  case "$DISTRO" in
    debian)
      pgrep -x apt-get >/dev/null 2>&1 || pgrep -x dpkg >/dev/null 2>&1 || break
      ;;
    rhel)
      pgrep -x dnf >/dev/null 2>&1 || pgrep -x yum >/dev/null 2>&1 || break
      ;;
    alpine)
      pgrep -x apk >/dev/null 2>&1 || break
      ;;
  esac
  sleep 2
  i=$((i + 2))
done

# ─── 4. Install base packages ─────────────────────────────────────────────
log "installing base packages..."
case "$DISTRO" in
  debian)
    DEBIAN_FRONTEND=noninteractive apt-get update
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
      curl jq ca-certificates chrony
    ;;
  rhel)
    # Prefer dnf; fall back to yum on older releases (e.g. CentOS 7).
    if command -v dnf >/dev/null 2>&1; then PKG=dnf; else PKG=yum; fi
    "$PKG" install -y curl jq ca-certificates chrony
    ;;
  alpine)
    # shadow provides useradd, which step 7 relies on.
    apk add --no-cache curl jq ca-certificates chrony bash shadow
    ;;
esac

# ─── 5. Fetch metadata (cloud-specific) ───────────────────────────────────
log "fetching instance metadata..."
fetch_metadata_aws() {
  TOKEN=$(curl -fsS -X PUT 'http://169.254.169.254/latest/api/token' \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 21600' || true)
  if [ -n "${TOKEN:-}" ]; then
    INSTANCE_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id || true)
    REGION=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/placement/region || true)
    log "AWS: instance=$INSTANCE_ID region=$REGION"
    echo "$INSTANCE_ID" > /etc/instance-id
    echo "$REGION" > /etc/region
  fi
}

fetch_metadata_azure() {
  if curl -fsS -H 'Metadata: true' \
    'http://169.254.169.254/metadata/instance?api-version=2021-02-01' >/tmp/azure.json; then
    VM_NAME=$(jq -r '.compute.name' /tmp/azure.json)
    REGION=$(jq -r '.compute.location' /tmp/azure.json)
    log "Azure: vm=$VM_NAME region=$REGION"
    echo "$VM_NAME" > /etc/instance-id
    echo "$REGION" > /etc/region
  fi
}

fetch_metadata_gcp() {
  if curl -fsS -H 'Metadata-Flavor: Google' \
    http://metadata.google.internal/computeMetadata/v1/instance/id >/tmp/gcp.id; then
    INSTANCE_ID=$(cat /tmp/gcp.id)
    ZONE=$(curl -fsS -H 'Metadata-Flavor: Google' \
      http://metadata.google.internal/computeMetadata/v1/instance/zone \
      | awk -F/ '{print $NF}')
    log "GCP: instance=$INSTANCE_ID zone=$ZONE"
    echo "$INSTANCE_ID" > /etc/instance-id
    echo "$ZONE" > /etc/region
  fi
}

# Detect cloud and fetch.
if curl -fsS --max-time 1 'http://169.254.169.254/latest/meta-data/' \
   -H 'X-aws-ec2-metadata-token: 1' >/dev/null 2>&1 \
   || curl -fsS --max-time 1 -X PUT 'http://169.254.169.254/latest/api/token' \
   -H 'X-aws-ec2-metadata-token-ttl-seconds: 60' >/dev/null 2>&1; then
  fetch_metadata_aws
elif curl -fsS --max-time 1 -H 'Metadata: true' \
  'http://169.254.169.254/metadata/instance?api-version=2021-02-01' >/dev/null 2>&1; then
  fetch_metadata_azure
elif curl -fsS --max-time 1 -H 'Metadata-Flavor: Google' \
  http://metadata.google.internal/computeMetadata/v1/instance/id >/dev/null 2>&1; then
  fetch_metadata_gcp
else
  log "no recognizable metadata service; running on bare metal?"
fi

# ─── 6. Configure timekeeping ─────────────────────────────────────────────
log "starting chrony..."
systemctl enable --now chrony chronyd 2>/dev/null || \
  systemctl enable --now chronyd 2>/dev/null || \
  systemctl enable --now chrony 2>/dev/null || true

# ─── 7. Create deploy user ────────────────────────────────────────────────
log "creating deploy user..."
if ! id -u deploy >/dev/null 2>&1; then
  useradd --system --create-home --shell /bin/bash deploy
  install -d -m 0700 -o deploy -g deploy /home/deploy/.ssh
fi

# ─── 8. Pull and run the post-bootstrap configuration ─────────────────────
log "fetching post-bootstrap configuration..."
mkdir -p /opt/bootstrap
if [ -f /etc/instance-id ]; then
  curl -fsSL --retry 3 --retry-delay 5 \
    "https://config.example.com/$(cat /etc/instance-id)/post-install.sh" \
    -o /opt/bootstrap/post-install.sh
  chmod +x /opt/bootstrap/post-install.sh
  /opt/bootstrap/post-install.sh
fi

# ─── 9. Mark complete ─────────────────────────────────────────────────────
mkdir -p "$(dirname "$MARKER")"
date -u +%FT%TZ > "$MARKER"
log "bootstrap complete"
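Under set -e the template stops at the first failing command, but the log does not say which section it was in. An EXIT trap can record that. A minimal sketch, assuming a STEP variable that you update before each numbered section; STEP and on_exit are hypothetical names, and log mirrors the template's helper.

```shell
#!/bin/sh
# Sketch: report which step failed when the script aborts.
# STEP and on_exit are hypothetical names; log matches the template's helper.
log() { echo "[$(date -u +%FT%TZ)] [bootstrap] $*"; }

STEP=init
on_exit() {
  rc=$?   # exit status the script is about to return
  if [ "$rc" -ne 0 ]; then
    log "FAILED during step '$STEP' (exit code $rc); marker not written"
  fi
}
trap on_exit EXIT

# In the template, set STEP before each numbered section, for example:
#   STEP=packages
#   STEP=metadata
```

Because the marker file is written only in the last step, a failure logged this way also tells you the next boot or manual run will retry from the top.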

Real-World Recipes

Recipe 1: Inject SSH keys from instance metadata

# AWS: SSH keys are at /latest/meta-data/public-keys/
inject_aws_ssh_keys() {
  TOKEN=$(curl -fsS -X PUT 'http://169.254.169.254/latest/api/token' \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 60')
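  KEY_INDEXES=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/public-keys/ \
    | awk -F= '{print $1}')
  # Make sure ~/.ssh exists, then rebuild the key file from scratch so that
  # re-runs don't append duplicates. This replaces keys added by other means.
  install -d -m 0700 -o deploy -g deploy /home/deploy/.ssh
  tmp=$(mktemp)
  for idx in $KEY_INDEXES; do
    curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
      "http://169.254.169.254/latest/meta-data/public-keys/$idx/openssh-key" \
      >> "$tmp"
    echo >> "$tmp"   # guard against a missing trailing newline
  done
  install -m 0600 -o deploy -g deploy "$tmp" /home/deploy/.ssh/authorized_keys
  rm -f "$tmp"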
}

Recipe 2: Bootstrap a Kubernetes node

# Install containerd and kubelet on a fresh Ubuntu host.
bootstrap_k8s_node() {
  apt-get update
  # gnupg provides the gpg binary used for --dearmor below.
  DEBIAN_FRONTEND=noninteractive apt-get install -y \
    curl ca-certificates apt-transport-https gnupg

  # containerd
  install -m 0755 -d /etc/apt/keyrings
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
    | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  cat >/etc/apt/sources.list.d/docker.list <<EOF
deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable
EOF
  apt-get update
  apt-get install -y containerd.io
  containerd config default >/etc/containerd/config.toml
  systemctl restart containerd

  # kubelet, kubeadm
  curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key \
    | gpg --dearmor -o /etc/apt/keyrings/kubernetes.gpg
  echo "deb [signed-by=/etc/apt/keyrings/kubernetes.gpg] \
    https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /" \
    > /etc/apt/sources.list.d/kubernetes.list
  apt-get update
  apt-get install -y kubelet kubeadm kubectl
  apt-mark hold kubelet kubeadm kubectl

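  # Disable swap (kubelet requirement).
  swapoff -a
  # Comment out only active swap entries, so re-runs don't stack '#' characters.
  sed -i '/^[^#].*[[:space:]]swap[[:space:]]/s/^/#/' /etc/fstab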

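  # Sysctl tuning for k8s. The bridge-nf key only exists once the
  # br_netfilter module is loaded, so load it now and on every boot.
  modprobe br_netfilter
  echo br_netfilter >/etc/modules-load.d/kubernetes.conf
  cat >/etc/sysctl.d/99-kubernetes.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
  sysctl --system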
}

Recipe 3: Recover from a partial bootstrap

# If bootstrap is interrupted (network glitch, OOM), the marker won't be set.
# Re-running the script picks up where it left off, IF each step is idempotent.

# This is why the template uses install -d (idempotent dir create), id -u || useradd
# (idempotent user create), and curl with --retry. A failed run leaves clean state
# that a re-run can converge from.

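# To force a re-run on an already-bootstrapped host:
sudo rm /var/lib/bootstrap/done.v1
# user-data.txt is not executable, so run it with sh. This assumes the
# user-data was a single shell script rather than a multi-part archive.
sudo sh /var/lib/cloud/instance/user-data.txt
# Or reset cloud-init's state so user-data runs again on the next boot:
# sudo cloud-init clean --logs && sudo reboot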
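Where a step is expensive or not naturally idempotent, a per-step marker lets a re-run skip what already finished. This is a sketch layered on top of the template, not part of it: run_once and STATE_DIR are hypothetical names.

```shell
#!/bin/sh
# Sketch: per-step markers so a re-run skips work that already finished.
# run_once and STATE_DIR are hypothetical names, not part of the template.
STATE_DIR=${STATE_DIR:-/var/lib/bootstrap/steps}

run_once() {
  step=$1
  shift
  if [ -f "$STATE_DIR/$step.done" ]; then
    echo "step $step: already done, skipping"
    return 0
  fi
  "$@" || return $?   # leave no marker if the step fails
  mkdir -p "$STATE_DIR"
  date -u +%FT%TZ > "$STATE_DIR/$step.done"
}

# Usage:
#   run_once base-packages install_pkg jq
#   run_once post-install /opt/bootstrap/post-install.sh
```

Markers trade convergence for speed: a step that is skipped will not repair drift, so keep cheap idempotent steps outside run_once.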

Footgun List

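  1. set -o pipefail is not portable across /bin/sh implementations. bash and busybox ash accept it, but dash (the /bin/sh on Debian and Ubuntu) has historically rejected it, and it only entered POSIX with the 2024 edition. Portable bootstrap scripts use set -eu only. Move pipefail-dependent logic into bash sub-scripts called after bash is installed.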

  2. [[ ... ]] is bash-only. POSIX uses [ ... ]. Don’t write [[ -d /opt/app ]] in a script that may run under busybox.

  3. Arrays are bash-only. Use space-separated strings or files-as-iteration-source.

  4. <<< (here-string) is bash-only. Use heredocs: cmd <<EOF\n$content\nEOF.

  5. apt-get and dpkg share the same locks. cloud-init’s own package installs, apt-daily, and unattended-upgrades can all hold them while your script starts. Wait for the lock to release before any apt invocation.

  6. DNS may not resolve external names for the first 5–10 seconds. Always have wait_for_network before any curl.

  7. /etc/resolv.conf may be regenerated by cloud-init. Don’t modify it directly; use resolvectl or netplan.

  8. A hostname set by your script can be overwritten when cloud-init applies its own hostname directive. Set the hostname via the cloud-config hostname: directive, not in your script.

  9. Editing /etc/hosts directly conflicts with cloud-init’s manage_etc_hosts: true. Pick one approach.

  10. Re-runs of cloud-init don’t re-run user-data by default. “Idempotent across reboots” is your responsibility; cloud-init runs user-data once per instance unless you run cloud-init clean --logs or remove /var/lib/cloud/instance/sem/config_scripts_user.

  11. Logs at /var/log/cloud-init.log and /var/log/cloud-init-output.log are different. First is cloud-init’s own log; second is the captured stdout/stderr of your scripts.

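  12. exec >> log 2>&1 redirects all subsequent output into that file only. cloud-init captures the script’s original stdout and stderr, so anything printed after the exec no longer reaches /var/log/cloud-init-output.log. If you want the output in both places, pipe it through tee -a instead of redirecting with exec.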
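For item 12, output can be kept in both a private log and cloud-init's captured stdout without bash-only process substitution. A minimal sketch: run_logged is a hypothetical helper, and it assumes mktemp is available (not POSIX, but shipped with GNU coreutils and busybox).

```shell
#!/bin/sh
# Sketch: send a command's output to a log file AND to stdout, and still
# return the command's own exit status (no pipefail needed).
# run_logged is a hypothetical helper name.
run_logged() {
  log_file=$1
  shift
  status_file=$(mktemp) || return 1
  {
    rc=0
    "$@" 2>&1 || rc=$?
    echo "$rc" > "$status_file"   # pass the status out of the pipeline
  } | tee -a "$log_file"
  rc=$(cat "$status_file")
  rm -f "$status_file"
  return "$rc"
}

# Usage:
#   run_logged /var/log/bootstrap.log /opt/bootstrap/post-install.sh
```

Pass a script or external command rather than a shell function: a function called on the left of || runs with set -e suspended, so failures inside it would no longer abort.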

Quick-Reference Card

┌─ POSIX-STRICT BOOTSTRAP ──────────────────────────────────────────────┐
│  #!/bin/sh           shell language                                   │
│  set -eu             no -o pipefail (not portable)                   │
│  Use [ ] not [[ ]]                                                    │
│  No arrays, no <<<, no $'...'                                         │
│  Use heredocs for multi-line strings                                  │
└────────────────────────────────────────────────────────────────────────┘

┌─ cloud-init RUNTIME ──────────────────────────────────────────────────┐
│  Stages: local → init → config → final                                │
│  user-data shell scripts run in `final`                              │
│  cloud-config YAML is more reliable for static config                │
│  Logs: /var/log/cloud-init.log + /var/log/cloud-init-output.log      │
│  Re-run: `cloud-init clean --logs && reboot`                         │
└────────────────────────────────────────────────────────────────────────┘

┌─ METADATA SERVICES ───────────────────────────────────────────────────┐
│  AWS: 169.254.169.254/latest/meta-data + IMDSv2 token                 │
│  Azure: 169.254.169.254/metadata + Metadata: true header              │
│  GCP: metadata.google.internal + Metadata-Flavor: Google              │
│  Detect cloud by trying each (with --max-time 1)                     │
└────────────────────────────────────────────────────────────────────────┘

┌─ WAIT-FOR-X TIMEOUTS ─────────────────────────────────────────────────┐
│  Network up:        60–120s    DNS resolves                          │
│  systemd:           60s        is-system-running: running/degraded    │
│  apt/dnf lock:      300s       no apt-get/dpkg/dnf processes          │
│  Specific service:  30–60s     systemctl is-active                    │
└────────────────────────────────────────────────────────────────────────┘

┌─ DISTRO DETECTION ────────────────────────────────────────────────────┐
│  . /etc/os-release                ID, ID_LIKE, VERSION_ID             │
│  Map ID → debian / rhel / alpine / arch                               │
│  Per-distro: apt-get / dnf (yum fallback) / apk / pacman             │
│  DEBIAN_FRONTEND=noninteractive for apt prompts                      │
└────────────────────────────────────────────────────────────────────────┘

What’s Next

You’ve now bootstrapped a host from zero. Once it’s running, what makes it observable? The next lesson, Monitoring Agents in Shell: Writing Exporters, Health Probes & Watchdog Scripts, covers writing Prometheus-style exporters as shell scripts, building health-check endpoints, and wiring watchdogs that detect “the box is alive but the app is stuck” — the discipline that turns a bootstrapped host into a managed one.
