Why Bootstrap Is Its Own Discipline
You’ve written hundreds of shell scripts. Then you’re handed: “write the script that turns a freshly-booted Linux VM into a member of our cluster.” You write the same code you usually write — curl -fsSL https://api.example.com/..., jq -r '.token', apt-get install -y nginx — and 30% of the boots fail with errors you’ve never seen:
- curl: command not found (Alpine, busybox-only)
- jq: command not found (some Amazon Linux AMIs)
- apt-get: not found (RHEL-family hosts)
- nslookup: SERVFAIL (DNS isn’t up yet; it’s 4 seconds into boot)
- connect: Network is unreachable (interface is lo only; eth0 not yet brought up)
- Error: Could not get lock /var/lib/dpkg/lock-frontend (cloud-init is also installing packages in parallel)
Bootstrap scripts run in a hostile environment because the system isn’t done assembling itself yet. The shell exists, PID 1 is running, but most of the userland tools you depend on either don’t exist or aren’t ready.
The disciplines that matter:
| Discipline | Why |
|---|---|
| POSIX-strict, busybox-compatible | Your script may run on Alpine where [[ ]], <<<, and arrays don’t exist |
| Wait-for-X loops with timeouts | Network, DNS, package manager locks, services — all may not be ready |
| Detect-then-act distro detection | apt-get on Ubuntu, dnf on RHEL, apk on Alpine — no universal package CLI |
| Idempotent from zero | Cloud may re-run user-data; running again must not break |
| No external dependencies on first run | Don’t curl https://... for code; embed it inline or fetch from instance metadata |
| Fail loud and recoverable | A failed bootstrap should leave clear logs and not produce a half-configured host |
This lesson is the cross-distro pattern set: cloud-init anatomy, the metadata-service contract, network-up detection, busybox-safe shell, and a copy-pasteable bootstrap template.
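The "fail loud and recoverable" row can be sketched as an EXIT trap that records the exit code and leaves a failure marker on disk. This is a minimal sketch rather than part of the template later in this lesson; the `report_exit` function and the `FAIL_MARKER` path are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch: make a failed bootstrap obvious in the log and on disk.
# FAIL_MARKER and report_exit are hypothetical names, not cloud-init features.
FAIL_MARKER=${FAIL_MARKER:-/var/lib/bootstrap/failed}

report_exit() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        # A clean run clears any marker left by an earlier failed attempt.
        rm -f "$FAIL_MARKER"
        echo "[bootstrap] finished OK"
    else
        mkdir -p "$(dirname "$FAIL_MARKER")"
        echo "exit=$rc at $(date -u +%FT%TZ)" > "$FAIL_MARKER"
        echo "[bootstrap] FAILED with exit code $rc; see $FAIL_MARKER" >&2
    fi
}

# The EXIT trap fires on a normal exit and when `set -e` aborts the script.
trap 'report_exit $?' EXIT
```

A monitoring job or a later boot can then check for the marker file instead of parsing logs.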
cloud-init: The Bootstrap Framework You Already Have
cloud-init is the de facto bootstrap framework on AWS, Azure, GCP, OpenStack, and bare metal. When a VM boots, cloud-init reads “user-data” supplied by the platform and acts on it. User-data can be:
- A shell script (starts with #!/bin/sh or #!)
- A cloud-config YAML (starts with #cloud-config)
- A multi-part MIME archive (multiple shell scripts and cloud-configs together)
- A gzipped variant of any of the above (cloud-init auto-decompresses)
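To make the format rules above concrete, here is a rough sketch that guesses the user-data type from the first bytes of a file. `classify_user_data` is a hypothetical helper for pre-flight checks on your own machine; it is not a cloud-init command and covers only the four formats listed here.

```shell
#!/bin/sh
# Sketch: guess how a user-data file will be treated, based on its first bytes.
# classify_user_data is a hypothetical helper, not part of cloud-init.
classify_user_data() {
    file=$1
    # gzip streams start with the bytes 0x1f 0x8b.
    magic=$(od -An -tx1 -N2 "$file" | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo gzip
        return 0
    fi
    # Read only the first line; tolerate a missing trailing newline.
    IFS= read -r first < "$file" || true
    case "$first" in
        '#cloud-config'*) echo cloud-config ;;
        '#!'*)            echo shell-script ;;
        'Content-Type: multipart/'*|'MIME-Version:'*) echo mime-multipart ;;
        *)                echo unknown ;;
    esac
}
```

Running `classify_user_data bootstrap.sh` before uploading catches the common mistake of a blank line or comment above the shebang.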
cloud-init runtime stages
BOOT
│
▼
┌──────────────────────────┐
│ cloud-init local │ before networking
│ (datasource, hostname) │
└────────────┬─────────────┘
│
▼
┌──────────────────────────┐
│ cloud-init init │ network is up
│ (resize disks, ssh keys)│
└────────────┬─────────────┘
│
▼
┌──────────────────────────┐
│ cloud-init config │ modules: write_files, runcmd, etc.
│ (apt sources, packages) │
└────────────┬─────────────┘
│
▼
┌──────────────────────────┐
│ cloud-init final │ user_data runs here (shell scripts)
│ (runcmd, scripts, etc.) │
└────────────┬─────────────┘
│
▼
READY
User-data scripts run during the final stage — after networking is configured, after package sources are set up, but before the system is fully “ready” for users. You’re root, you have network, you have a stable hostname, but other services may still be starting.
A minimal user-data shell script
#!/bin/sh
# cloud-init user-data: bootstrap-v1
# Runs ONCE on first boot.
set -eu # no -o pipefail: not every /bin/sh accepts it (dash rejects it)
# From here on, stdout and stderr go to our own log, not /var/log/cloud-init-output.log.
exec >> /var/log/bootstrap.log 2>&1
echo "[$(date -u +%FT%TZ)] bootstrap starting"
# ... your bootstrap work ...
echo "[$(date -u +%FT%TZ)] bootstrap complete"
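Three notes:
- set -eu is the portable baseline. set -euo pipefail is not safe under every /bin/sh: dash, the default /bin/sh on many Debian and Ubuntu releases, rejects -o pipefail. If the script must run under whatever /bin/sh the image provides, use set -eu only.
- exec >> file 2>&1 redirects everything from this point on. Because both descriptors now point at the file, later output no longer reaches /var/log/cloud-init-output.log; only what the script printed before the exec is captured there. The trade-off buys a separate log we control.
- The shebang #!/bin/sh is important: cloud-init recognizes the script type by the leading #!. #!/bin/bash works but is not portable to Alpine, which does not ship bash by default.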
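Since `exec >> file 2>&1` moves both descriptors to the file, nothing printed afterwards reaches cloud-init's own capture. If you want each message in both places, one option is to skip the global `exec` and tee individual log lines. A minimal sketch, assuming a hypothetical `BOOTSTRAP_LOG` variable and `log` helper:

```shell
#!/bin/sh
set -eu
# Hypothetical log path; override BOOTSTRAP_LOG when testing.
BOOTSTRAP_LOG=${BOOTSTRAP_LOG:-/var/log/bootstrap.log}

# Print one timestamped line to stdout (which cloud-init captures) and
# append the same line to our own log file.
log() {
    printf '[%s] %s\n' "$(date -u +%FT%TZ)" "$*" | tee -a "$BOOTSTRAP_LOG"
}
```

Call it as `log "bootstrap starting"`. Output of other commands still goes only to stdout unless you pipe it through `tee -a "$BOOTSTRAP_LOG"` as well.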
cloud-config: declarative bootstrap
For straightforward cases, cloud-config YAML is more reliable than shell scripts:
#cloud-config
hostname: web-001
fqdn: web-001.prod.internal
manage_etc_hosts: true
users:
  - name: deploy
    groups: sudo
    shell: /bin/bash
    sudo: 'ALL=(ALL) NOPASSWD:ALL'
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... deploy@example
write_files:
  - path: /etc/myapp/config.json
    permissions: '0640'
    owner: 'root:myapp'
    content: |
      {"port": 8080, "log_level": "info"}
package_update: true
package_upgrade: false # don't auto-upgrade in production; pin versions
packages:
  - curl
  - jq
  - chrony
runcmd:
  - systemctl enable --now chrony
  - /opt/myapp/bin/post-install.sh
cloud-config is declarative and idempotent by design. Use it for the static parts (users, packages, files); reserve shell scripts for dynamic logic that cloud-config can’t express.
Multi-part user-data
For complex bootstraps, combine cloud-config and shell:
Content-Type: multipart/mixed; boundary="===PART==="
MIME-Version: 1.0

--===PART===
Content-Type: text/cloud-config

#cloud-config
packages: [jq, curl]

--===PART===
Content-Type: text/x-shellscript

#!/bin/sh
set -eu
echo "shell stage"
# ... your provisioning ...

--===PART===--
This runs the cloud-config first (installing jq, curl), then your shell script with those tools available. Generated easily with cloud-init devel make-mime:
cloud-init devel make-mime \
-a packages.cfg:cloud-config \
-a bootstrap.sh:x-shellscript \
> combined.mime
The Metadata Service: Bootstrap-Time Configuration
Each cloud platform exposes an HTTP metadata service at a well-known address that VMs can query for instance-specific data: hostname, IP, region, IAM credentials, user-supplied tags, and arbitrary user-data.
| Cloud | Endpoint | Token required? |
|---|---|---|
| AWS | http://169.254.169.254/latest/meta-data/ | IMDSv2 requires PUT to get a token |
| Azure | http://169.254.169.254/metadata/instance?api-version=2021-02-01 | Header: Metadata: true |
| GCP | http://metadata.google.internal/computeMetadata/v1/ | Header: Metadata-Flavor: Google |
AWS IMDSv2 (the modern, secured version)
# Get a session token (valid 6 hours).
TOKEN=$(curl -fsS -X PUT 'http://169.254.169.254/latest/api/token' \
-H 'X-aws-ec2-metadata-token-ttl-seconds: 21600')
# Use it.
INSTANCE_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/placement/region)
ROLE=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/)
# Get IAM credentials for the role.
CREDS=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
"http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE")
# CREDS is JSON: { AccessKeyId, SecretAccessKey, Token, Expiration, ... }
IMDSv1 (no token) is being phased out. Always use IMDSv2 in new bootstrap scripts.
Azure metadata
INSTANCE=$(curl -fsS -H 'Metadata: true' \
'http://169.254.169.254/metadata/instance?api-version=2021-02-01')
# Parse with jq if available; otherwise sed.
VM_NAME=$(echo "$INSTANCE" | jq -r '.compute.name')
REGION=$(echo "$INSTANCE" | jq -r '.compute.location')
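The comment above mentions sed as the fallback when jq is missing. A rough sketch of that fallback follows. `json_string` is a hypothetical helper; it only handles simple `"key": "value"` string pairs without escaped quotes, and it returns the last match when a key appears more than once, so it is not a substitute for a real JSON parser.

```shell
#!/bin/sh
# Sketch: extract one string value from JSON on stdin without jq.
# json_string is a hypothetical helper with deliberately narrow scope:
# plain string values only, no escaped quotes, last occurrence wins.
json_string() {
    key=$1
    # Flatten to a single line, then capture the value that follows "key":
    tr -d '\n' \
        | sed -n 's/.*"'"$key"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Usage would look like `REGION=$(echo "$INSTANCE" | json_string location)`. Prefer keys that occur exactly once in the document, and switch to jq as soon as it is installed.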
For managed-identity tokens:
TOKEN=$(curl -fsS -H 'Metadata: true' \
'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/' \
| jq -r '.access_token')
GCP metadata
INSTANCE=$(curl -fsS -H 'Metadata-Flavor: Google' \
'http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true')
ZONE=$(curl -fsS -H 'Metadata-Flavor: Google' \
'http://metadata.google.internal/computeMetadata/v1/instance/zone' \
| awk -F/ '{print $NF}')
# Custom metadata key:
DEPLOY_ENV=$(curl -fsS -H 'Metadata-Flavor: Google' \
'http://metadata.google.internal/computeMetadata/v1/instance/attributes/deploy-env')
# IAM identity token (for OIDC auth):
TOKEN=$(curl -fsS -H 'Metadata-Flavor: Google' \
'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=https://my-service')
The metadata-service-as-secret-store pattern
Bootstrap scripts often need configuration that varies per-instance: which database to connect to, which API endpoint, what role to assume. Don’t bake these into AMIs; pass them via metadata:
# AWS launch template sets user-data including custom data:
# {"db_endpoint": "prod-db.us-west-2.internal", "feature_flags": ["x", "y"]}
USER_DATA=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/user-data)
# Or instance tags via API (requires IAM):
TAGS=$(aws ec2 describe-tags \
--filters "Name=resource-id,Values=$INSTANCE_ID" \
--query 'Tags[].[Key,Value]' --output text)
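Keep secrets out of user-data: it is stored unencrypted, can end up in logs, and is readable by any process on the instance that can reach the metadata service. Fetch secrets from AWS Secrets Manager / Azure Key Vault / GCP Secret Manager during bootstrap instead, authenticating with the IAM credentials that the metadata service makes available.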
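As an illustration of the secret-fetch step on AWS, the sketch below wraps the AWS CLI call. It assumes the AWS CLI is installed and that the instance role is allowed to read the secret; the `fetch_secret` helper and the secret name in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch: read one secret from AWS Secrets Manager using the instance role.
# fetch_secret is a hypothetical helper; it assumes the AWS CLI is on PATH.
fetch_secret() {
    # $1 = secret id or ARN, $2 = region
    aws secretsmanager get-secret-value \
        --secret-id "$1" \
        --region "$2" \
        --query SecretString \
        --output text
}
```

For example, `DB_PASSWORD=$(fetch_secret prod/myapp/db-password "$REGION")`. Avoid `set -x` and avoid echoing the variable, since the bootstrap log would then contain the secret.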
Wait-For-X Patterns: Don’t Race the System
Bootstrap is racy. Things that “should be there” might not be yet. The discipline: never assume; wait with a timeout.
Wait for network connectivity
wait_for_network() {
timeout=${1:-60}
i=0
while [ "$i" -lt "$timeout" ]; do
# POSIX: do not use [[ ]] or arrays.
if getent hosts deb.debian.org >/dev/null 2>&1 || \
getent hosts amazon.com >/dev/null 2>&1 || \
getent hosts google.com >/dev/null 2>&1; then
return 0
fi
sleep 1
i=$((i + 1))
done
echo "wait_for_network: timed out after ${timeout}s" >&2
return 1
}
wait_for_network 60 || exit 1
Why three hosts? Because a single name can fail to resolve for unrelated reasons (an outage in one zone, a firewall rule). Three independent zones resolving means “DNS is generally working.”
ping is the wrong primitive: ICMP is often blocked. Note that getent hosts proves DNS resolution only; if the next step depends on a specific endpoint, follow the DNS check with an actual connection attempt to that endpoint.
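The function above stops at DNS. A connection attempt can be sketched with whichever HTTP client the image happens to ship. `probe_url` is a hypothetical helper, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: confirm that an endpoint actually answers, using curl or wget.
# probe_url is a hypothetical helper name.
probe_url() {
    url=$1
    if command -v curl >/dev/null 2>&1; then
        curl -fsS --max-time 5 -o /dev/null "$url"
    elif command -v wget >/dev/null 2>&1; then
        # -T sets the timeout; --spider checks without saving the body.
        wget -q -T 5 --spider "$url"
    else
        echo "probe_url: neither curl nor wget is installed" >&2
        return 2
    fi
}
```

Probe the endpoint the bootstrap really needs (a package mirror, your config server) rather than a generic site, so a pass means the next step can succeed.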
Wait for package-manager lock to release
wait_for_apt() {
timeout=${1:-300}
i=0
while [ "$i" -lt "$timeout" ]; do
if ! pgrep -x apt-get >/dev/null 2>&1 && ! pgrep -x dpkg >/dev/null 2>&1; then
# No apt process running.
return 0
fi
sleep 2
i=$((i + 2))
done
return 1
}
wait_for_apt 300 || { echo "apt-get is busy; aborting"; exit 1; }
apt-get update
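Other boot-time jobs can hold the dpkg lock when your script starts: cloud-init’s own package modules, apt-daily, and unattended-upgrades are the usual suspects. If your final-stage script also runs apt, you race. Wait for the lock.
For dnf/yum, check pgrep -x dnf and pgrep -x yum. For apk, check pgrep -x apk.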
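Polling with pgrep can miss a process that starts between the check and your install. On newer apt releases an alternative is to let apt wait for the lock itself via the `DPkg::Lock::Timeout` option. The sketch below assumes apt 2.0 or later (for example Ubuntu 20.04 or Debian 11); older releases do not wait. `apt_install` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch: let apt wait up to 300 seconds for the dpkg lock instead of polling.
# Assumes a release of apt that honors DPkg::Lock::Timeout (apt 2.0+ assumed).
# apt_install is a hypothetical wrapper, not an apt command.
apt_install() {
    DEBIAN_FRONTEND=noninteractive apt-get \
        -o DPkg::Lock::Timeout=300 \
        install -y "$@"
}
```

Usage: `apt_install curl jq`. Keeping the pgrep loop as well is harmless and covers hosts with an older apt.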
Wait for systemd to finish booting
wait_for_systemd_running() {
timeout=${1:-120}
i=0
while [ "$i" -lt "$timeout" ]; do
state=$(systemctl is-system-running 2>/dev/null || true)
case "$state" in
running|degraded) return 0 ;;
esac
sleep 2
i=$((i + 2))
done
return 1
}
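# Best effort: under set -e a bare failing call would abort the whole script.
wait_for_systemd_running || echo "systemd has not settled; continuing" >&2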
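systemctl is-system-running returns:
- initializing / starting: boot in progress
- running: fully booted
- degraded: booted, but one or more units failed
- maintenance: the rescue or emergency target is active
- stopping: shutdown in progress
degraded is acceptable for “system is functional”; it means at least one unit failed to start, which is often a non-critical one. One caveat: cloud-init’s final stage is itself part of the boot, so a script launched from user-data may keep seeing starting for as long as it runs. In that context, treat a timeout here as a warning rather than a fatal error.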
Wait for a specific service
wait_for_service() {
service=$1
timeout=${2:-60}
i=0
while [ "$i" -lt "$timeout" ]; do
if systemctl is-active --quiet "$service" 2>/dev/null; then
return 0
fi
sleep 1
i=$((i + 1))
done
return 1
}
wait_for_service docker 30 || exit 1
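The four loops above share one shape: run a check, sleep, give up after a deadline. That shape can be factored into a single helper. This is a sketch only; `wait_until` is a hypothetical name, and because POSIX sh has no `local`, its variables are global.

```shell
#!/bin/sh
# Sketch: one polling helper for every "wait for X" case.
# wait_until is a hypothetical name; the functions above do not use it.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # The check's own output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "wait_until: '$*' not ready after ${timeout}s" >&2
    return 1
}
```

With it, the service wait becomes `wait_until 30 1 systemctl is-active --quiet docker || exit 1`, and the DNS wait becomes `wait_until 60 1 getent hosts deb.debian.org || exit 1`.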
Distro Detection: One Detection, Three Code Paths
detect_distro() {
if [ -r /etc/os-release ]; then
# /etc/os-release is the modern standard.
. /etc/os-release # sets ID, ID_LIKE, VERSION_ID, etc. in the current shell
case "${ID:-}" in
ubuntu|debian) DISTRO=debian; PKG=apt-get ;;
rhel|centos|fedora|rocky|almalinux|amzn) DISTRO=rhel; PKG=dnf ;;
alpine) DISTRO=alpine; PKG=apk ;;
arch) DISTRO=arch; PKG=pacman ;;
*)
# Fall back to ID_LIKE for derivatives. ID_LIKE is optional, so default it
# to empty; an unset variable would abort the script under set -u.
case "${ID_LIKE:-}" in
*debian*) DISTRO=debian; PKG=apt-get ;;
*rhel*|*fedora*) DISTRO=rhel; PKG=dnf ;;
*) DISTRO=unknown; PKG="" ;;
esac
;;
esac
else
DISTRO=unknown; PKG=""
fi
# dnf may not exist on older RHEL/CentOS 7; fall back to yum.
if [ "$PKG" = "dnf" ] && ! command -v dnf >/dev/null 2>&1; then
PKG=yum
fi
export DISTRO PKG
}
detect_distro
echo "Detected: $DISTRO using $PKG"
/etc/os-release is supported on every modern Linux; it’s the canonical source. Older systems had /etc/redhat-release, /etc/alpine-release, etc. — fall back to those if needed.
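The fallback to the older release files could look like the sketch below. `detect_distro_legacy` is a hypothetical helper; its optional argument is a filesystem root prefix, included only so the function can be exercised against a fake directory tree.

```shell
#!/bin/sh
# Sketch: distro-family detection for hosts that lack /etc/os-release.
# detect_distro_legacy is a hypothetical helper. $1 is an optional root
# prefix (empty by default) so the function can be tested off-host.
detect_distro_legacy() {
    root=${1:-}
    if [ -r "$root/etc/alpine-release" ]; then
        echo alpine
    elif [ -r "$root/etc/redhat-release" ]; then
        echo rhel
    elif [ -r "$root/etc/debian_version" ]; then
        echo debian
    else
        echo unknown
        return 1
    fi
}
```

Call it only when `/etc/os-release` is unreadable, for example `DISTRO=$(detect_distro_legacy) || exit 1`.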
Cross-distro install function
install_pkg() {
pkg=$1
case "$DISTRO" in
debian)
DEBIAN_FRONTEND=noninteractive apt-get install -y "$pkg"
;;
rhel)
"$PKG" install -y "$pkg"
;;
alpine)
apk add --no-cache "$pkg"
;;
arch)
pacman -Sy --noconfirm "$pkg"
;;
*)
echo "install_pkg: unknown distro $DISTRO" >&2
return 1
;;
esac
}
install_pkg jq
install_pkg curl
Note DEBIAN_FRONTEND=noninteractive for apt: it suppresses interactive prompts (for things like grub config) that would otherwise hang the bootstrap.
A Cross-Distro Bootstrap Template
#!/bin/sh
# bootstrap.sh — first-boot provisioning, POSIX-strict.
# Usable as cloud-init user-data on AWS, Azure, GCP, bare metal.
set -eu
exec >> /var/log/bootstrap.log 2>&1
log() { echo "[$(date -u +%FT%TZ)] [bootstrap] $*"; }
# ─── 0. Idempotency guard ─────────────────────────────────────────────────
MARKER=/var/lib/bootstrap/done.v1
if [ -f "$MARKER" ]; then
log "already bootstrapped at $(cat "$MARKER"); skipping"
exit 0
fi
# ─── 1. Wait for system to be ready ───────────────────────────────────────
log "waiting for systemd to settle..."
i=0
while [ "$i" -lt 60 ]; do
state=$(systemctl is-system-running 2>/dev/null || true)
case "$state" in
running|degraded) break ;;
esac
sleep 2
i=$((i + 2))
done
log "waiting for network..."
i=0
while [ "$i" -lt 60 ]; do
if getent hosts deb.debian.org >/dev/null 2>&1 \
|| getent hosts amazon.com >/dev/null 2>&1 \
|| getent hosts google.com >/dev/null 2>&1; then
break
fi
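sleep 1
i=$((i + 1))
done
# The loop ends either by break (DNS works) or by running out of time.
[ "$i" -lt 60 ] || { log "network not ready after 60s; aborting"; exit 1; }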
# ─── 2. Distro detection ──────────────────────────────────────────────────
. /etc/os-release
case "$ID" in
ubuntu|debian) DISTRO=debian ;;
rhel|centos|fedora|rocky|almalinux|amzn) DISTRO=rhel ;;
alpine) DISTRO=alpine ;;
*) log "unknown distro: $ID"; exit 1 ;;
esac
log "detected: $DISTRO ($PRETTY_NAME)"
# ─── 3. Wait for package manager lock ─────────────────────────────────────
log "waiting for package manager lock..."
i=0
while [ "$i" -lt 300 ]; do
case "$DISTRO" in
debian)
pgrep -x apt-get >/dev/null 2>&1 || pgrep -x dpkg >/dev/null 2>&1 || break
;;
rhel)
pgrep -x dnf >/dev/null 2>&1 || pgrep -x yum >/dev/null 2>&1 || break
;;
alpine)
pgrep -x apk >/dev/null 2>&1 || break
;;
esac
sleep 2
i=$((i + 2))
done
# ─── 4. Install base packages ─────────────────────────────────────────────
log "installing base packages..."
case "$DISTRO" in
debian)
DEBIAN_FRONTEND=noninteractive apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y \
curl jq ca-certificates chrony
;;
rhel)
yum install -y curl jq ca-certificates chrony 2>/dev/null \
|| dnf install -y curl jq ca-certificates chrony
;;
alpine)
apk add --no-cache curl jq ca-certificates chrony bash
;;
esac
# ─── 5. Fetch metadata (cloud-specific) ───────────────────────────────────
log "fetching instance metadata..."
fetch_metadata_aws() {
TOKEN=$(curl -fsS -X PUT 'http://169.254.169.254/latest/api/token' \
-H 'X-aws-ec2-metadata-token-ttl-seconds: 21600' || true)
if [ -n "${TOKEN:-}" ]; then
INSTANCE_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/instance-id || true)
REGION=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/placement/region || true)
log "AWS: instance=$INSTANCE_ID region=$REGION"
echo "$INSTANCE_ID" > /etc/instance-id
echo "$REGION" > /etc/region
fi
}
fetch_metadata_azure() {
if curl -fsS -H 'Metadata: true' \
'http://169.254.169.254/metadata/instance?api-version=2021-02-01' >/tmp/azure.json; then
VM_NAME=$(jq -r '.compute.name' /tmp/azure.json)
REGION=$(jq -r '.compute.location' /tmp/azure.json)
log "Azure: vm=$VM_NAME region=$REGION"
echo "$VM_NAME" > /etc/instance-id
echo "$REGION" > /etc/region
fi
}
fetch_metadata_gcp() {
if curl -fsS -H 'Metadata-Flavor: Google' \
http://metadata.google.internal/computeMetadata/v1/instance/id >/tmp/gcp.id; then
INSTANCE_ID=$(cat /tmp/gcp.id)
ZONE=$(curl -fsS -H 'Metadata-Flavor: Google' \
http://metadata.google.internal/computeMetadata/v1/instance/zone \
| awk -F/ '{print $NF}')
log "GCP: instance=$INSTANCE_ID zone=$ZONE"
echo "$INSTANCE_ID" > /etc/instance-id
echo "$ZONE" > /etc/region
fi
}
# Detect cloud and fetch.
if curl -fsS --max-time 1 'http://169.254.169.254/latest/meta-data/' \
-H 'X-aws-ec2-metadata-token: 1' >/dev/null 2>&1 \
|| curl -fsS --max-time 1 -X PUT 'http://169.254.169.254/latest/api/token' \
-H 'X-aws-ec2-metadata-token-ttl-seconds: 60' >/dev/null 2>&1; then
fetch_metadata_aws
elif curl -fsS --max-time 1 -H 'Metadata: true' \
'http://169.254.169.254/metadata/instance?api-version=2021-02-01' >/dev/null 2>&1; then
fetch_metadata_azure
elif curl -fsS --max-time 1 -H 'Metadata-Flavor: Google' \
http://metadata.google.internal/computeMetadata/v1/instance/id >/dev/null 2>&1; then
fetch_metadata_gcp
else
log "no recognizable metadata service; running on bare metal?"
fi
# ─── 6. Configure timekeeping ─────────────────────────────────────────────
log "starting chrony..."
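if command -v systemctl >/dev/null 2>&1; then
# The unit is "chronyd" on RHEL-family hosts and "chrony" on Debian/Ubuntu.
systemctl enable --now chronyd 2>/dev/null || \
systemctl enable --now chrony 2>/dev/null || true
elif command -v rc-update >/dev/null 2>&1; then
# Alpine uses OpenRC rather than systemd.
rc-update add chronyd default && rc-service chronyd start || true
fi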
# ─── 7. Create deploy user ────────────────────────────────────────────────
log "creating deploy user..."
if ! id -u deploy >/dev/null 2>&1; then
useradd --system --create-home --shell /bin/bash deploy
install -d -m 0700 -o deploy -g deploy /home/deploy/.ssh
fi
# ─── 8. Pull and run the post-bootstrap configuration ─────────────────────
log "fetching post-bootstrap configuration..."
mkdir -p /opt/bootstrap
if [ -f /etc/instance-id ]; then
curl -fsSL --retry 3 --retry-delay 5 \
"https://config.example.com/$(cat /etc/instance-id)/post-install.sh" \
-o /opt/bootstrap/post-install.sh
chmod +x /opt/bootstrap/post-install.sh
/opt/bootstrap/post-install.sh
fi
# ─── 9. Mark complete ─────────────────────────────────────────────────────
mkdir -p "$(dirname "$MARKER")"
date -u +%FT%TZ > "$MARKER"
log "bootstrap complete"
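Step 8 of the template downloads a script and runs it as root, which sits uneasily with the "no external dependencies" discipline. One way to narrow that exposure is to pin the script's SHA-256 digest and refuse to run anything else. A minimal sketch, assuming `sha256sum` is available (GNU coreutils and busybox both provide it); `verify_sha256` is a hypothetical helper, and where the expected digest comes from (user-data, instance metadata, a baked-in constant) is left open.

```shell
#!/bin/sh
# Sketch: refuse to run a downloaded script unless its SHA-256 matches.
# verify_sha256 is a hypothetical helper, not part of the template above.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" != "$expected" ]; then
        echo "verify_sha256: digest mismatch for $file" >&2
        return 1
    fi
}
```

In the template this would sit between the download and the `chmod +x`, for example `verify_sha256 /opt/bootstrap/post-install.sh "$EXPECTED_SHA256" || exit 1`, where `EXPECTED_SHA256` is a value you supply.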
Real-World Recipes
Recipe 1: Inject SSH keys from instance metadata
# AWS: SSH keys are at /latest/meta-data/public-keys/
inject_aws_ssh_keys() {
TOKEN=$(curl -fsS -X PUT 'http://169.254.169.254/latest/api/token' \
-H 'X-aws-ec2-metadata-token-ttl-seconds: 60')
KEY_INDEXES=$(curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/public-keys/ \
| awk -F= '{print $1}')
for idx in $KEY_INDEXES; do
curl -fsS -H "X-aws-ec2-metadata-token: $TOKEN" \
"http://169.254.169.254/latest/meta-data/public-keys/$idx/openssh-key" \
>> /home/deploy/.ssh/authorized_keys
done
chmod 0600 /home/deploy/.ssh/authorized_keys
chown deploy:deploy /home/deploy/.ssh/authorized_keys
}
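As written, the recipe appends with `>>`, so a second run duplicates every key. An idempotent variant can check for the exact line first. A small sketch; `add_key_once` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: append a public key only if that exact line is not already present,
# so re-running the bootstrap does not duplicate entries.
# add_key_once is a hypothetical helper name.
add_key_once() {
    key=$1
    file=$2
    touch "$file"
    # -x matches the whole line, -F treats the key as a fixed string,
    # -e keeps a key that starts with "-" from being read as an option.
    if ! grep -qxF -e "$key" "$file"; then
        printf '%s\n' "$key" >> "$file"
    fi
}
```

Inside the loop, capture each key into a variable with `key=$(curl ...)` and call `add_key_once "$key" /home/deploy/.ssh/authorized_keys`.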
Recipe 2: Bootstrap a Kubernetes node
# Install containerd and kubelet on a fresh Ubuntu host.
bootstrap_k8s_node() {
apt-get update
# gnupg provides gpg and lsb-release provides lsb_release, both used below.
apt-get install -y curl ca-certificates apt-transport-https gnupg lsb-release
# containerd
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| gpg --dearmor -o /etc/apt/keyrings/docker.gpg
cat >/etc/apt/sources.list.d/docker.list <<EOF
deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable
EOF
apt-get update
apt-get install -y containerd.io
containerd config default >/etc/containerd/config.toml
systemctl restart containerd
# kubelet, kubeadm
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key \
| gpg --dearmor -o /etc/apt/keyrings/kubernetes.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes.gpg] \
https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /" \
> /etc/apt/sources.list.d/kubernetes.list
apt-get update
apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl
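# Disable swap (kubelet requirement).
swapoff -a
# Comment out only active swap entries, so a re-run does not stack up '#'.
sed -i '/^[^#].*[[:space:]]swap[[:space:]]/s/^/#/' /etc/fstab
# Kernel module and sysctl tuning for k8s.
# The bridge-nf sysctl only exists once br_netfilter is loaded.
echo br_netfilter >/etc/modules-load.d/k8s.conf
modprobe br_netfilter
cat >/etc/sysctl.d/99-kubernetes.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system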
}
Recipe 3: Recover from a partial bootstrap
# If bootstrap is interrupted (network glitch, OOM), the marker won't be set.
# Re-running the script picks up where it left off, IF each step is idempotent.
# This is why the template uses install -d (idempotent dir create), id -u || useradd
# (idempotent user create), and curl with --retry. A failed run leaves clean state
# that a re-run can converge from.
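# To force a re-run on an already-bootstrapped host:
sudo rm /var/lib/bootstrap/done.v1
# /var/lib/cloud/instance is a symlink to the current instance's directory.
# The saved copy is not executable, so run it through sh. This only works
# when the user-data was a plain shell script, not cloud-config or MIME.
sudo sh /var/lib/cloud/instance/user-data.txt
# Or have cloud-init re-run user-data on the next boot with
# cloud-init clean --logs followed by a reboot (see your distro docs).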
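"Picks up where it left off" can be made explicit with one marker per step, so a re-run skips work that already succeeded instead of relying on every command being idempotent. This is a sketch of that idea and not something the template above does; `run_once` and `STEP_DIR` are hypothetical names.

```shell
#!/bin/sh
# Sketch: per-step markers so a re-run skips steps that already succeeded.
# run_once and STEP_DIR are hypothetical names, not part of the template.
STEP_DIR=${STEP_DIR:-/var/lib/bootstrap/steps}

run_once() {
    step=$1
    shift
    if [ -f "$STEP_DIR/$step" ]; then
        echo "step $step: already done, skipping"
        return 0
    fi
    # Run the step; on failure return without writing the marker.
    "$@" || return 1
    mkdir -p "$STEP_DIR"
    date -u +%FT%TZ > "$STEP_DIR/$step"
}
```

Usage: `run_once install-packages install_pkg jq` or `run_once k8s bootstrap_k8s_node`. Deleting a single marker file forces just that step to run again.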
Footgun List
- set -o pipefail is not portable across /bin/sh implementations: dash, the default /bin/sh on many Debian and Ubuntu releases, rejects it. POSIX-strict bootstrap scripts use set -eu only. Move pipefail-dependent logic into bash sub-scripts called after bash is installed.
- [[ ... ]] is bash-only. POSIX uses [ ... ]. Don’t write [[ -d /opt/app ]] in a script that may run under busybox.
- Arrays are bash-only. Use space-separated strings or files-as-iteration-source.
- <<< (here-string) is bash-only. Use heredocs: cmd <<EOF\n$content\nEOF.
- apt-get and dpkg share the same lock files. Any other package operation running during boot races your script. Wait for the lock to release before any apt invocation.
- DNS may not resolve external names for the first 5–10 seconds. Always call wait_for_network before any curl.
- /etc/resolv.conf may be regenerated by cloud-init. Don’t modify it directly; use resolvectl or netplan.
- A hostname set by your script can conflict with cloud-init’s own hostname handling. Set the hostname via the cloud-config hostname: directive, not in your script.
- Editing /etc/hosts directly conflicts with cloud-init’s manage_etc_hosts: true. Pick one approach.
- Re-runs of cloud-init don’t re-run user-data by default. “Idempotent across reboots” is your responsibility; cloud-init runs user-data once per instance unless you run cloud-init clean --logs or remove /var/lib/cloud/instance/sem/config_scripts_user.
- The logs at /var/log/cloud-init.log and /var/log/cloud-init-output.log are different. The first is cloud-init’s own log; the second is the captured stdout/stderr of your scripts.
- exec >> log 2>&1 redirects all subsequent output to your log only. cloud-init’s capture in /var/log/cloud-init-output.log stops receiving it, so print anything you want there before the exec, or tee individual lines.
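The three bashisms in the list above each have a direct POSIX replacement. A short sketch; the function names are illustrative only.

```shell
#!/bin/sh
# Sketch: POSIX replacements for three common bashisms.
# All function names here are illustrative.

# 1. Instead of [[ $name == web-* ]], use case for pattern matching.
is_web_host() {
    case "$1" in
        web-*) return 0 ;;
        *)     return 1 ;;
    esac
}

# 2. Instead of an array, use the positional parameters via "set --".
count_packages() {
    set -- curl jq chrony        # replaces pkgs=(curl jq chrony)
    echo "$#"                    # replaces ${#pkgs[@]}
}

# 3. Instead of a here-string (read word rest <<< "$1"), use a heredoc.
first_word() {
    read -r word _rest <<EOF
$1
EOF
    echo "$word"
}
```

Inside a function, `set --` replaces only that function's positional parameters, which makes it a reasonable stand-in for a local array; iterate with `for pkg in "$@"`.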
Quick-Reference Card
┌─ POSIX-STRICT BOOTSTRAP ──────────────────────────────────────────────┐
│ #!/bin/sh shell language │
│ set -eu no -o pipefail (not portable) │
│ Use [ ] not [[ ]] │
│ No arrays, no <<<, no $'...' │
│ Use heredocs for multi-line strings │
└────────────────────────────────────────────────────────────────────────┘
┌─ cloud-init RUNTIME ──────────────────────────────────────────────────┐
│ Stages: local → init → config → final │
│ user-data shell scripts run in `final` │
│ cloud-config YAML is more reliable for static config │
│ Logs: /var/log/cloud-init.log + /var/log/cloud-init-output.log │
│ Re-run: `cloud-init clean --logs && reboot` │
└────────────────────────────────────────────────────────────────────────┘
┌─ METADATA SERVICES ───────────────────────────────────────────────────┐
│ AWS: 169.254.169.254/latest/meta-data + IMDSv2 token │
│ Azure: 169.254.169.254/metadata + Metadata: true header │
│ GCP: metadata.google.internal + Metadata-Flavor: Google │
│ Detect cloud by trying each (with --max-time 1) │
└────────────────────────────────────────────────────────────────────────┘
┌─ WAIT-FOR-X TIMEOUTS ─────────────────────────────────────────────────┐
│ Network up: 60–120s DNS resolves │
│ systemd: 60s is-system-running != initializing │
│ apt/dnf lock: 300s no apt-get/dpkg/dnf processes │
│ Specific service: 30–60s systemctl is-active │
└────────────────────────────────────────────────────────────────────────┘
┌─ DISTRO DETECTION ────────────────────────────────────────────────────┐
│ . /etc/os-release ID, ID_LIKE, VERSION_ID │
│ Map ID → debian / rhel / alpine / arch │
│ Per-distro: apt-get / dnf (yum fallback) / apk / pacman │
│ DEBIAN_FRONTEND=noninteractive for apt prompts │
└────────────────────────────────────────────────────────────────────────┘
What’s Next
You’ve now bootstrapped a host from zero. Once it’s running, what makes it observable? The next lesson, Monitoring Agents in Shell: Writing Exporters, Health Probes & Watchdog Scripts, covers writing Prometheus-style exporters as shell scripts, building health-check endpoints, and wiring watchdogs that detect “the box is alive but the app is stuck” — the discipline that turns a bootstrapped host into a managed one.