Containerization Security

Hardening the Docker Daemon: Rootless Mode, User Namespace Remapping, and Custom seccomp/AppArmor Profiles

The default Docker install puts a root-owned daemon on a UNIX socket and runs your containers as UID 0. That is two privilege problems stacked on top of each other: the daemon is a root-equivalent service, and root inside the container is the same UID 0 as root on the host. This guide closes both gaps layer by layer (rootless daemon, user namespace remapping, capability dropping, and bespoke seccomp/AppArmor profiles) with commands you can run on a clean Ubuntu 22.04/24.04 host today.

1. The threat model

Defence in depth only makes sense once you name the escape vectors. These are the ones that actually matter at runtime, in rough order of how often they bite.

Vector | What it abuses | Primary control
Root-in-container -> root-on-host | UID 0 in the container maps to UID 0 on the host through a bind mount or kernel bug | userns-remap, or rootless daemon
Excess capabilities | Default grants such as CAP_NET_RAW and CAP_DAC_OVERRIDE, plus explicitly added ones such as CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH | --cap-drop ALL + selective add-back
Dangerous syscalls | keyctl, unshare, mount, bpf, kernel-exploit primitives | seccomp profile
Filesystem / network breakout | Writing host paths, reaching the metadata endpoint | AppArmor (or SELinux) confinement
Daemon socket exposure | A container with /var/run/docker.sock mounted owns the host | Rootless daemon; never mount the socket

The through-line is the root-in-container problem. By default the process inside a container runs as UID 0, and that UID 0 is the same UID 0 the kernel sees on the host. If the container ever touches a host resource — a bind-mounted directory, a device node, a leaked file descriptor — it does so with host root authority. Every layer below either removes that authority (rootless, userns-remap) or fences in what root can still do (capabilities, seccomp, AppArmor).

A useful mental model: capabilities decide what privileged operations a process may attempt, seccomp decides which syscalls it may issue, and AppArmor decides which files and sockets those syscalls may touch. They are independent gates; a syscall must pass all three.

2. Install and configure rootless Docker

Rootless mode runs dockerd and your containers entirely inside your own user’s namespaces. There is no root-owned daemon, so a daemon compromise yields your user, not the box.

Prerequisites

You need newuidmap/newgidmap (the setuid helpers that grant your unprivileged user a range of sub-UIDs) and a userspace network/storage stack.

# Helpers + userspace networking
sudo apt-get update
sudo apt-get install -y uidmap slirp4netns dbus-user-session fuse-overlayfs

# Confirm you have a subordinate ID range allocated (created by adduser on modern distros)
grep "^$(whoami):" /etc/subuid /etc/subgid
# /etc/subuid:vinod:100000:65536   <- 65536 IDs starting at 100000; expect a matching /etc/subgid line

If those lines are missing, allocate a non-overlapping range:

sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 "$(whoami)"
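To pick a range that does not collide with existing allocations, one option is to compute the first ID past every range already listed. The sketch below is a minimal helper, assuming the standard name:start:count format of /etc/subuid. The name next_free_subid is hypothetical, and the 100000 floor simply mirrors the example above rather than any required value.

```shell
# next_free_subid FILE
# Print the first ID beyond every range already allocated in FILE
# (format: name:start:count). Falls back to 100000 when FILE has no entries.
# Hypothetical helper for illustration; it only reads, it never allocates.
next_free_subid() {
  awk -F: '
    BEGIN { next_id = 100000 }
    /^[[:space:]]*#/ { next }            # skip comment lines
    NF >= 3 {
      end = $2 + $3                      # first ID after this range
      if (end > next_id) next_id = end
    }
    END { printf "%d\n", next_id }
  ' "$1"
}
```

A possible use: start=$(next_free_subid /etc/subuid), then pass "$start-$((start + 65535))" to usermod --add-subuids, and repeat against /etc/subgid for the group range.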

Install and start

Use the official rootless installer, then run the daemon as a systemd user unit so it survives logout via lingering.

# Pull and run the rootless setup script (installs the rootless dockerd shim)
curl -fsSL https://get.docker.com/rootless | sh

# Persist PATH + socket for this user
export PATH=$HOME/bin:$PATH
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock

# Run the user-level daemon now and keep it running after logout
systemctl --user enable --now docker
sudo loginctl enable-linger "$(whoami)"

Verify the daemon is genuinely rootless and using the userspace drivers:

docker info -f 'rootless={{println .SecurityOptions}}storage={{.Driver}}'
# rootless=[name=seccomp,profile=builtin name=rootless name=cgroupns]
# storage=overlay2     (native overlay2 on kernel 5.11+ or Ubuntu kernels; fuse-overlayfs otherwise)

On kernel 5.11+ (and on Ubuntu’s patched kernels) rootless Docker can use native overlay2 without fuse-overlayfs, which removes a significant I/O penalty. Keep fuse-overlayfs installed as the fallback for older kernels, and check the Storage Driver line of docker info to confirm which you actually got.
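Because the driver you get depends on the running kernel, a provisioning script may want to branch on the kernel version before deciding whether fuse-overlayfs is required. A minimal sketch of such a check follows. kernel_at_least is a hypothetical helper, and it assumes uname -r output of the usual MAJOR.MINOR.PATCH-suffix shape.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# Succeeds when RELEASE (as printed by uname -r) is at least MAJOR.MINOR.
# Hypothetical helper; only the first two version components are compared.
kernel_at_least() (
  rel=${1%%-*}                  # 6.8.0-41-generic -> 6.8.0
  major=${rel%%.*}              # 6
  rest=${rel#*.}                # 8.0
  minor=${rest%%.*}             # 8
  minor=${minor%%[!0-9]*}       # drop any non-numeric tail such as "15+"
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
)
```

Call it as kernel_at_least "$(uname -r)" followed by the minimum major and minor version you care about. The function body runs in a subshell, so its working variables do not leak into the caller.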

Networking

By default rootless Docker uses slirp4netns for the container network, because an unprivileged user cannot create the host-side veth/bridge that rootful Docker uses. That is the cost of running without root. Outbound traffic and published ports work; raw throughput is lower and ICMP is limited. If you need throughput, add bypass4netns, and check which RootlessKit port driver you are running:

# The port driver is not a daemon.json setting; it comes from the service
# environment. One way to set it is a systemd drop-in at
# ~/.config/systemd/user/docker.service.d/override.conf:
#   [Service]
#   Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=builtin"
# builtin (the default) is the faster driver; slirp4netns is slower but preserves
# the client source IP. Apply with: systemctl --user daemon-reload && systemctl --user restart docker

The key security property holds regardless of port driver: published ports are bound by your user, not by root.

3. Enable userns-remap on a rootful daemon

You cannot always run rootless — shared CI hosts, GPU passthrough, and some storage drivers still need a rootful daemon. The next best thing is user namespace remapping: the daemon stays rootful, but every container’s UID 0 is transparently mapped to a high, unprivileged host UID. Container root is no longer host root.

Configure the daemon to remap to a dedicated dockremap user:

{
  "userns-remap": "default"
}
# /etc/docker/daemon.json contains the JSON above. "default" creates and uses
# the `dockremap` user/group and writes its ranges to /etc/subuid|/etc/subgid.
sudo systemctl restart docker

# Confirm the mapping is live
docker info -f '{{println .SecurityOptions}}'
# [name=seccomp,profile=builtin name=userns]

# Inside a container, root *appears* as UID 0...
docker run --rm alpine id
# uid=0(root) gid=0(root)

# ...but the host sees the remapped high UID owning that process
docker run -d --name probe alpine sleep 300
ps -o uid,cmd -C sleep
#   UID CMD
# 100000 sleep 300        <- container root is host UID 100000, not 0

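Storage implications you must plan for. Remapping changes the on-disk ownership model in two ways. First, the daemon switches to a separate storage root under /var/lib/docker/ named after the remapped UID and GID, so existing images and containers are no longer visible and must be pulled or recreated. Second, bind-mounted host directories must be owned by (or writable for) the remapped UID, because container root is now an unprivileged host user. The example below uses 100000; substitute the first ID of dockremap’s range in /etc/subuid: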

# Make a host directory writable by remapped container root
sudo chown -R 100000:100000 /srv/appdata
docker run --rm -v /srv/appdata:/data alpine sh -c 'echo ok > /data/probe && cat /data/probe'
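The UID to chown to is not always 100000. With userns-remap it is the first ID of the remap user's range in /etc/subuid, and container UID N lands on that start plus N. The sketch below does that arithmetic, assuming the name:start:count file format and the dockremap user that "default" creates. host_uid_for is a hypothetical helper name.

```shell
# host_uid_for SUBUID_FILE REMAP_USER CONTAINER_UID
# Print the host UID that CONTAINER_UID maps to under userns-remap.
# Exits non-zero if REMAP_USER has no entry or the UID is outside its range.
# Hypothetical helper; uses the first entry found for REMAP_USER.
host_uid_for() {
  awk -F: -v user="$2" -v cuid="$3" '
    $1 == user {
      if (cuid + 0 < $3 + 0) {           # container UID must fit in the range
        printf "%d\n", $2 + cuid         # host UID = range start + container UID
        found = 1
      }
      exit
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

For example, sudo chown -R "$(host_uid_for /etc/subuid dockremap 0)" /srv/appdata targets whatever host UID container root actually maps to on that machine. The same calculation against /etc/subgid gives the group.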

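userns-remap is a daemon-wide setting. Every container gets the same mapping; the only per-container choice is to opt out with --userns=host. Some features are incompatible with remapping: --privileged (unless combined with --userns=host), sharing the host’s PID or network namespace (--pid=host, --network=host), and external volume or storage drivers that are unaware of the remapped IDs. Validate your full workload set in staging before flipping it in production.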

4. Drop all capabilities, add back only what you need

Even as remapped or rootless root, a container starts with a default capability set (CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID, NET_BIND_SERVICE, NET_RAW, and more). Most workloads need none of them after startup. Strip the lot and add back the minimum.

# A web service binding 8080 needs essentially nothing privileged
docker run --rm \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  myapp:latest

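# A service that must bind 80/443 directly needs NET_BIND_SERVICE. The official
# nginx image also starts as root, chowns its cache directories, and switches to
# the nginx user, so it needs CHOWN, SETGID, and SETUID on top.
docker run --rm \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --cap-add CHOWN --cap-add SETGID --cap-add SETUID \
  --security-opt no-new-privileges \
  nginx:stable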

In Compose this belongs in every service definition, not as an afterthought:

services:
  api:
    image: myapp:latest
    cap_drop: ["ALL"]
    cap_add: ["NET_BIND_SERVICE"]   # only if it binds < 1024
    security_opt:
      - "no-new-privileges:true"
    read_only: true                  # immutable rootfs; pair with tmpfs for /tmp
    tmpfs:
      - /tmp

no-new-privileges is the cheap, high-value flag people forget: it sets the kernel PR_SET_NO_NEW_PRIVS bit so a setuid binary inside the container can never gain privilege it was not started with. With --cap-drop ALL it neutralises the classic “setuid helper to regain caps” escalation.

Find the real minimum empirically. Start with --cap-drop ALL, run the workload’s full lifecycle, and add capabilities back one at a time only when you observe an EPERM the app cannot tolerate. Most stateless services run clean on zero.
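While iterating, it helps to see which capabilities a process actually holds. The kernel reports them as a hex bitmask on the CapEff line of /proc/<pid>/status. capsh --decode turns that mask into names, and the sketch below does the same in plain shell for images that lack capsh. decode_caps is a hypothetical helper, and the bit order is assumed to follow linux/capability.h on recent kernels (bits 0 to 40).

```shell
# decode_caps HEXMASK
# Print the capability names whose bits are set in HEXMASK (no 0x prefix),
# space-separated, lowest bit first. Prints an empty line for an empty set.
decode_caps() (
  mask=$((0x$1))
  names="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
    SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
    IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
    SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD
    LEASE AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG
    WAKE_ALARM BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"
  bit=0
  out=""
  for name in $names; do               # position in the list == bit number
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$name"
    fi
    bit=$((bit + 1))
  done
  printf '%s\n' "$out"
)
```

Inside a container, decode_caps "$(awk '/^CapEff/ {print $2}' /proc/self/status)" prints an empty line once --cap-drop ALL has taken effect, and lists exactly the names you added back otherwise.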

5. Author a custom seccomp profile

Docker’s default seccomp profile already blocks ~44 dangerous syscalls. A bespoke profile goes further: deny by default, allow only what the workload calls. Build it from observation, not guesswork.

Trace the real syscall surface

Run the container with seccomp unconfined but under strace (the image must contain strace, and you need SYS_PTRACE for the trace, which you remove again afterwards), and collect the unique syscalls across the workload’s whole lifecycle: startup, steady state, and graceful shutdown.

docker run --rm \
  --security-opt seccomp=unconfined \
  --cap-add SYS_PTRACE \
  --entrypoint strace \
  myapp:latest -f -c -qq /usr/local/bin/myapp 2>strace.out

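# Extract the syscall column from the summary table. Names may contain digits
# (accept4, epoll_create1), and the trailing "total" row is not a syscall.
awk 'NR>2 && $NF ~ /^[a-z_][a-z0-9_]*$/ && $NF != "total" {print $NF}' strace.out | sort -u > syscalls.txt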

Generate the profile

Start from Docker’s default profile (it has the correct architecture and header structure) and append your traced syscalls into a single allow rule. The skeleton of a deny-by-default profile:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "brk", "close", "connect", "epoll_create1",
        "epoll_ctl", "epoll_pwait", "execve", "exit_group", "fstat", "futex",
        "getpid", "getrandom", "listen", "mmap", "mprotect", "munmap",
        "nanosleep", "openat", "read", "rt_sigaction", "rt_sigprocmask",
        "sendto", "set_robust_list", "setsockopt", "socket", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
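A profile of this shape can be generated from syscalls.txt instead of being edited by hand. The sketch below is one way to do it with sort and awk only. make_seccomp_profile is a hypothetical helper, it hard-codes the x86_64 archMap shown above, and its output still needs review against the real workload before use.

```shell
# make_seccomp_profile < syscalls.txt > myapp-seccomp.json
# Read one syscall name per line on stdin and print a deny-by-default
# profile with a single allow rule. Lines that do not look like syscall
# names are dropped. Hypothetical helper; x86_64 only.
make_seccomp_profile() {
  sort -u | awk '
    BEGIN { n = 0 }
    /^[a-z_][a-z0-9_]*$/ { names[n++] = $0 }
    END {
      print "{"
      print "  \"defaultAction\": \"SCMP_ACT_ERRNO\","
      print "  \"defaultErrnoRet\": 1,"
      print "  \"archMap\": ["
      print "    {"
      print "      \"architecture\": \"SCMP_ARCH_X86_64\","
      print "      \"subArchitectures\": [\"SCMP_ARCH_X86\", \"SCMP_ARCH_X32\"]"
      print "    }"
      print "  ],"
      print "  \"syscalls\": ["
      print "    {"
      print "      \"names\": ["
      for (i = 0; i < n; i++)
        printf "        \"%s\"%s\n", names[i], (i < n - 1 ? "," : "")
      print "      ],"
      print "      \"action\": \"SCMP_ACT_ALLOW\""
      print "    }"
      print "  ]"
      print "}"
    }
  '
}
```

Generating the file keeps the profile reproducible: re-run the trace, regenerate, and commit the diff next to the Dockerfile.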

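defaultAction: SCMP_ACT_ERRNO means any syscall not explicitly allowed returns an error (EPERM, because defaultErrnoRet is 1) rather than killing the process, which is easier to debug than SCMP_ACT_KILL’s instant termination. The allow list must also cover the syscalls the container runtime issues between loading the filter and starting your binary, at least execve; otherwise the container fails before the workload runs. Apply it: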

docker run --rm \
  --security-opt seccomp=/path/to/myapp-seccomp.json \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  myapp:latest

Trace on the same kernel and libc you run in production. A glibc upgrade can swap epoll_wait for epoll_pwait2, or open for openat, and an over-tight profile will then fail in production but pass in your old test image. Re-trace on base-image bumps and treat the profile as a versioned artifact next to the Dockerfile.

6. Write and load an AppArmor profile

Capabilities and seccomp gate operations and syscalls; AppArmor gates objects — which paths and sockets a confined process may touch. Docker ships a docker-default profile; a custom one lets you forbid, say, all writes outside /tmp and all raw network access for a workload that only speaks TCP.

Author a profile that confines a service to read its binary, write only /tmp and /var/run, and use TCP/UDP only:

# Save as /etc/apparmor.d/docker-myapp
#include <tunables/global>

profile docker-myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  network inet tcp,
  network inet udp,
  network inet6 tcp,
  network inet6 udp,
  deny network raw,
  deny network packet,

  # Application code: read, map as executable, re-exec under this same profile; no write
  /usr/local/bin/myapp mrix,
  /usr/local/lib/** mr,

  # Writable scratch only
  /tmp/ rw,
  /tmp/** rw,
  /var/run/ rw,
  /var/run/** rw,

  # Hard denies for classic breakout paths
  deny /proc/sys/kernel/** w,
  deny /sys/** w,
  deny mount,
  # No blanket "deny /** wl" here: deny rules override allow rules, so it would
  # also block /tmp. Paths that no rule allows are already denied.
}

Load it into the kernel and run the container under it:

# Parse and load; -r replaces an already-loaded copy, so the same command works when iterating
sudo apparmor_parser -r -W /etc/apparmor.d/docker-myapp

# Confirm it is loaded
sudo aa-status | grep docker-myapp

# Run confined
docker run --rm \
  --security-opt apparmor=docker-myapp \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  myapp:latest

Rule order does not matter in AppArmor, and an explicit deny always overrides an allow. That is why the profile has no blanket deny on /**: it would cancel the /tmp/** rw rule, and it is unnecessary because any path that no rule allows is refused in enforce mode. Build the profile in complain mode first (flags=(complain) or aa-complain), exercise the workload, then read /var/log/syslog or journalctl -k for apparmor="ALLOWED" audit lines to discover legitimate accesses before switching to enforce.
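Reading the complain-mode audit lines by eye gets tedious quickly. A small filter can reduce them to the unique path and access-mask pairs a profile was observed using. The sketch below assumes the usual kernel audit line layout with name="..." and requested_mask="..." fields. apparmor_accesses is a hypothetical helper, and events without a name field (network events, for example) are skipped.

```shell
# apparmor_accesses PROFILE
# Read kernel log lines on stdin and print the unique "path mask" pairs that
# PROFILE was allowed in complain mode. Hypothetical helper for illustration.
apparmor_accesses() {
  grep -F 'apparmor="ALLOWED"' |
    grep -F "profile=\"$1\"" |
    sed -n 's/.* name="\([^"]*\)".* requested_mask="\([^"]*\)".*/\1 \2/p' |
    sort -u
}
```

Typical use would be journalctl -k | apparmor_accesses docker-myapp, with each output line then reviewed as a candidate rule rather than pasted into the profile blindly.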

AppArmor is path-based and Ubuntu/Debian-native; RHEL-family hosts use SELinux instead, which is label-based and configured through --security-opt label=.... Pick the one your distro ships and enforces by default. Running neither is the failure mode to avoid.

Verify

Prove each layer with a focused privilege-escalation test battery. A hardened container should fail every one of these.

# 1. Container root must NOT be host root (userns-remap or rootless).
docker run -d --name esc --cap-drop ALL alpine sleep 600
ps -o uid,cmd -C sleep | grep sleep      # expect a high UID (100000+), never 0

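# 2. Capability-gated file ops must fail: chown to another user needs CAP_CHOWN.
#    (A plain write to /etc proves nothing: container root owns that directory.)
docker run --rm --cap-drop ALL alpine \
  sh -c 'touch /tmp/f && chown nobody /tmp/f 2>&1 && echo "chown ALLOWED (bad)" || echo "DENIED (good)"'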

# 3. A blocked syscall must error under the seccomp profile.
docker run --rm --security-opt seccomp=/path/to/myapp-seccomp.json alpine \
  sh -c 'unshare -U 2>&1 || echo "unshare DENIED (good)"'

# 4. Raw sockets blocked by AppArmor + dropped NET_RAW.
docker run --rm --cap-drop ALL --security-opt apparmor=docker-myapp alpine \
  sh -c 'ping -c1 127.0.0.1 2>&1 || echo "raw socket DENIED (good)"'

# 5. no-new-privileges blocks setuid escalation.
docker run --rm --security-opt no-new-privileges --cap-drop ALL alpine \
  sh -c 'echo "nnp active:"; cat /proc/self/status | grep NoNewPrivs'
# NoNewPrivs:    1   <- escalation via setuid binaries is impossible

# 6. The host docker socket must never be reachable from a workload.
docker run --rm alpine sh -c 'ls /var/run/docker.sock 2>&1 || echo "no socket (good)"'

docker rm -f esc

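If any test succeeds where it should be denied, that layer is misconfigured; the usual causes are a stray --privileged, a forgotten --cap-add, or a profile that did not load. Tests 3 and 4 run alpine’s shell under profiles written for myapp, so the container may fail before the probe command even runs. Treat that as a denial but not as proof of the specific rule: check the AppArmor audit lines in journalctl -k, or rerun the probe from the workload image, to confirm which control fired.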
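For repeated runs, the battery is easier to read when each probe reports a uniform PASS or FAIL. A minimal sketch of such a wrapper is below. expect_denied is a hypothetical helper, and it treats any non-zero exit status as a denial, which includes a container that fails to start.

```shell
# expect_denied LABEL COMMAND [ARGS...]
# Run COMMAND quietly. Print PASS when it exits non-zero (the action was
# refused) and FAIL with status 1 when it succeeds.
expect_denied() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "FAIL: $label (operation was allowed)"
    return 1
  fi
  echo "PASS: $label (denied)"
}
```

A probe then becomes a single line, for example expect_denied "chown without CAP_CHOWN" docker run --rm --cap-drop ALL alpine chown nobody /tmp, and a CI job can fail on the first FAIL.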

Trade-offs

Hardening is not free; budget for these before you roll it out fleet-wide.

Decision | Cost | When it bites
Rootless networking (slirp4netns) | Lower throughput, no native ICMP, NAT overhead | High-PPS or latency-sensitive services; use bypass4netns
Ports below 1024 rootless | Unprivileged users cannot bind <1024 | Bind high and front with a host reverse proxy, or set net.ipv4.ip_unprivileged_port_start
Storage drivers | Rootless prefers overlay2 (kernel 5.11+) or fuse-overlayfs; devicemapper/some volume plugins unsupported | GPU, FUSE-heavy, or vendor-storage workloads
userns-remap disk | Separate storage root under /var/lib/docker/<uid>.<gid>; images re-pulled, bind mounts need chown | First rollout; plan disk + a maintenance window
Over-tight seccomp/AppArmor | Production-only failures after libc/kernel bumps | Treat profiles as versioned artifacts; re-trace on base-image changes

The honest summary: rootless plus capability dropping is the highest security-per-effort and should be the default for stateless services. userns-remap is the pragmatic answer when you must keep a rootful daemon. Custom seccomp and AppArmor profiles are worth the authoring cost for your crown-jewel workloads but are overkill applied blindly to everything — start with the hardened defaults and tighten where the blast radius justifies it.

Enterprise scenario

A fintech platform team ran a multi-tenant GitLab CI fleet where every runner exposed /var/run/docker.sock into build containers so jobs could run docker build. A red-team exercise broke out trivially: any pipeline could mount a host path through the shared socket and read another tenant’s checked-out secrets. The socket was the host.

They could not go fully rootless overnight — some jobs needed buildx with a specific storage driver — so they staged it. First, they killed socket mounting entirely and moved image builds to rootless BuildKit running as a sidecar per job, so each build ran inside the job’s own user namespace with no host-root daemon anywhere in the path. For the residual rootful runners (GPU integration tests), they enabled userns-remap and pinned a per-runner subordinate range so tenants could not collide on host UIDs.

# .gitlab-ci.yml — rootless image build, no docker socket, no privileged flag
build:
  image:
    name: moby/buildkit:rootless
    entrypoint: [""]   # override the image entrypoint so the runner can start a shell
  variables:
    BUILDKITD_FLAGS: --oci-worker-no-process-sandbox
  script:
    - mkdir -p ~/.docker && echo '{}' > ~/.docker/config.json
    - buildctl-daemonless.sh build
        --frontend dockerfile.v0
        --local context=.
        --local dockerfile=.
        --output type=image,name=registry.acme.io/app:$CI_COMMIT_SHA,push=true

The constraint that drove the design was tenant isolation under a shared runner pool, and the fix was to ensure no privileged daemon was ever reachable from tenant code. After rollout, the same red-team breakout returned permission denied at the namespace boundary — the build no longer had a host-root socket to abuse, and the GPU runners’ remapping meant a container escape landed on an unprivileged, per-runner UID with nothing to steal.
