The default Docker install puts a root-owned daemon on a UNIX socket and runs your containers as UID 0 by default. That is two privilege problems stacked on top of each other: the daemon is a root-equivalent service, and the process inside the container is root mapped one-to-one onto the host’s root. This guide closes both gaps layer by layer — rootless daemon, user namespace remapping, capability dropping, and bespoke seccomp/AppArmor profiles — with commands you can run on a clean Ubuntu 22.04/24.04 host today.
1. The threat model
Defence in depth only makes sense once you name the escape vectors. These are the ones that actually matter at runtime, in rough order of how often they bite.
| Vector | What it abuses | Primary control |
|---|---|---|
| Root-in-container -> root-on-host | UID 0 in the container maps to UID 0 on the host through a bind mount or kernel bug | userns-remap, or rootless daemon |
| Excess capabilities | Default grants such as CAP_NET_RAW, CAP_DAC_OVERRIDE, and CAP_SETUID, plus anything added with --cap-add (CAP_SYS_ADMIN, CAP_DAC_READ_SEARCH) | --cap-drop ALL + selective add-back |
| Dangerous syscalls | keyctl, unshare, mount, bpf, kernel-exploit primitives | seccomp profile |
| Filesystem / network breakout | Writing host paths, reaching the metadata endpoint | AppArmor (or SELinux) confinement |
| Daemon socket exposure | A container with /var/run/docker.sock mounted owns the host | Rootless daemon; never mount the socket |
The through-line is the root-in-container problem. By default the process inside a container runs as UID 0, and that UID 0 is the same UID 0 the kernel sees on the host. If the container ever touches a host resource — a bind-mounted directory, a device node, a leaked file descriptor — it does so with host root authority. Every layer below either removes that authority (rootless, userns-remap) or fences in what root can still do (capabilities, seccomp, AppArmor).
A useful mental model: capabilities decide what privileged operations a process may attempt, seccomp decides which syscalls it may issue, and AppArmor decides which files and sockets those syscalls may touch. They are independent gates; a syscall must pass all three.
2. Install and configure rootless Docker
Rootless mode runs dockerd and your containers entirely inside your own user’s namespaces. There is no root-owned daemon, so a daemon compromise yields your user, not the box.
Prerequisites
You need newuidmap/newgidmap (the setuid helpers that map your allocated subordinate UID/GID range into a user namespace) and a userspace network/storage stack.
# Helpers + userspace networking
sudo apt-get update
sudo apt-get install -y uidmap slirp4netns dbus-user-session fuse-overlayfs
# Confirm you have a subordinate ID range allocated (created by adduser on modern distros)
grep "^$(whoami):" /etc/subuid /etc/subgid
# vinod:100000:65536 <- 65536 IDs starting at 100000, for both files
If those lines are missing, allocate a non-overlapping range:
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 "$(whoami)"
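Before picking a range by hand, it can help to compute the first ID that no existing entry covers. The helper below is a minimal sketch, not part of Docker or shadow-utils: next_free_subid is a hypothetical name, and it assumes the file follows the standard name:start:count layout of /etc/subuid and /etc/subgid.

```shell
# Print the first ID after every range already listed in a subuid/subgid-style
# file (lines of name:start:count). Falls back to 100000 when nothing is listed.
next_free_subid() {  # next_free_subid <file>
    awk -F: '
        NF >= 3 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            end = $2 + $3            # first ID past this range
            if (end > max) max = end
        }
        END { printf "%.0f\n", (max > 0 ? max : 100000) }
    ' "$1"
}

# Example (not executed here):
#   start=$(next_free_subid /etc/subuid)
#   sudo usermod --add-subuids "${start}-$((start + 65535))" "$(whoami)"
```

Run it once against /etc/subuid and once against /etc/subgid; the two files are allocated independently and can diverge.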
Install and start
Use the official rootless installer, then run the daemon as a systemd user unit so it survives logout via lingering.
# Pull and run the rootless setup script (installs the rootless dockerd shim)
curl -fsSL https://get.docker.com/rootless | sh
# Set PATH + socket for this shell (add both lines to ~/.bashrc to persist them)
export PATH=$HOME/bin:$PATH
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
# Run the user-level daemon now and keep it running after logout
systemctl --user enable --now docker
sudo loginctl enable-linger "$(whoami)"
Verify the daemon is genuinely rootless and using the userspace drivers:
docker info -f 'rootless={{println .SecurityOptions}}storage={{.Driver}}'
# rootless=[name=seccomp,profile=builtin name=rootless name=cgroupns]
# storage=overlay2 (fuse-overlayfs on older kernels; native overlay2 on 5.11+)
On kernel 5.11+ (and on Ubuntu’s patched kernels before that) rootless Docker can use native overlay2 without fuse-overlayfs, which removes a significant I/O penalty. Keep fuse-overlayfs installed as the fallback for older kernels, but check the Storage Driver line of docker info to confirm which one you actually got.
Networking
By default rootless Docker uses slirp4netns for the container network, because an unprivileged user cannot create the host-side veth/bridge that rootful Docker uses. That is the cost of running without root. Outbound traffic and published ports work; raw throughput is lower and ICMP is restricted. Published ports go through RootlessKit’s port driver, which defaults to builtin: it is the fast option, but it hides the client’s source IP from the container. Switch it to slirp4netns only when you need the real source IP and can accept lower throughput. If the container network itself is the bottleneck, evaluate bypass4netns. The port driver is set like this:
# The port driver is not a daemon.json setting; it is read from the service
# environment. Put this in ~/.config/systemd/user/docker.service.d/override.conf:
# [Service]
# Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=slirp4netns"
# then: systemctl --user daemon-reload && systemctl --user restart docker
The key security property holds regardless of port driver: published ports are bound by your user, not by root.
3. Enable userns-remap on a rootful daemon
You cannot always run rootless — shared CI hosts, GPU passthrough, and some storage drivers still need a rootful daemon. The next best thing is user namespace remapping: the daemon stays rootful, but every container’s UID 0 is transparently mapped to a high, unprivileged host UID. Container root is no longer host root.
Configure the daemon to remap to a dedicated dockremap user:
{
  "userns-remap": "default"
}
# /etc/docker/daemon.json contains the JSON above. "default" creates and uses
# the `dockremap` user/group and writes its ranges to /etc/subuid|/etc/subgid.
sudo systemctl restart docker
# Confirm the mapping is live
docker info -f '{{println .SecurityOptions}}'
# [name=seccomp,profile=builtin name=userns]
# Inside a container, root *appears* as UID 0...
docker run --rm alpine id
# uid=0(root) gid=0(root)
# ...but the host sees the remapped high UID owning that process
docker run -d --name probe alpine sleep 300
ps -o uid,cmd -C sleep
# UID CMD
# 100000 sleep 300 <- container root is host UID 100000, not 0
Storage implications you must plan for. Remapping changes the on-disk ownership model in two ways:
- Docker creates a separate storage root keyed by the map, e.g. /var/lib/docker/100000.100000/. Images are not shared with the non-remapped daemon, so expect a one-time re-pull and extra disk.
- Bind mounts are now owned, from the host’s perspective, by the remapped range. A volume that needs to be written by container root must be chowned to the mapped UID (100000 here) on the host, or the write fails with EACCES. This is the single most common operational surprise when adopting userns-remap.
# Make a host directory writable by remapped container root
sudo chown -R 100000:100000 /srv/appdata
docker run --rm -v /srv/appdata:/data alpine sh -c 'echo ok > /data/probe && cat /data/probe'
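The chown target follows directly from the mapping: container UID N lands on host UID start + N, as long as N is inside the allocated count. The sketch below makes that arithmetic explicit. map_uid is a hypothetical helper, and 100000:65536 is only the example range used in this guide; read your real one from /etc/subuid.

```shell
# Translate a container UID to the host UID it is remapped to.
map_uid() {  # map_uid <container_uid> <range_start> <range_count>
    cuid=$1
    start=$2
    count=$3
    # UIDs outside the allocated count have no mapping at all.
    if [ "$cuid" -lt 0 ] || [ "$cuid" -ge "$count" ]; then
        echo "uid $cuid is outside the mapped range" >&2
        return 1
    fi
    echo $((start + cuid))
}

map_uid 0 100000 65536      # container root        -> 100000
map_uid 1000 100000 65536   # container user 1000   -> 101000
```

This matters for images that run as a non-root user: a service running as UID 1000 inside the container needs its bind mount owned by 101000 on the host in this example, not by 100000.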
userns-remap is daemon-wide. Every container on the daemon gets the same mapping; the only per-container choice is to opt out entirely with --userns=host. A handful of features are incompatible with remapping: --privileged (unless combined with --userns=host), --pid=host and --network=host, and external volume or storage drivers that do not understand the daemon’s mapping. Validate your full workload set in staging before flipping it in production.
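Part of that staging validation can be automated by linting the docker run invocations you already have. The following is a rough sketch under stated assumptions: userns_conflicts is a hypothetical helper, and it flags only the options Docker documents as incompatible with a remapped daemon (--privileged, --pid=host, --network=host) unless the container opts out with --userns=host. It does not distinguish Docker options from arguments that follow the image name.

```shell
# Print docker-run flags that conflict with userns-remap; return 1 if any are found.
userns_conflicts() {  # userns_conflicts <docker run args...>
    conflicts=""
    opted_out=0
    prev=""
    for arg in "$@"; do
        # Normalise "--flag value" into "--flag=value" so both spellings match.
        case "$prev" in
            --userns|--pid|--network|--net) arg="$prev=$arg" ;;
        esac
        case "$arg" in
            --userns=host) opted_out=1 ;;
            --privileged|--pid=host|--network=host|--net=host)
                conflicts="${conflicts:+$conflicts }$arg" ;;
        esac
        prev=$arg
    done
    # A container that opts out of remapping is not subject to the restriction.
    if [ "$opted_out" -eq 0 ] && [ -n "$conflicts" ]; then
        echo "$conflicts"
        return 1
    fi
    return 0
}

# Example: userns_conflicts run --rm --privileged --network host alpine
# prints "--privileged --network=host" and returns 1.
```

Treat a hit as a prompt to review the workload rather than to add --userns=host reflexively, since opting out puts container root back on host UID 0.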
4. Drop all capabilities, add back only what you need
Even as remapped or rootless root, a container starts with a default capability set (CHOWN, DAC_OVERRIDE, FOWNER, SETUID, SETGID, NET_BIND_SERVICE, NET_RAW, and more). Most workloads need none of them after startup. Strip the lot and add back the minimum.
# A web service binding 8080 needs essentially nothing privileged
docker run --rm \
--cap-drop ALL \
--security-opt no-new-privileges \
myapp:latest
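# A service that must bind 80/443 directly needs NET_BIND_SERVICE. The stock
# nginx image also starts as root and switches to the nginx user, so it
# additionally needs CHOWN, SETGID, and SETUID to get through startup.
docker run --rm \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--cap-add CHOWN --cap-add SETGID --cap-add SETUID \
--security-opt no-new-privileges \
nginx:stable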
In Compose this belongs in every service definition, not as an afterthought:
services:
  api:
    image: myapp:latest
    cap_drop: ["ALL"]
    cap_add: ["NET_BIND_SERVICE"] # only if it binds < 1024
    security_opt:
      - "no-new-privileges:true"
    read_only: true # immutable rootfs; pair with tmpfs for /tmp
    tmpfs:
      - /tmp
no-new-privileges is the cheap, high-value flag people forget: it sets the kernel PR_SET_NO_NEW_PRIVS bit so a setuid binary inside the container can never gain privilege it was not started with. With --cap-drop ALL it neutralises the classic “setuid helper to regain caps” escalation.
Find the real minimum empirically. Start with --cap-drop ALL, run the workload’s full lifecycle, and add capabilities back one at a time only when you observe an EPERM the app cannot tolerate. Most stateless services run clean on zero.
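To check what a process actually holds, read the CapEff line from /proc/<pid>/status and decode the hex mask. capsh --decode does this where libcap tools are installed; the sketch below is a dependency-free alternative. decode_caps is a hypothetical helper, and the name table follows the bit order in linux/capability.h.

```shell
# Capability names in kernel bit order (bit 0 first).
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# Decode a hex capability mask (as printed in /proc/<pid>/status) into names.
decode_caps() {  # decode_caps <hex mask without 0x>
    mask=$((0x$1))
    bit=0
    out=""
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

decode_caps 0000000000000400   # NET_BIND_SERVICE
decode_caps 0000000000000000   # (empty line: nothing held)
```

Inside a container started with --cap-drop ALL, grep CapEff /proc/1/status should show an all-zero mask; any non-zero bits decoded here are capabilities still worth questioning.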
5. Author a custom seccomp profile
Docker’s default seccomp profile already blocks ~44 dangerous syscalls. A bespoke profile goes further: deny by default, allow only what the workload calls. Build it from observation, not guesswork.
Trace the real syscall surface
Run the container with seccomp unconfined but under strace (the image must ship strace, and you add SYS_PTRACE for the trace and remove it again afterwards), and collect the unique syscalls across the workload’s full lifecycle: startup, steady state, and graceful shutdown.
docker run --rm \
--security-opt seccomp=unconfined \
--cap-add SYS_PTRACE \
--entrypoint strace \
myapp:latest -f -c -qq /usr/local/bin/myapp 2>strace.out
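# Extract the syscall column from the summary table. Data rows start with a
# numeric "% time" field; names may contain digits (accept4, epoll_create1).
awk '$1 ~ /^[0-9.]+$/ && $NF ~ /^[a-z_][a-z0-9_]*$/ && $NF != "total" {print $NF}' strace.out | sort -u > syscalls.txt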
Generate the profile
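Start from Docker’s default profile for its architecture map and header structure, then replace its allow list with your traced syscalls in a single allow rule. The names below are illustrative; a real trace is longer and must include execve, which the runtime issues to start your entrypoint after the filter is loaded. The skeleton of a deny-by-default profile: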
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": [
        "accept4", "bind", "brk", "close", "connect", "epoll_create1",
        "epoll_ctl", "epoll_pwait", "exit_group", "fstat", "futex",
        "getpid", "getrandom", "listen", "mmap", "mprotect", "munmap",
        "nanosleep", "openat", "read", "rt_sigaction", "rt_sigprocmask",
        "sendto", "set_robust_list", "setsockopt", "socket", "write"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
defaultAction: SCMP_ACT_ERRNO means any syscall not explicitly allowed returns an error rather than killing the process — easier to debug than SCMP_ACT_KILL, which terminates instantly. Apply it:
docker run --rm \
--security-opt seccomp=/path/to/myapp-seccomp.json \
--cap-drop ALL \
--security-opt no-new-privileges \
myapp:latest
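Hand-editing the allow list gets error-prone once the trace grows, so the profile can be generated from syscalls.txt instead. This is a minimal sketch, not a Docker tool: make_seccomp_profile is a hypothetical helper, it emits the same x86-64 skeleton shown above, and it always adds execve on the assumption that the runtime needs it to start the entrypoint.

```shell
# Emit a deny-by-default seccomp profile from a file of syscall names.
make_seccomp_profile() {  # make_seccomp_profile <file, one syscall name per line>
    # Keep only well-formed names, de-duplicate, and join as a JSON string list.
    names=$(
        { awk 1 "$1"; echo execve; } \
            | LC_ALL=C grep -E '^[a-z_][a-z0-9_]*$' \
            | LC_ALL=C sort -u \
            | awk '{ printf "%s\"%s\"", (NR > 1 ? ", " : ""), $0 }'
    )
    cat <<EOF
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    { "architecture": "SCMP_ARCH_X86_64", "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"] }
  ],
  "syscalls": [
    { "names": [$names], "action": "SCMP_ACT_ALLOW" }
  ]
}
EOF
}

# Example (not executed here):
#   make_seccomp_profile syscalls.txt > myapp-seccomp.json
```

Commit the generated JSON rather than regenerating it at deploy time, so a change in the allow list shows up as a reviewable diff.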
Trace on the same kernel and libc you run in production. A glibc upgrade can swap epoll_wait for epoll_pwait2, or open for openat, and an over-tight profile will then fail in production but pass in your old test image. Re-trace on base-image bumps and treat the profile as a versioned artifact next to the Dockerfile.
6. Write and load an AppArmor profile
Capabilities and seccomp gate operations and syscalls; AppArmor gates objects — which paths and sockets a confined process may touch. Docker ships a docker-default profile; a custom one lets you forbid, say, all writes outside /tmp and all raw network access for a workload that only speaks TCP.
Author a profile that confines a service to read its binary, write only /tmp and /var/run, and use TCP/UDP only:
# Save as /etc/apparmor.d/docker-myapp
#include <tunables/global>
profile docker-myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network inet tcp,
  network inet udp,
  network inet6 tcp,
  network inet6 udp,
  deny network raw,
  deny network packet,
  # Read-only application code
  /usr/local/bin/myapp r,
  /usr/local/lib/** mr,
  # Writable scratch only
  /tmp/ rw,
  /tmp/** rw,
  /var/run/ rw,
  /var/run/** rw,
  # Hard denies for classic breakout paths
  deny /proc/sys/kernel/** w,
  deny /sys/** w,
  deny mount,
  # No catch-all "deny /** wl" here: anything not allowed above is already
  # denied, and an explicit deny would override the /tmp and /var/run allows.
}
Load it into the kernel and run the container under it:
# Parse and load; -r replaces an already-loaded copy, so rerun freely while iterating
sudo apparmor_parser -r -W /etc/apparmor.d/docker-myapp
# Confirm it is loaded
sudo aa-status | grep docker-myapp
# Run confined
docker run --rm \
--security-opt apparmor=docker-myapp \
--cap-drop ALL \
--security-opt no-new-privileges \
myapp:latest
Rule order does not matter in AppArmor, and a deny rule always overrides an allow rule, however specific the allow is. Because a profile already denies everything it does not explicitly allow, a trailing catch-all such as deny /** wl adds nothing and would override the /tmp/** rw allow, so leave it out. Build the profile in complain mode first (flags=(complain) or aa-complain), exercise the workload, then read /var/log/syslog or journalctl -k for apparmor="ALLOWED" audit lines to discover legitimate accesses before switching to enforce.
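Reading those audit lines by eye gets tedious, so a small filter can reduce them to the path and access mode for one profile. This is a sketch under assumptions: apparmor_accesses is a hypothetical helper, and it relies on the profile="...", name="...", and requested_mask="..." fields that AppArmor audit records normally carry; records without a name field (network or capability events, for example) are skipped.

```shell
# From kernel log lines on stdin, list unique "path mask" pairs for one profile.
apparmor_accesses() {  # apparmor_accesses <profile name>
    grep -F "profile=\"$1\"" \
        | sed -n 's/.* name="\([^"]*\)".* requested_mask="\([^"]*\)".*/\1 \2/p' \
        | LC_ALL=C sort -u
}

# Example (not executed here):
#   journalctl -k | grep 'apparmor="ALLOWED"' | apparmor_accesses docker-myapp
```

Each output line is a candidate rule: a path the workload touched and the permission it asked for. Review them before copying any into the profile, since complain mode records everything the process did, including accesses you may prefer to keep denied.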
AppArmor is path-based and Ubuntu/Debian-native; RHEL-family hosts use SELinux instead, which is label-based and configured through --security-opt label=.... Pick the one your distro ships and enforces by default. Running neither is the failure mode to avoid.
Verify
Prove each layer with a focused privilege-escalation test battery. A hardened container should fail every one of these.
# 1. Container root must NOT be host root (userns-remap or rootless).
docker run -d --name esc --cap-drop ALL alpine sleep 600
ps -o uid,cmd -C sleep | grep sleep # expect a high UID (100000+), never 0
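# 2. Privileged file ops should be denied: changing ownership needs CAP_CHOWN.
# (Writing to root-owned paths is not a valid probe; container root owns them.)
docker run --rm --cap-drop ALL alpine \
sh -c 'touch /tmp/f && chown nobody /tmp/f 2>&1 || echo "DENIED (good)"'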
# 3. A blocked syscall must error under the seccomp profile.
docker run --rm --security-opt seccomp=/path/to/myapp-seccomp.json alpine \
sh -c 'unshare -U 2>&1 || echo "unshare DENIED (good)"'
# 4. Raw sockets blocked by AppArmor + dropped NET_RAW.
docker run --rm --cap-drop ALL --security-opt apparmor=docker-myapp alpine \
sh -c 'ping -c1 127.0.0.1 2>&1 || echo "raw socket DENIED (good)"'
# 5. no-new-privileges blocks setuid escalation.
docker run --rm --security-opt no-new-privileges --cap-drop ALL alpine \
sh -c 'echo "nnp active:"; cat /proc/self/status | grep NoNewPrivs'
# NoNewPrivs: 1 <- escalation via setuid binaries is impossible
# 6. The host docker socket must never be reachable from a workload.
docker run --rm alpine sh -c 'ls /var/run/docker.sock 2>&1 || echo "no socket (good)"'
docker rm -f esc
If any test succeeds where it should be denied, that layer is misconfigured — most often a stray --privileged, a forgotten --cap-add, or a profile that did not load.
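The battery above prints its verdicts for a human to read. For CI, a small wrapper can turn "this command must fail" into an exit code. The sketch below uses a hypothetical helper name, expect_denied, and treats any non-zero exit as a denial, so a missing image or a typo also counts as "denied"; check the underlying error when a result looks too good.

```shell
# Run a command that is supposed to be blocked; succeed only if it fails.
expect_denied() {  # expect_denied <label> <command...>
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "FAIL: $label succeeded but should have been denied"
        return 1
    fi
    echo "PASS: $label was denied"
}

# Example (not executed here):
#   expect_denied "chown without CAP_CHOWN" \
#     docker run --rm --cap-drop ALL alpine chown nobody /tmp
```

Chain the calls with && in a pipeline job so the first layer that regresses fails the build.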
Trade-offs
Hardening is not free; budget for these before you roll it out fleet-wide.
| Decision | Cost | When it bites |
|---|---|---|
| Rootless networking (slirp4netns) | Lower throughput, no native ICMP, NAT overhead | High-PPS or latency-sensitive services; use bypass4netns |
| Ports below 1024 rootless | Unprivileged users cannot bind <1024 | Bind high and front with a host reverse proxy, or set net.ipv4.ip_unprivileged_port_start |
| Storage drivers | Rootless prefers overlay2 (kernel 5.11+) or fuse-overlayfs; devicemapper/some volume plugins unsupported | GPU, FUSE-heavy, or vendor-storage workloads |
| userns-remap disk | Separate /var/lib/docker/<map> storage root; images re-pulled, bind mounts need chown | First rollout; plan disk + a maintenance window |
| Over-tight seccomp/AppArmor | Production-only failures after libc/kernel bumps | Treat profiles as versioned artifacts; re-trace on base-image changes |
The honest summary: rootless plus capability dropping is the highest security-per-effort and should be the default for stateless services. userns-remap is the pragmatic answer when you must keep a rootful daemon. Custom seccomp and AppArmor profiles are worth the authoring cost for your crown-jewel workloads but are overkill applied blindly to everything — start with the hardened defaults and tighten where the blast radius justifies it.
Enterprise scenario
A fintech platform team ran a multi-tenant GitLab CI fleet where every runner exposed /var/run/docker.sock into build containers so jobs could run docker build. A red-team exercise broke out trivially: any pipeline could mount a host path through the shared socket and read another tenant’s checked-out secrets. The socket was the host.
They could not go fully rootless overnight — some jobs needed buildx with a specific storage driver — so they staged it. First, they killed socket mounting entirely and moved image builds to rootless BuildKit running as a sidecar per job, so each build ran inside the job’s own user namespace with no host-root daemon anywhere in the path. For the residual rootful runners (GPU integration tests), they enabled userns-remap and pinned a per-runner subordinate range so tenants could not collide on host UIDs.
# .gitlab-ci.yml — rootless image build, no docker socket, no privileged flag
build:
  image: moby/buildkit:rootless
  variables:
    BUILDKITD_FLAGS: --oci-worker-no-process-sandbox
  script:
    - mkdir -p ~/.docker && echo '{}' > ~/.docker/config.json
    - buildctl-daemonless.sh build
      --frontend dockerfile.v0
      --local context=.
      --local dockerfile=.
      --output type=image,name=registry.acme.io/app:$CI_COMMIT_SHA,push=true
The constraint that drove the design was tenant isolation under a shared runner pool, and the fix was to ensure no privileged daemon was ever reachable from tenant code. After rollout, the same red-team breakout returned permission denied at the namespace boundary — the build no longer had a host-root socket to abuse, and the GPU runners’ remapping meant a container escape landed on an unprivileged, per-runner UID with nothing to steal.