
Deploy Talos Linux Immutable Kubernetes Nodes with Cluster API

A fintech platform team is tired of pet nodes. Their Kubernetes fleet runs on a mix of bare-metal in a colo and VMs on vSphere, and every node is a slightly different snowflake: someone SSH’d in last quarter to patch a CVE, the kubelet flags drifted, and a security audit found an interactive shell, a package manager, and sixty unaccounted-for packages on every host. The auditors flagged it; the CISO wants nodes that cannot be logged into, cannot drift, and can be rebuilt identically from a Git commit. The ask is concrete: replace SSH-managed Ubuntu nodes with Talos Linux — a minimal, immutable, API-only OS purpose-built for Kubernetes — and manage the whole node lifecycle declaratively through Cluster API (CAPI), so a node is a reconciled resource, not a server someone tends. This guide walks the full build: a CAPI management cluster, a Talos-backed workload cluster, GitOps for both, and the identity, secrets, security, and observability tooling a regulated shop needs around it.

Talos earns its place here because it removes the attack surface that started the audit. There is no SSH, no shell, no systemd, no apt/yum, no /bin/bash: the host exposes only a gRPC API, driven with the talosctl client and secured by mutual TLS. Configuration is a single declarative YAML machine config; you cannot “fix one node by hand,” because there is no hand to fix it with. Cluster API then treats that machine config as the desired state and reconciles Machine, MachineDeployment, and TalosControlPlane objects (the Talos counterpart of KubeadmControlPlane) into real, immutable nodes. Together they turn “patch the fleet” into “change a value and let the controller roll it.”

Prerequisites

Target topology

Figure: topology of the Talos Linux and Cluster API deployment.

The build has two clusters and one Git repo. A small, long-lived management cluster runs the Cluster API controllers (core CAPI, the vSphere infra provider, and the Talos bootstrap + control-plane providers). It owns the lifecycle of one or more workload clusters, each made entirely of Talos Linux nodes — three control-plane machines behind a VIP and a MachineDeployment of worker machines. Everything that defines a cluster — the CAPI manifests, the Talos machine config patches, and the workload addons — lives in Git and is reconciled by Argo CD running on the management cluster. Around the edges sit the enterprise services: Okta/Entra ID federates human access to both talosctl and kubectl; HashiCorp Vault issues the secrets and PKI; Wiz/Wiz Code scans posture and IaC; CrowdStrike Falcon watches runtime; Dynatrace/Datadog observes; ServiceNow gates changes; Jenkins/GitHub Actions runs CI; Terraform/Ansible provisions the substrate; and Akamai fronts public ingress. The defining property is immutability: a node is never modified in place — to change it, you change its config in Git and CAPI rolls a replacement.

1. Provision the substrate with Terraform

Talos nodes need somewhere to boot. Use Terraform to stand up the management cluster’s host (or kind on a builder), the workload-cluster networks, the control-plane VIP reservation, and DNS — keeping the substrate itself declarative and reviewable. Wiz Code scans this IaC in the pull request for misconfigurations (open security groups, public buckets) before it ever applies.

# infra/vsphere.tf: Talos VM template that CAPI clones for every node
# (the referenced data sources are assumed to be declared elsewhere in this module)
resource "vsphere_virtual_machine" "talos_template" {
  name             = "talos-v1.9.2-template"
  resource_pool_id = data.vsphere_resource_pool.pool.id
  datastore_id     = data.vsphere_datastore.ds.id
  num_cpus         = 4
  memory           = 8192
  guest_id         = "other5xLinux64Guest"
  # OVA built from the official Talos vSphere image (factory.talos.dev)
  ovf_deploy { remote_ovf_url = var.talos_ova_url }
}

output "control_plane_vip" { value = "10.20.0.10" }

cd infra
terraform init
terraform plan -out tf.plan      # Wiz Code gate runs here in CI
terraform apply tf.plan

For substrate that Terraform does not cover well (BIOS/iPXE config on physical hosts, switch ports), Ansible playbooks handle the one-time hardware prep. Note what Ansible does not do here: it never touches a running Talos node — Talos has no SSH for it to reach. Ansible’s job ends at “the machine can iPXE-boot the Talos installer.”

2. Bring up a kind bootstrap cluster and install Cluster API

Cluster API controllers have to run somewhere before your real management cluster exists. Stand up a throwaway kind cluster, then use clusterctl to install core CAPI plus the vSphere and Talos providers into it. The init call is where you wire in the vSphere infrastructure provider and the two Talos-specific providers (bootstrap and control plane).

kind create cluster --name capi-bootstrap

# The Talos bootstrap and control-plane providers are in clusterctl's built-in provider list.
# Pin the vSphere provider version inline, using the name:version form.
clusterctl init \
  --infrastructure vsphere:v1.11.0 \
  --bootstrap talos \
  --control-plane talos

Confirm every controller is Running before going further:

kubectl get pods -A | grep -E 'capi|capv|cabpt|cacppt'
# capi-system                 capi-controller-manager-...            1/1 Running
# capv-system                 capv-controller-manager-...            1/1 Running
# cabpt-system                cabpt-controller-manager-...           1/1 Running   (Talos bootstrap)
# cacppt-system               cacppt-controller-manager-...          1/1 Running   (Talos control plane)
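
The clusterctl init call in this step also needs vSphere credentials for the infrastructure provider. A minimal sketch of clusterctl's own config file follows, assuming a recent clusterctl that reads it from $XDG_CONFIG_HOME/cluster-api/clusterctl.yaml; the account name is hypothetical, and the password placeholder should be filled in from Vault at run time rather than committed.

```yaml
# clusterctl.yaml: variables clusterctl substitutes into provider manifests.
# Both values below are placeholders, not real credentials.
VSPHERE_USERNAME: "svc-capv@vsphere.local"      # hypothetical service account
VSPHERE_PASSWORD: "<injected-from-vault>"       # never commit the real value
```

Exporting the same two names as environment variables before running clusterctl init works as well; clusterctl checks the environment first and the file second.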

3. Generate the Talos machine configuration

Talos is configured by a declarative machine config. Generate a base config for the workload cluster, pinned to the control-plane VIP and a specific Kubernetes version. Treat secrets.yaml as Vault-grade material (see Security): it holds the cluster CA and bootstrap tokens.

talosctl gen secrets -o secrets.yaml          # cluster PKI/bootstrap secrets

talosctl gen config fintech-prod https://10.20.0.10:6443 \
  --with-secrets secrets.yaml \
  --kubernetes-version 1.30.4 \
  --install-disk /dev/sda \
  --config-patch @patches/hardening.yaml

The hardening.yaml patch is where the audit findings get answered declaratively: it turns on kubelet serving-certificate rotation, Talos API RBAC, KubePrism, and Kubernetes API audit logging. The read-only, immutable root filesystem needs no patch, because Talos ships that way. Because the patch is a file in Git, the control proves itself:

# patches/hardening.yaml — applied to every node, no exceptions
machine:
  kubelet:
    extraArgs:
      rotate-server-certificates: "true"
  features:
    rbac: true
    kubePrism:
      enabled: true                 # in-cluster API load balancing
      port: 7445                    # Talos's default KubePrism port, stated explicitly
  install:
    wipe: true                      # wipe the install disk before Talos is installed
cluster:
  apiServer:
    auditPolicy:
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules: [{ level: RequestResponse }]   # very verbose: records full bodies, Secrets included; narrow it per resource for production
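
The topology puts the three control-plane machines behind the VIP 10.20.0.10. One way to provide that address without an external load balancer is Talos's built-in shared virtual IP, sketched below as a control-plane-only patch. The file name is hypothetical, and the selector assumes each control-plane VM has a single physical NIC on a network where the nodes share a layer-2 segment.

```yaml
# patches/controlplane-vip.yaml (hypothetical file name)
# Apply to control-plane machines only; workers must not claim the VIP.
machine:
  network:
    interfaces:
      - deviceSelector:
          physical: true            # assumption: one physical NIC per node
        dhcp: true
        vip:
          ip: 10.20.0.10            # the control-plane VIP reserved in step 1
```

The VIP only comes up once etcd is running, so it suits the Kubernetes API endpoint but should not be used as the talosctl endpoint; point talosctl at individual node IPs instead.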

4. Define the Cluster API resources for a Talos cluster

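Now describe the workload cluster as CAPI objects. The key wiring: the TalosControlPlane and TalosConfigTemplate tell the Talos providers what kind of machine config to generate for each role, and the VSphereMachineTemplate says what hardware each node gets. Because generateType makes the provider generate the config (and its secrets) itself, carry the step 3 hardening over as config patches on these objects rather than expecting the files from step 3 to be picked up automatically. This single manifest is the desired state CAPI reconciles.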

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: fintech-prod-cp
spec:
  replicas: 3
  version: v1.30.4
  controlPlaneConfig:
    controlplane:
      generateType: controlplane    # uses Talos controlplane machine config
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: fintech-prod-cp-vsphere
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: fintech-prod-md-0
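spec:
  clusterName: fintech-prod
  replicas: 5
  selector:
    matchLabels: {}                 # required field; CAPI defaults the actual labels
  template:
    spec:
      clusterName: fintech-prod     # required on the Machine template as well
      version: v1.30.4
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: fintech-prod-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: fintech-prod-worker-vsphere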

Commit this to Git rather than treating a hand-applied copy as the source of truth: step 5 applies it once to bootstrap, and from step 7 onward Argo CD is what keeps it real.
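
The manifest above references several objects it does not define: the Cluster itself, its vSphere infrastructure, the worker TalosConfigTemplate, and the worker VSphereMachineTemplate. A minimal sketch of those companions follows. The vCenter address, thumbprint, inventory names, network name, and credentials Secret are all placeholders, the CIDRs are assumed to be the Talos defaults, and the control-plane VSphereMachineTemplate (fintech-prod-cp-vsphere) would mirror the worker one.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: fintech-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]     # assumed pod CIDR
    services:
      cidrBlocks: ["10.96.0.0/12"]      # assumed service CIDR
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: fintech-prod-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: fintech-prod
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: fintech-prod
spec:
  server: vcenter.example.internal      # hypothetical vCenter address
  thumbprint: "<vcenter-tls-thumbprint>"
  controlPlaneEndpoint:
    host: 10.20.0.10                    # control-plane VIP from step 1
    port: 6443
  identityRef:
    kind: Secret
    name: fintech-prod-vsphere-creds    # hypothetical credentials Secret
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: fintech-prod-workers
spec:
  template:
    spec:
      generateType: worker              # provider generates a worker machine config
      talosVersion: v1.9
      configPatches:                    # JSON patches applied to the generated config
        - op: add
          path: /machine/kubelet/extraArgs
          value:
            rotate-server-certificates: "true"
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: fintech-prod-worker-vsphere
spec:
  template:
    spec:
      server: vcenter.example.internal
      thumbprint: "<vcenter-tls-thumbprint>"
      datacenter: dc1                   # placeholder inventory names
      datastore: ds1
      folder: talos
      resourcePool: talos-pool
      template: talos-v1.9.2-template   # VM template from step 1
      numCPUs: 4
      memoryMiB: 8192
      diskGiB: 40
      network:
        devices:
          - networkName: workload-net   # placeholder port group
            dhcp4: true
```

Keeping these in the same directory as the TalosControlPlane and MachineDeployment means one path in Git describes the whole cluster.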

5. Apply, then pivot the management plane into Talos

Apply the cluster manifest to the bootstrap kind cluster and watch CAPI create real Talos machines. Once nodes register, pivot: move the CAPI controllers from the throwaway kind cluster into a permanent Talos-based management cluster, so even your control plane runs on immutable infrastructure.

kubectl apply -f clusters/fintech-prod/      # bootstrap cluster reconciles it
clusterctl describe cluster fintech-prod     # watch Machines go Provisioning -> Running

# Fetch the new cluster's admin kubeconfig and talosconfig (the Talos providers store the latter as a Secret)
clusterctl get kubeconfig fintech-prod > fintech-prod.kubeconfig
kubectl get secret fintech-prod-talosconfig \
  -o jsonpath='{.data.talosconfig}' | base64 -d > talosconfig

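# Pivot CAPI state from kind into the real management cluster, then delete kind.
# The target cluster must already run the same providers before the move.
clusterctl init --kubeconfig mgmt.kubeconfig \
  --infrastructure vsphere:v1.11.0 --bootstrap talos --control-plane talos
clusterctl move --to-kubeconfig=mgmt.kubeconfig
kind delete cluster --name capi-bootstrap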

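With the TalosControlPlane provider, etcd is normally bootstrapped for you: the controller issues the bootstrap call against the first control-plane node. Talos itself always waits for an explicit bootstrap, by design, so if you bring up nodes outside CAPI, or need to recover a first node that never formed etcd, issue it by hand, exactly once, against a single control-plane node: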

talosctl bootstrap --nodes 10.20.0.11 --talosconfig ./talosconfig

6. Wire identity for humans: Okta/Entra to talosctl and kubectl

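Neither talosctl nor kubectl should use long-lived static credentials for people. Okta (or Microsoft Entra ID) is the workforce identity provider. For kubectl, engineers authenticate once and receive a short-lived OIDC token that the API server validates directly. talosctl does not speak OIDC; its API is mutual-TLS only, so its access is federated indirectly by gating the issuance of short-lived client certificates behind the IdP. Either way there are no shared admin certs to leak, and access is revoked centrally when someone leaves.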

# kube-apiserver OIDC, set via the Talos cluster machine config patch
cluster:
  apiServer:
    extraArgs:
      oidc-issuer-url: "https://kloudvin.okta.com/oauth2/default"
      oidc-client-id: "0oaXXXXkube"
      oidc-username-claim: "email"
      oidc-groups-claim: "groups"

# Engineers log in through the IdP; the kubelogin (kubectl oidc-login) plugin handles the token exchange.
kubectl oidc-login setup \
  --oidc-issuer-url=https://kloudvin.okta.com/oauth2/default \
  --oidc-client-id=0oaXXXXkube

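Kubernetes RBAC then binds Okta/Entra group claims (not individuals) to roles, so app teams get namespaced kubectl access only. talosctl rights are not governed by Kubernetes RBAC: the Talos API authenticates with mutual TLS and reads the caller's role (os:admin, os:operator, or os:reader) from the client certificate. Issue those client certs from Vault (step 8), short-lived and limited to platform engineers, rather than handing out the static admin talosconfig.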
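
The group-to-role binding described above can be sketched as two RBAC objects. The group names and the payments namespace are hypothetical; they must match whatever the IdP actually emits in the groups claim configured on the API server.

```yaml
# Platform engineers: cluster-wide admin, bound to an IdP group rather than to people
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers          # hypothetical Okta/Entra group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
# App team: the built-in "edit" role, scoped to its own namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-team-edit
  namespace: payments                 # hypothetical namespace
subjects:
  - kind: Group
    name: payments-developers         # hypothetical Okta/Entra group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```

Because the subjects are groups, offboarding someone is an IdP change; nothing in the cluster has to be edited.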

7. GitOps the whole thing with Argo CD

Install Argo CD on the management cluster and point it at the Git repo holding both the CAPI manifests and the workload addons. This closes the loop: a node’s existence, version, and config are now reconciled from Git, and a human SSH’ing to “fix” something is not just discouraged — it is impossible on Talos, so Git is the only path. Jenkins or GitHub Actions runs CI on that repo (lint, kubeconform, policy checks, Wiz Code scan) before changes merge.

kubectl create namespace argocd
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd --set configs.params."server\.insecure"=false

# App-of-apps: one Argo Application that owns the cluster + its addons
argocd app create fintech-prod-cluster \
  --repo https://git.kloudvin.com/platform/clusters.git \
  --path clusters/fintech-prod \
  --dest-server https://kubernetes.default.svc \
  --sync-policy automated --self-heal --auto-prune
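
The argocd app create call above is imperative, which sits oddly in a GitOps guide. A declarative equivalent, as a sketch, is an Application manifest committed to the same repo; the project, target revision, and destination namespace below are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fintech-prod-cluster
  namespace: argocd
spec:
  project: default                    # assumption: the default AppProject
  source:
    repoURL: https://git.kloudvin.com/platform/clusters.git
    path: clusters/fintech-prod
    targetRevision: HEAD              # assumption: track the default branch
  destination:
    server: https://kubernetes.default.svc
    namespace: default                # assumption: CAPI objects live in "default"
  syncPolicy:
    automated:
      prune: true                     # same effect as --auto-prune
      selfHeal: true                  # same effect as --self-heal
```

With this in Git, even the definition of what Argo CD watches is reviewed and versioned like everything else.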

A node upgrade is now a pull request that bumps version: v1.30.4 to v1.31.x; on merge, Argo syncs the manifest, CAPI rolls the MachineDeployment one node at a time, and each new node boots a fresh immutable image. ServiceNow sits in front of production syncs as the change gate — the Argo sync for the prod cluster requires an approved change record, so security and ops have a documented, auditable approval, not just a Git push.
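
The one-node-at-a-time behavior comes from the MachineDeployment's rollout strategy. A sketch of stating it explicitly, rather than relying on defaults, is the excerpt below; merge it into the fintech-prod-md-0 manifest from step 4.

```yaml
# Excerpt: add under the existing MachineDeployment spec
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: fintech-prod-md-0
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                     # create one new node before removing an old one
      maxUnavailable: 0               # never drop below the desired worker count
```

Surging first costs one extra VM during a rollout, in exchange for never running below capacity.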

8. Layer secrets, security, and observability onto the cluster

With clusters reconciling from Git, install the enterprise agents as Argo-managed addons. Each tool has one concrete job here:

# HashiCorp Vault: dynamic secrets + the PKI that issues talosctl/kubelet certs
helm install vault hashicorp/vault -n vault --create-namespace \
  --set "injector.enabled=true" \
  --set "server.ha.enabled=true"
# Workloads get short-lived secrets via the Vault Agent sidecar; talosctl client
# certs are issued from a Vault PKI role, so no static admin cert is ever stored.

# CrowdStrike Falcon: runtime threat detection on every node + container
helm install falcon-sensor crowdstrike/falcon-sensor -n falcon-system --create-namespace \
  --set falcon.cid="$FALCON_CID"
# Falcon runs as a DaemonSet; on an SSH-less OS, runtime EDR is how you'd
# even notice an in-container compromise. Detections feed the SOC.

# Dynatrace (or Datadog): full-stack observability + tracing
helm install dynatrace-operator dynatrace/dynatrace-operator -n dynatrace --create-namespace \
  --set apiUrl="$DT_API_URL"
# OneAgent collects node/pod/trace telemetry; Davis flags anomalies. Swap for
# the Datadog Agent + cluster-agent if that's the house standard.
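
Two details are easy to miss when these agents land on Talos. Recent Talos releases enable Pod Security admission at the baseline level by default, so namespaces that run privileged node agents usually need an explicit label. Workloads also opt in to Vault Agent injection through pod annotations. The sketch below shows both; the payments-api names, the Vault role, the secret path, and the image are hypothetical.

```yaml
# Namespace for a privileged node agent (the same label applies to "dynatrace")
apiVersion: v1
kind: Namespace
metadata:
  name: falcon-system
  labels:
    pod-security.kubernetes.io/enforce: privileged
---
# Workload that receives a short-lived secret from the Vault Agent sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                  # hypothetical workload
  namespace: payments
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "payments-api"          # hypothetical Vault role
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/payments-api"
    spec:
      serviceAccountName: payments-api
      containers:
        - name: app
          image: registry.example.internal/payments-api:1.0.0   # placeholder image
```

The injected secret appears as a file under /vault/secrets/db-creds in the pod, so the application never handles a Vault token itself.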

Posture is continuous, not a point check: Wiz scans the live cluster and cloud account for misconfigurations and attack paths, while Wiz Code has already gated the Terraform and Kubernetes manifests in CI. Together they assert that the immutability and least-privilege controls actually hold in production, not just on paper.

For public-facing workloads, Akamai terminates TLS and provides WAF/anycast at the edge in front of the cluster’s ingress, so raw node IPs are never exposed.

If you run an internal Moodle for the platform team’s runbooks and Talos/CAPI training, deploy it as just another Argo-managed app on a worker MachineDeployment. It is proof that the same immutable substrate carries ordinary stateful apps.

Legacy virtual appliances that cannot be containerized (a hardware-tied load balancer, an old IDS) stay on the vSphere substrate beside the cluster and are wired in at the network layer, since you cannot install them onto a closed Talos node.

Validation

Prove the cluster is healthy, immutable, and has no interactive login path.

# 1. All CAPI machines reconciled and Running
clusterctl describe cluster fintech-prod
kubectl get machines -o wide        # every Machine should be Running

# 2. Kubernetes nodes Ready, all on Talos
kubectl --kubeconfig fintech-prod.kubeconfig get nodes -o wide
# OS-IMAGE column reads "Talos (v1.9.2)" on every node

# 3. Talos health (etcd quorum, services, control-plane); run it against one node
talosctl --nodes 10.20.0.11 health \
  --control-plane-nodes 10.20.0.11,10.20.0.12,10.20.0.13

# 4. Prove immutability: there is no shell to exec into
talosctl --nodes 10.20.0.11 list /bin    # minimal; no bash, no apt, no ssh
ssh 10.20.0.11                           # connection refused — by design

# 5. Argo CD reports the cluster app Synced/Healthy
argocd app get fintech-prod-cluster

A green run here is the audit answer: nodes are Talos, reconciled by CAPI from Git, with no interactive access path.

Rollback and teardown

Because every node is disposable and the state lives in Git and CAPI, rollback is a controlled operation, not a rescue mission.

# Roll back a bad node upgrade: revert the version bump in Git; Argo + CAPI
# replace the rolled nodes with the previous immutable image automatically.
git revert <bad-commit> && git push    # Argo auto-syncs; MachineDeployment rolls back

# Drain and replace a single suspect node (CAPI provisions a fresh one)
kubectl delete machine fintech-prod-md-0-abc12   # controller creates a replacement

# Full teardown of a workload cluster (from the management cluster)
kubectl delete cluster fintech-prod    # CAPI deprovisions all machines + infra

# Tear down the management plane last
clusterctl delete --all
terraform -chdir=infra destroy

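Never “fix” a node by reverting it in place; there is nothing to revert into. The correct rollback is always “replace with the known-good image,” which is exactly what git revert + CAPI does. One caveat: this holds cleanly for worker MachineDeployments. Kubernetes does not support downgrading a control plane to an earlier minor version, so prove control-plane version bumps in a non-production cluster first and take an etcd snapshot before rolling them.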
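
Replacing a suspect node does not have to wait for a human to run kubectl delete machine. A MachineHealthCheck lets CAPI do the replacement itself; the sketch below targets the worker MachineDeployment, with timeouts and the unhealthy threshold chosen as illustrative assumptions.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: fintech-prod-md-0-health
spec:
  clusterName: fintech-prod
  maxUnhealthy: 40%                   # stop remediating if too many nodes fail at once
  nodeStartupTimeout: 10m             # assumption: a node should register within 10 minutes
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: fintech-prod-md-0
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```

The maxUnhealthy guard matters: if a network fault makes most nodes look unhealthy, CAPI pauses instead of deleting the fleet.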

Common pitfalls

Security notes

The whole point is reduced attack surface, so do not undo it. Talos has no SSH, no shell, and a read-only immutable rootfs; keep it that way, and never enable a debug shell in production. Human access to kubectl federates through Okta/Entra ID with short-lived OIDC tokens and group-based RBAC. Human access to talosctl uses short-lived client certificates issued from HashiCorp Vault PKI, so there are no shared admin certs. The Talos secrets.yaml (cluster CA, bootstrap tokens) is generated by talosctl and stored in Vault rather than committed anywhere. CrowdStrike Falcon provides the runtime EDR that an SSH-less host still needs at the container layer, feeding detections to the SOC. Wiz continuously verifies posture on the live cluster while Wiz Code gates the Terraform and manifests in CI, so the immutability and least-privilege guarantees are independently checked. Enable the Kubernetes audit policy (step 3) and ship those logs to Dynatrace/Datadog. Production changes pass a ServiceNow change gate before Argo syncs them.

Cost notes

Immutable infrastructure is a cost lever, not just a security one. Talos is free and open source with a tiny footprint (~80 MB), so it runs the same node on smaller VMs than a full Ubuntu image and packs more pods per host. Cluster API standardizes node images, which kills the snowflake-driven over-provisioning where each team sized hosts defensively. Rebuild-don’t-patch means no maintenance windows and no patch labor — upgrades are a PR, not a night of SSH. Right-size with MachineDeployment replicas tied to real utilization and let the cluster autoscaler adjust workers. Watch the spend on the surrounding commercial tools — CrowdStrike, Dynatrace/Datadog, Wiz, and Vault Enterprise are typically priced per node/host/workload, so a fleet that scales horizontally scales those bills too; meter them per cluster in your observability tool and charge back to the teams that drive the node count.
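
Letting the cluster autoscaler adjust workers, as suggested above, is wired up through annotations on the MachineDeployment when the autoscaler runs with its Cluster API provider. A sketch follows, with the size bounds as illustrative assumptions.

```yaml
# Excerpt: annotations the cluster autoscaler's Cluster API provider reads
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: fintech-prod-md-0
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "3"    # assumed floor
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"   # assumed ceiling
```

Once the autoscaler owns the replica count, have Argo CD ignore differences on spec.replicas for this object; otherwise self-heal and the autoscaler will keep undoing each other.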
