A fintech platform team is tired of pet nodes. Their Kubernetes fleet runs on a mix of bare-metal in a colo and VMs on vSphere, and every node is a slightly different snowflake: someone SSH’d in last quarter to patch a CVE, the kubelet flags drifted, and a security audit found an interactive shell, a package manager, and sixty unaccounted-for packages on every host. The auditors flagged it; the CISO wants nodes that cannot be logged into, cannot drift, and can be rebuilt identically from a Git commit. The ask is concrete: replace SSH-managed Ubuntu nodes with Talos Linux — a minimal, immutable, API-only OS purpose-built for Kubernetes — and manage the whole node lifecycle declaratively through Cluster API (CAPI), so a node is a reconciled resource, not a server someone tends. This guide walks the full build: a CAPI management cluster, a Talos-backed workload cluster, GitOps for both, and the identity, secrets, security, and observability tooling a regulated shop needs around it.
Talos earns its place here because it removes the attack surface that started the audit. There is no SSH, no shell, no systemd, no apt/yum, no /bin/bash: the host exposes only a gRPC API, driven by the talosctl client and secured by mutual TLS. Configuration is a single declarative YAML machine config; you cannot “fix one node by hand,” because there is no hand to fix it with. Cluster API then treats that machine config as the desired state and reconciles Machine, MachineDeployment, and TalosControlPlane objects (the Talos counterpart of KubeadmControlPlane) into real, immutable nodes. Together they turn “patch the fleet” into “change a value and let the controller roll it.”
Prerequisites
- A bootstrap/management environment: a Linux workstation or a small kind cluster (Docker 24+, 8 GB RAM) to host the CAPI controllers initially, then a permanent management cluster to pivot into.
- CLI tools: kubectl 1.30+, talosctl v1.9+, clusterctl v1.9+, kind v0.24+, helm 3.15+, argocd CLI, and terraform 1.9+.
- Infrastructure target for this guide: bare-metal/vSphere via the CAPI providers infrastructure-vsphere (CAPV) and the Talos providers bootstrap-talos (CABPT) + control-plane-talos (CACPPT). The same flow applies to AWS (CAPA) or Azure (CAPZ) by swapping the infra provider.
- A reachable control-plane VIP (a free IP for the API server, e.g. 10.20.0.10) and DHCP or static IPs for nodes.
- An OCI registry you control (Harbor/ECR/ACR) and a Git repo for the GitOps source of truth.
- Out of band: an Okta or Microsoft Entra ID tenant for human SSO, a HashiCorp Vault instance reachable from the management cluster, and admin rights to install agents (CrowdStrike, Dynatrace).
Target topology
The build has two clusters and one Git repo. A small, long-lived management cluster runs the Cluster API controllers (core CAPI, the vSphere infra provider, and the Talos bootstrap + control-plane providers). It owns the lifecycle of one or more workload clusters, each made entirely of Talos Linux nodes — three control-plane machines behind a VIP and a MachineDeployment of worker machines. Everything that defines a cluster — the CAPI manifests, the Talos machine config patches, and the workload addons — lives in Git and is reconciled by Argo CD running on the management cluster. Around the edges sit the enterprise services: Okta/Entra ID federates human access to both talosctl and kubectl; HashiCorp Vault issues the secrets and PKI; Wiz/Wiz Code scans posture and IaC; CrowdStrike Falcon watches runtime; Dynatrace/Datadog observes; ServiceNow gates changes; Jenkins/GitHub Actions runs CI; Terraform/Ansible provisions the substrate; and Akamai fronts public ingress. The defining property is immutability: a node is never modified in place — to change it, you change its config in Git and CAPI rolls a replacement.
1. Provision the substrate with Terraform
Talos nodes need somewhere to boot. Use Terraform to stand up the management cluster’s host (or kind on a builder), the workload-cluster networks, the control-plane VIP reservation, and DNS — keeping the substrate itself declarative and reviewable. Wiz Code scans this IaC in the pull request for misconfigurations (open security groups, public buckets) before it ever applies.
# infra/vsphere.tf: Talos VM template plus the control-plane VIP output
# (the vsphere_resource_pool / vsphere_datastore data sources and var.talos_ova_url
# are assumed to be declared elsewhere in infra/)
resource "vsphere_virtual_machine" "talos_template" {
  name             = "talos-v1.9.2-template"
  resource_pool_id = data.vsphere_resource_pool.pool.id
  datastore_id     = data.vsphere_datastore.ds.id
  num_cpus         = 4
  memory           = 8192
  guest_id         = "other5xLinux64Guest"

  # OVA built from the official Talos vSphere image (factory.talos.dev)
  ovf_deploy {
    remote_ovf_url = var.talos_ova_url
  }
}

output "control_plane_vip" {
  value = "10.20.0.10"
}
cd infra
terraform init
terraform plan -out tf.plan # Wiz Code gate runs here in CI
terraform apply tf.plan
For substrate that Terraform does not cover well (BIOS/iPXE config on physical hosts, switch ports), Ansible playbooks handle the one-time hardware prep. Note what Ansible does not do here: it never touches a running Talos node — Talos has no SSH for it to reach. Ansible’s job ends at “the machine can iPXE-boot the Talos installer.”
2. Bring up a kind bootstrap cluster and install Cluster API
Cluster API controllers have to run somewhere before your real management cluster exists. Stand up a throwaway kind cluster, then use clusterctl to install core CAPI plus the vSphere and Talos providers into it. The init call is where you wire in the vSphere infrastructure provider and the two Talos-specific providers (bootstrap and control plane).
kind create cluster --name capi-bootstrap
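# CAPV reads the vCenter credentials from the environment at init time (placeholders shown).
export VSPHERE_USERNAME='<vcenter-user>'
export VSPHERE_PASSWORD='<vcenter-password>'
# The Talos bootstrap and control-plane providers are in clusterctl's built-in provider list.
# Pin the infrastructure provider with the name:version form.
clusterctl init \
  --infrastructure vsphere:v1.11.0 \
  --bootstrap talos \
  --control-plane talos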
Confirm every controller is Running before going further:
kubectl get pods -A | grep -E 'capi|capv|cabpt|cacppt'
# capi-system capi-controller-manager-... 1/1 Running
# capv-system capv-controller-manager-... 1/1 Running
# cabpt-system cabpt-controller-manager-... 1/1 Running (Talos bootstrap)
# cacppt-system cacppt-controller-manager-... 1/1 Running (Talos control plane)
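Watching the pod list by eye does not work in a pipeline. A minimal sketch of a blocking readiness check, assuming the four namespaces listed above; the function name wait_for_capi and the 300-second timeout are illustrative choices, not part of Cluster API:

```shell
# Block until every provider Deployment reports Available, or fail fast.
# Namespaces match the controller listing above; the timeout is an assumption.
wait_for_capi() {
  for ns in capi-system capv-system cabpt-system cacppt-system; do
    kubectl wait --for=condition=Available deployment --all \
      --namespace "$ns" --timeout=300s || {
        echo "controllers in $ns are not ready" >&2
        return 1
      }
  done
}
```

Call wait_for_capi right after clusterctl init and continue only if it returns zero.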
3. Generate the Talos machine configuration
Talos is configured by a declarative machine config. Generate a base config for the workload cluster, pinned to the control-plane VIP and a specific Kubernetes version. Treat secrets.yaml as Vault-grade material (see Security): it holds the cluster CA and bootstrap tokens.
talosctl gen secrets -o secrets.yaml # cluster PKI/bootstrap secrets
talosctl gen config fintech-prod https://10.20.0.10:6443 \
--with-secrets secrets.yaml \
--kubernetes-version 1.30.4 \
--install-disk /dev/sda \
--config-patch @patches/hardening.yaml
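The hardening.yaml patch is where the audit findings get answered declaratively: it turns on kubelet serving-certificate rotation, Talos API RBAC, and KubePrism, and it enables API-server audit logging. The read-only root filesystem needs no setting; that is how Talos ships. Because the patch is a file in Git, the control is reviewable and its commit history is the evidence: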
# patches/hardening.yaml: applied to every node, no exceptions
machine:
  kubelet:
    extraArgs:
      rotate-server-certificates: "true"
  features:
    rbac: true            # Talos API role-based access control
    kubePrism:
      enabled: true       # in-cluster API load balancing
      port: 7445          # the usual KubePrism port
  install:
    wipe: true            # wipe the install disk when Talos is installed
cluster:
  apiServer:
    auditPolicy:
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        # RequestResponse records full bodies, Secrets included; narrow it before production
        - level: RequestResponse
4. Define the Cluster API resources for a Talos cluster
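Now describe the workload cluster as CAPI objects. The key wiring: the TalosControlPlane and TalosConfigTemplate tell the Talos providers which kind of machine config to generate (generateType), with the step 3 hardening carried over through their config patches, and the VSphereMachineTemplate says what hardware each node gets. The excerpt shows the two objects that drive the rollout; the Cluster, VSphereCluster, TalosConfigTemplate, and VSphereMachineTemplate objects they reference belong in the same directory. Together they are the desired state CAPI reconciles.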
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: fintech-prod-cp
spec:
  replicas: 3
  version: v1.30.4
  controlPlaneConfig:
    controlplane:
      generateType: controlplane # uses Talos controlplane machine config
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereMachineTemplate
    name: fintech-prod-cp-vsphere
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: fintech-prod-md-0
spec:
  clusterName: fintech-prod
  replicas: 5
  selector:
    matchLabels: null # filled in by the CAPI defaulting webhook
  template:
    spec:
      clusterName: fintech-prod
      version: v1.30.4
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: fintech-prod-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: fintech-prod-worker-vsphere
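Commit this to Git rather than treating a manual apply as the source of truth. The one-time apply in step 5 only gets the first cluster off the ground; from step 7 onward Argo CD is what makes it real.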
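The manifest above names a cluster (fintech-prod) and a TalosConfigTemplate (fintech-prod-workers) without defining either. A minimal sketch of both objects, written into the Git path that step 5 applies. The pod and service CIDRs, the VSphereCluster name, and the talosVersion value are assumptions for illustration, and the VSphereCluster and VSphereMachineTemplate objects still have to be added next to them:

```shell
# Write the parent Cluster and the worker TalosConfigTemplate into the repo.
mkdir -p clusters/fintech-prod
cat > clusters/fintech-prod/cluster.yaml <<'EOF'
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: fintech-prod
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]   # assumed pod CIDR
    services:
      cidrBlocks: ["10.96.0.0/12"]    # assumed service CIDR
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: fintech-prod-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: fintech-prod                # assumed name; define it in the same directory
---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: TalosConfigTemplate
metadata:
  name: fintech-prod-workers
spec:
  template:
    spec:
      generateType: worker            # worker machine config
      talosVersion: v1.9              # assumed; match the Talos release you boot
EOF
```

Keeping one file per cluster directory means a reviewer sees the whole desired state of fintech-prod in a single pull request.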
5. Apply, then pivot the management plane into Talos
Apply the cluster manifest to the bootstrap kind cluster and watch CAPI create real Talos machines. Once nodes register, pivot: move the CAPI controllers from the throwaway kind cluster into a permanent Talos-based management cluster, so even your control plane runs on immutable infrastructure.
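kubectl apply -f clusters/fintech-prod/ # bootstrap cluster reconciles it
clusterctl describe cluster fintech-prod # watch Machines go Provisioning -> Running
# Fetch the talosconfig that the Talos providers store as a Secret (by convention <cluster>-talosconfig)
kubectl get secret fintech-prod-talosconfig -o jsonpath='{.data.talosconfig}' | base64 -d > talosconfig
Talos waits for an explicit etcd bootstrap, by design. With a TalosControlPlane, the control-plane provider normally issues that call on the first control-plane node for you. If the Machines stay in Provisioning and etcd never forms, bootstrap manually, exactly once, against a single node:
talosctl bootstrap --nodes 10.20.0.11 --endpoints 10.20.0.11 --talosconfig ./talosconfig
Once every Machine is Running, fetch the admin kubeconfig and pivot:
clusterctl get kubeconfig fintech-prod > fintech-prod.kubeconfig
# The target cluster needs the same providers installed before it can accept the move
clusterctl init --kubeconfig mgmt.kubeconfig --infrastructure vsphere --bootstrap talos --control-plane talos
# Pivot CAPI state from kind into the real management cluster, then delete kind
clusterctl move --to-kubeconfig=mgmt.kubeconfig
kind delete cluster --name capi-bootstrap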
6. Wire identity for humans: Okta/Entra to talosctl and kubectl
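Neither talosctl nor kubectl should use long-lived static credentials for people. Tie both to your IdP. Okta (or Microsoft Entra ID) is the workforce identity provider: for kubectl, engineers authenticate once and receive a short-lived OIDC token. talosctl has no OIDC login of its own, so its mutual-TLS client certificates are issued short-lived through Vault, with the Vault login federated to the same IdP. Either way there are no shared admin certs to leak, and access is revoked centrally when someone leaves.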
# kube-apiserver OIDC, set via the Talos cluster machine config patch
cluster:
  apiServer:
    extraArgs:
      oidc-issuer-url: "https://kloudvin.okta.com/oauth2/default"
      oidc-client-id: "0oaXXXXkube"
      oidc-username-claim: "email"
      oidc-groups-claim: "groups"
# Engineers log in through the IdP; the plugin handles the token exchange.
kubectl oidc-login setup \
--oidc-issuer-url=https://kloudvin.okta.com/oauth2/default \
--oidc-client-id=0oaXXXXkube
Kubernetes RBAC then binds Okta/Entra group claims (not individuals) to roles, so app teams get namespaced kubectl access only. talosctl access is separate: it is mutual TLS, and the Talos API role (os:admin, os:operator, or os:reader) travels in the client certificate. Platform engineers therefore get machine-config rights through client certs issued from Vault (step 8) rather than through the static talosconfig.
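As a sketch of the group-to-role binding described above: the manifest below grants one IdP group cluster-admin and another the built-in edit role in a single namespace. The group names platform-engineers and payments-team and the payments namespace are hypothetical. Because the API server flags above set no group prefix, the names must match the values in the groups claim exactly.

```shell
# Write group-based RBAC for the workload cluster into the repo.
mkdir -p clusters/fintech-prod/rbac
cat > clusters/fintech-prod/rbac/groups.yaml <<'EOF'
# Platform engineers: cluster-wide admin via the built-in cluster-admin role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers        # hypothetical IdP group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
# App team: edit rights in its own namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-team-edit
  namespace: payments               # hypothetical namespace
subjects:
  - kind: Group
    name: payments-team             # hypothetical IdP group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
EOF
```

Offboarding then happens in the IdP: removing someone from the group removes the access, with no binding to edit.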
7. GitOps the whole thing with Argo CD
Install Argo CD on the management cluster and point it at the Git repo holding both the CAPI manifests and the workload addons. This closes the loop: a node’s existence, version, and config are now reconciled from Git, and a human SSH’ing to “fix” something is not just discouraged — it is impossible on Talos, so Git is the only path. Jenkins or GitHub Actions runs CI on that repo (lint, kubeconform, policy checks, Wiz Code scan) before changes merge.
helm repo add argo https://argoproj.github.io/argo-helm
kubectl create namespace argocd
helm install argocd argo/argo-cd -n argocd --set configs.params."server\.insecure"=false
# App-of-apps: one Argo Application that owns the cluster + its addons
# (--dest-namespace is the namespace the CAPI objects land in; default is assumed here)
argocd app create fintech-prod-cluster \
  --repo https://git.kloudvin.com/platform/clusters.git \
  --path clusters/fintech-prod \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default \
  --sync-policy automated --self-heal --auto-prune
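The same Application can also live in Git as a manifest, so the Argo CD configuration is reviewed like everything else. A sketch equivalent to the CLI call above, assuming the main branch, the default Argo CD project, and the default namespace for the CAPI objects; the argocd/ directory is a hypothetical location in the repo:

```shell
# Declare the Argo CD Application as a file instead of a one-off CLI call.
mkdir -p argocd
cat > argocd/fintech-prod-cluster.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fintech-prod-cluster
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kloudvin.com/platform/clusters.git
    path: clusters/fintech-prod
    targetRevision: main            # assumed branch name
  destination:
    server: https://kubernetes.default.svc
    namespace: default              # assumed namespace for the CAPI objects
  syncPolicy:
    automated:
      prune: true                   # same as --auto-prune
      selfHeal: true                # same as --self-heal
EOF
```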
A node upgrade is now a pull request that bumps version: v1.30.4 to v1.31.x; on merge, Argo syncs the manifest, CAPI rolls the MachineDeployment one node at a time, and each new node boots a fresh immutable image. ServiceNow sits in front of production syncs as the change gate — the Argo sync for the prod cluster requires an approved change record, so security and ops have a documented, auditable approval, not just a Git push.
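Since upgrades arrive as version bumps in a pull request, CI can reject a bump that skips a minor release, which upstream Kubernetes upgrades do not support. A minimal sketch in plain shell; the function names minor_of and check_bump are hypothetical, and how the pipeline extracts the old and new values from the diff is left out:

```shell
# minor_of prints the minor component of a version such as v1.30.4 or 1.30.4.
minor_of() {
  echo "$1" | sed 's/^v//' | cut -d. -f2
}

# check_bump OLD NEW fails when NEW is more than one minor release ahead of OLD.
# Patch bumps, single-minor bumps, and reverts pass through.
check_bump() {
  old_minor=$(minor_of "$1")
  new_minor=$(minor_of "$2")
  step=$((new_minor - old_minor))
  if [ "$step" -gt 1 ]; then
    echo "refusing $1 -> $2: skips $((step - 1)) minor release(s)" >&2
    return 1
  fi
  return 0
}
```

For example, check_bump v1.30.4 v1.31.2 succeeds, while check_bump v1.30.4 v1.32.0 fails and the pipeline should stop the merge.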
8. Layer secrets, security, and observability onto the cluster
With clusters reconciling from Git, install the enterprise agents as Argo-managed addons. Each tool has one concrete job here:
# HashiCorp Vault: dynamic secrets + the PKI that issues talosctl/kubelet certs
helm install vault hashicorp/vault -n vault --create-namespace \
  --set "injector.enabled=true" \
  --set "server.ha.enabled=true"
# Workloads get short-lived secrets via the Vault Agent sidecar; talosctl client
# certs are issued from a Vault PKI role, so no static admin cert is ever stored.
# CrowdStrike Falcon: runtime threat detection on every node + container
helm install falcon-sensor crowdstrike/falcon-sensor -n falcon-system --create-namespace \
  --set "falcon.cid=$FALCON_CID"
# Falcon runs as a DaemonSet; on an SSH-less OS, runtime EDR is how you'd
# even notice an in-container compromise. Detections feed the SOC.
# Dynatrace (or Datadog): full-stack observability + tracing
helm install dynatrace-operator dynatrace/dynatrace-operator -n dynatrace --create-namespace \
  --set "apiUrl=$DT_API_URL"
# OneAgent collects node/pod/trace telemetry; Davis flags anomalies. Swap for
# the Datadog Agent + cluster-agent if that's the house standard.
Posture is continuous, not a point check: Wiz scans the live cluster and cloud account for misconfigurations and attack paths, while Wiz Code has already gated the Terraform and Kubernetes manifests in CI. Together they assert that the immutability and least-privilege controls actually hold in production, not just on paper.
For public-facing workloads, Akamai terminates TLS and provides WAF/anycast at the edge in front of the cluster’s ingress, so raw node IPs are never exposed.
If you run an internal Moodle for the platform team’s runbooks and Talos/CAPI training, deploy it as just another Argo-managed app on a worker MachineDeployment: proof that the same immutable substrate carries ordinary stateful apps. Legacy virtual appliances that cannot be containerized (a hardware-tied load balancer, an old IDS) stay on the vSphere substrate beside the cluster and are wired in at the network layer, since you cannot install them onto a closed Talos node.
Validation
Prove the cluster is healthy, immutable, and actually un-loggable-into.
# 1. All CAPI machines reconciled and Running
clusterctl describe cluster fintech-prod
kubectl get machines -o wide # every Machine should be Running
# 2. Kubernetes nodes Ready, all on Talos
kubectl --kubeconfig fintech-prod.kubeconfig get nodes -o wide
# OS-IMAGE column reads "Talos (v1.9.2)" on every node
# 3. Talos health (etcd quorum, services, control-plane)
talosctl --talosconfig ./talosconfig --nodes 10.20.0.11 health # run against one control-plane node; it checks the whole cluster
# 4. Prove immutability: there is no shell to exec into
talosctl --nodes 10.20.0.11 list /bin # minimal; no bash, no apt, no ssh
ssh 10.20.0.11 # connection refused — by design
# 5. Argo CD reports the cluster app Synced/Healthy
argocd app get fintech-prod-cluster
A green run here is the audit answer: nodes are Talos, reconciled by CAPI from Git, with no interactive access path.
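For an audit trail, check 2 can be turned into a pass/fail assertion rather than a column someone reads. A minimal sketch, assuming the OS image strings start with "Talos" as shown above; the function name all_nodes_talos is hypothetical:

```shell
# all_nodes_talos reads one OS image string per line on stdin and fails
# if any non-empty line does not start with "Talos".
all_nodes_talos() {
  bad=0
  while IFS= read -r image; do
    [ -z "$image" ] && continue
    case "$image" in
      Talos*) ;;
      *) echo "non-Talos node image: $image" >&2; bad=1 ;;
    esac
  done
  return "$bad"
}

# Usage against the workload cluster (needs kubectl and the kubeconfig from step 5):
#   kubectl --kubeconfig fintech-prod.kubeconfig get nodes \
#     -o jsonpath='{range .items[*]}{.status.nodeInfo.osImage}{"\n"}{end}' | all_nodes_talos
```

The exit status is what a CI job or compliance check records: zero only when every node reports a Talos image.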
Rollback and teardown
Because every node is disposable and the state lives in Git and CAPI, rollback is a controlled operation, not a rescue mission.
# Roll back a bad node upgrade: revert the version bump in Git; Argo + CAPI
# replace the rolled nodes with the previous immutable image automatically.
git revert <bad-commit> && git push # Argo auto-syncs; MachineDeployment rolls back
# Drain and replace a single suspect node (CAPI provisions a fresh one)
kubectl delete machine fintech-prod-md-0-abc12 # controller creates a replacement
# Full teardown of a workload cluster (from the management cluster)
kubectl delete cluster fintech-prod # CAPI deprovisions all machines + infra
# Tear down the management plane last
clusterctl delete --all
terraform -chdir=infra destroy
Never “fix” a node by reverting it in place — there is nothing to revert into. The correct rollback is always “replace with the known-good image,” which is exactly what git revert + CAPI does.
Common pitfalls
- Bootstrapping etcd too early or twice. talosctl bootstrap must run on exactly one control-plane node, once. Running it on multiple nodes splits etcd. Wait for the first node to be reachable, bootstrap it alone, then let the rest join.
- Forgetting the install disk. If --install-disk does not match the node’s actual disk (/dev/sda vs /dev/nvme0n1), nodes boot the installer forever. Confirm with talosctl get disks --insecure against a node in maintenance mode.
- Editing config by hand instead of via Git. talosctl apply-config works, but any out-of-band change is drift Argo/CAPI will eventually fight or overwrite. Make every change a commit.
- Wrong provider versions. The Talos bootstrap/control-plane providers track specific CAPI contract versions. A clusterctl core upgrade without matching CABPT/CACPPT versions breaks reconciliation, so pin all of them.
- VIP not actually highly available. If the control-plane VIP is a single static IP with no failover, losing that node loses the API. Use Talos’s built-in VIP or an external LB; enable kubePrism for in-cluster API resilience.
- Skipping clusterctl move before deleting kind. Delete the bootstrap kind cluster before pivoting and you orphan the workload cluster’s CAPI state. Always move first.
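For the VIP pitfall, Talos’s built-in shared VIP is a machine config setting on the control-plane nodes. A minimal sketch of such a patch, assuming the NIC is named eth0 (check the real link name on your hardware) and reusing the VIP from the prerequisites. The file name is hypothetical, and how the patch is attached (for example talosctl gen config --config-patch-control-plane, or the config patches of the TalosControlPlane) depends on your flow:

```shell
# Write a control-plane-only patch that enables the Talos shared VIP.
mkdir -p patches
cat > patches/controlplane-vip.yaml <<'EOF'
machine:
  network:
    interfaces:
      - interface: eth0       # assumed NIC name
        dhcp: true
        vip:
          ip: 10.20.0.10      # the control-plane VIP from the prerequisites
EOF
```

Apply it to control-plane machines only; workers should reach the API through the VIP, not hold it.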
Security notes
The whole point is reduced attack surface, so do not undo it. Talos has no SSH, no shell, and a read-only immutable rootfs. Keep it that way; never enable a debug shell in production. Human access to kubectl federates through Okta/Entra ID with short-lived OIDC tokens and group-based RBAC, and talosctl access uses short-lived client certificates issued from HashiCorp Vault PKI, so there are no shared admin certs. The Talos secrets.yaml (cluster CA, bootstrap tokens) is generated once by talosctl and stored in Vault rather than committed anywhere. CrowdStrike Falcon provides the runtime EDR that an SSH-less host still needs at the container layer, feeding detections to the SOC. Wiz continuously verifies posture on the live cluster while Wiz Code gates the Terraform and manifests in CI, so the immutability and least-privilege guarantees are independently checked. Enable the Kubernetes audit policy (step 3) and ship those logs to Dynatrace/Datadog. Production changes pass a ServiceNow change gate before Argo syncs them.
Cost notes
Immutable infrastructure is a cost lever, not just a security one. Talos is free and open source with a tiny footprint (~80 MB), so the same workload fits on smaller VMs than a full Ubuntu image needs, and more of each host is left for pods. Cluster API standardizes node images, which kills the snowflake-driven over-provisioning where each team sized hosts defensively. Rebuild-don’t-patch removes per-host patch labor and shrinks maintenance windows: an upgrade is a PR, not a night of SSH. Right-size with MachineDeployment replicas tied to real utilization and let the cluster autoscaler adjust workers. Watch the spend on the surrounding commercial tools: CrowdStrike, Dynatrace/Datadog, Wiz, and Vault Enterprise are typically priced per node/host/workload, so a fleet that scales horizontally scales those bills too. Meter them per cluster in your observability tool and charge back to the teams that drive the node count.