Deploy MetalLB and kube-vip for Bare-Metal Kubernetes Load Balancing

A manufacturing company runs three Kubernetes clusters in its own datacenter — there is no cloud, by policy, because the workloads talk to PLCs on the factory floor and the latency and data-residency rules forbid leaving the building. The platform team stood up the clusters with kubeadm and immediately hit the wall every bare-metal operator hits: kubectl get svc shows their ingress controller stuck <pending> forever, because nothing on prem implements the LoadBalancer Service type the way a cloud controller-manager does. Worse, the API server is reachable only on one control-plane node’s IP, so a single reboot orphans every kubectl, every CI runner, and every Argo CD sync. This guide fixes both problems with two small, complementary projects: kube-vip for a highly-available control-plane VIP, and MetalLB to hand out real external IPs to Services — first the simple L2 way, then the production BGP way.

Prerequisites

Target topology

[Figure: target topology for the MetalLB and kube-vip deployment]

The two layers solve two different problems and must not be confused. kube-vip owns a single floating IP for the Kubernetes API server (192.168.40.10:6443); it runs as a static pod on every control-plane node, holds the VIP on whichever node currently leads a leader election, and moves it on failure. MetalLB owns the pool of external Service IPs (192.168.40.200-250); when you create a Service of type: LoadBalancer, MetalLB’s controller allocates an address from the pool and its speakers advertise it to the network — by gratuitous ARP in L2 mode, or by peering with your routers in BGP mode. Ingress (your NGINX or Envoy ingress controller) sits behind one of those MetalLB IPs; Akamai fronts the published apps at the internet edge for TLS, WAF and global caching, with its origin pointed at the MetalLB-assigned ingress VIP. North-south identity for cluster operators flows from Okta (federated to Entra ID) into the API server via OIDC, so the VIP that kube-vip protects is the single audited front door for every kubectl and every pipeline.
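Because the two layers own different addresses, the API VIP should never fall inside the MetalLB pool. The following is a minimal shell sketch of that sanity check using the addresses above; the helper names `ip_to_int` and `ip_in_range` are illustrative and not part of either project.

```shell
# Convert a dotted-quad IPv4 address to an integer (illustrative helper).
# The body runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# usage: ip_in_range IP START END ; succeeds when START <= IP <= END
ip_in_range() {
  ip=$(ip_to_int "$1")
  lo=$(ip_to_int "$2")
  hi=$(ip_to_int "$3")
  [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}

VIP=192.168.40.10
POOL_START=192.168.40.200
POOL_END=192.168.40.250

if ip_in_range "$VIP" "$POOL_START" "$POOL_END"; then
  echo "ERROR: API VIP $VIP overlaps the MetalLB pool"
else
  echo "OK: API VIP $VIP is outside the MetalLB pool"
fi
```

A check like this could run in CI next to the manifests, so a later pool change cannot silently swallow the control-plane address.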

1. Provide a control-plane VIP with kube-vip

Generate the kube-vip static-pod manifest on the first control-plane node. kube-vip ships its own generator inside the container image, so you render the manifest by running the image with ctr (containerd’s CLI, used below) or docker and drop the output into the static-pod directory. Pick ARP mode for a flat L2 control-plane network (the common case); BGP for the API VIP is possible, but most teams keep the API on ARP and reserve BGP for Services.

# On control-plane node #1, as root
export VIP=192.168.40.10
export INTERFACE=eth0           # the NIC on the node/API network
export KVVERSION=v0.8.7

# Render the static pod manifest using the kube-vip image itself
ctr image pull ghcr.io/kube-vip/kube-vip:$KVVERSION
ctr run --rm --net-host ghcr.io/kube-vip/kube-vip:$KVVERSION vip \
  /kube-vip manifest pod \
    --interface $INTERFACE \
    --address $VIP \
    --controlplane \
    --arp \
    --leaderElection \
  | tee /etc/kubernetes/manifests/kube-vip.yaml

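The kubelet watches /etc/kubernetes/manifests/ and starts the static pod within seconds. Because this is a static pod, kube-vip comes up before the rest of the control plane is healthy, which is exactly what you want: the VIP must exist for kubeadm join --control-plane to work. The generated static pod authenticates with the node’s admin kubeconfig, so it already has access to the leader-election Lease; the RBAC manifest below creates the ServiceAccount and roles kube-vip needs if you also run it as a DaemonSet, and it is harmless to apply now: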

kubectl apply -f https://kube-vip.io/manifests/rbac.yaml

Crucial bootstrap detail: place this manifest before you run kubeadm init on the first node, and make sure --control-plane-endpoint points at the VIP (or a DNS name that resolves to it), not at a node IP:

kubeadm init \
  --control-plane-endpoint "k8s-api.corp.local:6443" \
  --upload-certs \
  --pod-network-cidr=10.244.0.0/16

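Repeat the static-pod manifest drop on control-plane nodes #2 and #3 (same command, same $VIP), but only after each node has joined with kubeadm join --control-plane, because kubeadm’s join preflight checks expect /etc/kubernetes/manifests to be empty. The three kube-vip instances run a leader election over a Lease; only the leader answers ARP for 192.168.40.10. Reboot the leader and the VIP migrates in a couple of seconds, and kubectl reconnects without you changing a thing.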
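A rough way to observe the failover described above is to poll the API server through the VIP while the leader reboots. This sketch assumes `curl` is installed and that /readyz is reachable anonymously on the VIP, which is typical for kubeadm clusters but can be disabled. `count_failures` and `probe_api` are hypothetical helper names.

```shell
# Run a probe command N times and print how many attempts failed.
# usage: count_failures ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
count_failures() {
  attempts=$1
  pause=$2
  shift 2
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
    sleep "$pause"
  done
  echo "$failed"
}

# Probe the API server through the kube-vip VIP.
# -k skips TLS verification, acceptable only for this quick reachability check.
probe_api() {
  curl -ksf --max-time 1 "https://192.168.40.10:6443/readyz"
}

# Example (start it, then reboot the current leader):
#   count_failures 60 1 probe_api
```

With a one-second interval, the printed number approximates the seconds of API unavailability seen from that client.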

2. Install MetalLB

Install MetalLB with Helm so upgrades are a chart bump in your GitOps repo rather than a hand-edited manifest. MetalLB has two workloads: a single controller Deployment (does IP allocation) and a speaker DaemonSet (advertises the IPs from every node).

helm repo add metallb https://metallb.github.io/metallb
helm repo update

helm install metallb metallb/metallb \
  --namespace metallb-system --create-namespace \
  --version 0.14.9 \
  --set speaker.frr.enabled=true        # enable the FRR backend now; the vtysh checks in the BGP section rely on it

Wait for the pods, then confirm the CRDs are present — all configuration in modern MetalLB is via CRDs (the old ConfigMap is removed):

kubectl -n metallb-system rollout status deploy/controller
kubectl -n metallb-system rollout status ds/speaker

kubectl get crds | grep metallb.io
# ipaddresspools.metallb.io
# l2advertisements.metallb.io
# bgpadvertisements.metallb.io
# bgppeers.metallb.io

Define the address pool now — it is shared by both L2 and BGP modes. This is the only place your reserved Service range lives; treat it as managed state in Terraform/Git, not tribal knowledge.

# metallb-pool.yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.40.200-192.168.40.250
  autoAssign: true
  avoidBuggyIPs: true        # skip .0 and .255 in any /24 it touches

kubectl apply -f metallb-pool.yaml
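Since the pool is finite, it helps to compare how many LoadBalancer Services exist against how many addresses the range holds. The sketch below assumes the column layout of `kubectl get svc -A --no-headers` (TYPE in the third column) and a pool that stays within one /24; `count_lb_services` and `pool_size` are illustrative helpers.

```shell
# Count Services of type LoadBalancer.
# Reads `kubectl get svc -A --no-headers` output on stdin (TYPE is column 3).
count_lb_services() {
  awk '$3 == "LoadBalancer" { n++ } END { print n + 0 }'
}

# Inclusive size of a range inside one /24, given the first and last final octets.
pool_size() {
  echo $(( $2 - $1 + 1 ))
}

# Usage against a live cluster:
#   used=$(kubectl get svc -A --no-headers | count_lb_services)
#   echo "$used of $(pool_size 200 250) pool addresses requested"
```

For the range above, `pool_size 200 250` yields 51, so the check gives an early warning well before new Services start sitting in <pending> because the pool is exhausted.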

3. Mode A — Layer 2 (ARP) advertisement

L2 mode is the fastest path to a working LoadBalancer and needs zero network-team involvement: MetalLB simply answers ARP for the Service IP from one elected node. Apply an L2Advertisement that ties the pool to L2:

# metallb-l2.yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-prod
  namespace: metallb-system
spec:
  ipAddressPools:
    - prod-pool
  interfaces:
    - eth0          # restrict ARP to the data network NIC

kubectl apply -f metallb-l2.yaml

Now create a test Service and watch it get a real IP instead of <pending>:

kubectl create deploy web --image=nginx --port=80
kubectl expose deploy web --type=LoadBalancer --port=80 --target-port=80

kubectl get svc web -w
# NAME   TYPE           EXTERNAL-IP      PORT(S)
# web    LoadBalancer   192.168.40.200   80:31234/TCP

From any host on the L2 network, curl http://192.168.40.200 now hits NGINX. Understand the tradeoff before you ship it: in L2 mode one node holds a given Service IP at a time, so it is a failover mechanism, not load-balancing — all traffic for that IP ingresses through the elected node, and bandwidth is capped at that node’s NIC. It is perfect for an ingress controller VIP (where the ingress itself spreads load internally) and fine for moderate throughput. When you outgrow it, move to BGP without touching your apps.
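To see the single-node behaviour for yourself, look at which MAC address a client has learned for the Service IP; it belongs to the node currently answering ARP. This sketch assumes a Linux client on the same L2 segment and the usual `ip neigh show` output format; `mac_for_ip` is a hypothetical helper.

```shell
# Print the MAC address recorded for an IP.
# Reads `ip neigh show` output on stdin, e.g.:
#   192.168.40.200 dev eth0 lladdr 52:54:00:aa:bb:cc REACHABLE
mac_for_ip() {
  awk -v ip="$1" '$1 == ip { for (i = 2; i < NF; i++) if ($i == "lladdr") print $(i + 1) }'
}

# Usage on a client host, after sending some traffic to the Service IP:
#   ip neigh show | mac_for_ip 192.168.40.200
# Compare the result with each node's NIC address (`ip link show eth0` on the nodes).
```

After a failover the same command should print a different MAC, which is the gratuitous ARP from the newly elected node taking effect.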

4. Mode B — BGP advertisement (production)

BGP mode is what you run at scale: MetalLB peers with your routers, advertises each Service IP as a /32 route, and the routers use ECMP to spread traffic across all speaker nodes simultaneously — true horizontal load-balancing with no single chokepoint. First define the peer (your top-of-rack switch) and a BGP-specific advertisement:

# metallb-bgp.yaml
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-switch
  namespace: metallb-system
spec:
  myASN: 64512            # the cluster's ASN
  peerASN: 64501          # the router's ASN
  peerAddress: 192.168.40.1
  peerPort: 179
  holdTime: 90s
  keepaliveTime: 30s
  # no inline password here; in real deployments reference a Secret (for example, one synced from Vault) via passwordSecret
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgp-prod
  namespace: metallb-system
spec:
  ipAddressPools:
    - prod-pool
  aggregationLength: 32        # advertise each Service IP as a host route
  localPref: 100

kubectl apply -f metallb-bgp.yaml
# Remove the L2Advertisement so the pool is advertised one way only
kubectl delete l2advertisement l2-prod -n metallb-system

The matching side on an FRR/Cisco-style top-of-rack switch (managed by Terraform against the switch provider, never typed live):

router bgp 64501
  bgp router-id 192.168.40.1
  neighbor K8S peer-group
  neighbor K8S remote-as 64512
  neighbor K8S passive            ! let the speakers initiate
  neighbor K8S timers 30 90
  bgp listen range 192.168.40.0/24 peer-group K8S   ! accept sessions from any node in the subnet
  address-family ipv4 unicast
    neighbor K8S activate
    maximum-paths 6              ! ECMP across up to 6 speaker nodes

Confirm the sessions come up from MetalLB’s side:

kubectl get bgppeers -n metallb-system -o wide
# Check the FRR speaker for an Established session:
kubectl -n metallb-system exec ds/speaker -c frr -- vtysh -c "show bgp summary"
# Neighbor       V   AS   State/PfxRcd
# 192.168.40.1   4 64501   Established

The Service from step 3 needs no changes: apps are mode-agnostic, and only the network plumbing changed. Now every speaker node advertises 192.168.40.200/32, the router installs ECMP next-hops, and traffic spreads across the cluster. One caveat with your CNI: if you also run BGP in Calico, two BGP daemons on the same node cannot both hold a session with the same router, so either let Calico advertise the Service range itself or point MetalLB at different peers. Sessions that flap because both daemons compete for them are a frequent and confusing outage.

5. Wire ingress, edge, and the operating model

Point your ingress controller’s Service at a stable MetalLB IP so DNS and the edge never chase a moving target. Pin it explicitly rather than letting MetalLB auto-assign:

# nginx-ingress-svc patch
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    metallb.io/loadBalancerIPs: 192.168.40.210     # stable, documented VIP
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # preserve client source IP; only nodes with a local ingress pod receive traffic

externalTrafficPolicy: Local matters: it preserves the real client IP (your WAF and audit logs need it) and, in BGP mode, makes MetalLB advertise the IP only from nodes actually running an ingress pod, which gives tighter, healthier routing. Now slot this into the wider operating model the platform team already runs.

Two workloads are worth calling out as natural early tenants of this LB: a fleet of virtual appliances (firewalls, SD-WAN concentrators), which the team is migrating into the cluster as pods and which need the stable external IPs MetalLB now provides; and Moodle, the company’s internal training LMS, which gets a dedicated MetalLB VIP behind Akamai as the first user-facing service proving the path end to end.

Validation

Run these after any change to prove both layers are healthy:

# 1. Control-plane VIP is live and owned by a leader
ping -c2 192.168.40.10
kubectl get lease -n kube-system plndr-cp-lock -o wide      # who holds the VIP (kube-vip's default lease name)

# 2. MetalLB allocated an IP (no <pending>)
kubectl get svc -A | grep LoadBalancer

# 3. The IP is actually reachable
curl -sS -o /dev/null -w "%{http_code}\n" http://192.168.40.200

# 4. BGP sessions are Established and routes advertised
kubectl -n metallb-system exec ds/speaker -c frr -- vtysh -c "show bgp ipv4 unicast"

# 5. Failover test: drain the node holding a Service IP, confirm curl still answers
SVC_IP=$(kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
kubectl drain <speaker-node> --ignore-daemonsets --delete-emptydir-data
curl -sS -o /dev/null -w "%{http_code}\n" http://$SVC_IP      # should still return 200 via another node
kubectl uncordon <speaker-node>

A green run is: VIP pings and a Lease holder is shown, no Service is <pending>, the external IP returns 200, BGP shows Established, and the drain test keeps serving.
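The individual checks can be wrapped in a small runner that prints one PASS or FAIL line per check and keeps a failure count. This is only a sketch: `check` and `failures` are hypothetical names, and the commands in the usage comment are the ones from the list above.

```shell
failures=0

# Run one named check; print PASS/FAIL and count failures.
# usage: check "description" COMMAND [ARGS...]
check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Usage:
#   check "API VIP answers ping"     ping -c2 192.168.40.10
#   check "Service IP answers HTTP"  curl -sSf -o /dev/null http://192.168.40.200
#   [ "$failures" -eq 0 ] && echo "green run"
```

Exiting non-zero when `failures` is greater than zero makes the same script usable as a pipeline gate after each change.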

Rollback / teardown

Reverse cleanly — MetalLB first (it advertises live traffic), then kube-vip.

# Stop advertising and remove MetalLB config, then the chart
kubectl delete l2advertisement,bgpadvertisement,bgppeer,ipaddresspool -n metallb-system --all
helm uninstall metallb -n metallb-system
kubectl delete ns metallb-system

# Remove kube-vip from EACH control-plane node (kubelet stops the static pod on file removal)
rm -f /etc/kubernetes/manifests/kube-vip.yaml      # run on every control-plane node
kubectl delete -f https://kube-vip.io/manifests/rbac.yaml

Important ordering note: if the cluster’s controlPlaneEndpoint (set with kubeadm init --control-plane-endpoint) is the kube-vip VIP, do not tear down kube-vip while the cluster is in use, because you would cut off the API. Migrate the endpoint to a real load balancer or a single node IP first, then remove kube-vip. Any Service of type: LoadBalancer reverts to <pending> once MetalLB is gone; switch those you still need to NodePort as an interim.
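Before removing kube-vip, you can confirm what the cluster currently uses as its endpoint by reading the ClusterConfiguration that kubeadm stores in the kubeadm-config ConfigMap. The sketch below assumes an IPv4 address or DNS name followed by a port; `endpoint_host` is an illustrative helper.

```shell
# Extract the host part of controlPlaneEndpoint.
# Reads a kubeadm ClusterConfiguration (YAML) on stdin.
endpoint_host() {
  sed -n 's/^controlPlaneEndpoint:[[:space:]]*//p' | tr -d '"' | sed 's/:[0-9]*$//'
}

# Usage before touching kube-vip:
#   kubectl -n kube-system get cm kubeadm-config \
#     -o jsonpath='{.data.ClusterConfiguration}' | endpoint_host
# If this prints 192.168.40.10 or a DNS name that resolves to it,
# migrate the endpoint first.
```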

Common pitfalls

Security notes

The control-plane VIP is the cluster’s single most sensitive endpoint: lock the API server behind OIDC from Okta/Entra ID so every operator is authenticated and MFA-gated, restrict :6443 to management networks, and put a password on the BGPPeer sourced from HashiCorp Vault so no rogue host can inject routes. MetalLB speakers run with elevated network capabilities (they manipulate ARP/BGP), so keep CrowdStrike Falcon runtime protection on those nodes and let Wiz Code gate the IaC against an IPAddressPool that accidentally spans a public or management subnet. Treat the Service pool as a security boundary, not just an allocation list.

Cost notes

This is the cheap part of bare metal: both MetalLB and kube-vip are free, open-source, and replace per-hour cloud load balancers entirely, so there is no NLB/ALB bill and no per-rule charge. The only real costs are the reserved IP space (free, just planning) and the operational time to run BGP correctly. Budget for redundant top-of-rack switches if you go BGP, because ECMP across speakers is only as available as the routers underneath, and fold the Akamai edge and Dynatrace/Datadog monitoring into existing contracts rather than standing up new tooling. Compared to renting cloud load balancers for three clusters, this pays for the engineering time within the first month and then keeps paying every month after.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
