Kubernetes Monitoring, In Depth: metrics-server, Prometheus, Grafana & Alerting

A Kubernetes cluster is a fleet of moving parts — pods being scheduled and evicted, nodes filling up, Deployments rolling, autoscalers reacting — and you cannot operate what you cannot see. Monitoring is the sensory system of a cluster: it tells you whether the control plane is healthy, whether your workloads are getting the CPU and memory they asked for, whether users are actually being served, and — crucially — it is the source of truth that the Horizontal Pod Autoscaler reads to decide whether to add replicas. Get monitoring wrong and you are flying blind; get it right and the cluster becomes legible, debuggable and, increasingly, self-managing.

This lesson is a complete tour of the native Kubernetes monitoring stack, built bottom-up. We start with metrics-server — the small, in-cluster component that powers kubectl top and the HPA — and are careful to draw the line between it and a real monitoring system, because confusing the two is the single most common beginner mistake. We then build out Prometheus: its pull-based architecture, the four metric types (counter, gauge, histogram, summary), how scraping and service discovery work on Kubernetes, the PromQL query language, and the two exporters that turn a cluster into a rich metrics source — kube-state-metrics (object state) and node-exporter (machine metrics). We cover the Prometheus Operator and its ServiceMonitor/PodMonitor custom resources, which is how essentially everyone runs Prometheus on Kubernetes today. We add Grafana for dashboards and Alertmanager for turning firing rules into paged humans (with routing, grouping, inhibition and silences). Finally we step up a level to the methodology that separates noise from signal: the four golden signals, the USE and RED methods, and how to define SLIs and SLOs with error budgets so your alerts page you when users hurt — not when a graph wiggles.

By the end you will understand every component, every metric type, the query language, the Operator CRDs and the alerting pipeline well enough to stand up the stack, debug it, answer CKA-style questions about it and design alerts that a tired on-call engineer will thank you for.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You should be comfortable with core Kubernetes objects (Pods, Deployments, Services, Namespaces) and basic kubectl, and ideally have read the autoscaling lesson, since the HPA is the most important consumer of metrics-server. You will want a local cluster (kind or minikube) and Helm v3 for the hands-on lab. No prior Prometheus knowledge is assumed — we define every term.

This lesson sits in the Operations module of the Kubernetes Zero-to-Hero course, immediately after the networking-internals lesson (you need to understand Services and DNS to follow scrape discovery) and before the worker-node-internals lesson. Monitoring is the foundation that day-2 operations, autoscaling, SLOs and incident response all build on, so it is worth studying carefully.

Core concepts: observability, the two metrics pipelines, and the metrics triad

Observability is the property of a system that lets you ask arbitrary questions about its internal state from the outside, using its outputs. The three classic pillars of observability are metrics (numeric time series — cheap, aggregatable, ideal for alerting and dashboards), logs (timestamped text records of discrete events — rich, high-cardinality, good for forensics) and traces (the path of a single request across services — essential for distributed debugging). This lesson is about metrics; logs and traces are covered in the SigNoz/OpenTelemetry lesson linked at the end. A complete observability strategy needs all three, but metrics are where you start because they are cheap, they power alerting, and they drive autoscaling.

The single most important mental model in Kubernetes monitoring is that there are two separate metrics pipelines, and they exist for different reasons.

| Pipeline | Component | API it serves | Storage | Consumers | Purpose |
|---|---|---|---|---|---|
| Resource metrics | metrics-server | metrics.k8s.io (Metrics API) | In-memory, ~last value only | kubectl top, HPA, VPA | Fast, lightweight CPU/RAM for autoscaling |
| Full / custom metrics | Prometheus (+ adapter) | custom.metrics.k8s.io, external.metrics.k8s.io, plus PromQL/HTTP | On-disk TSDB, weeks of history | Dashboards, alerting, HPA-on-custom-metrics | Rich, historical, queryable monitoring |

The resource metrics pipeline is deliberately minimal: metrics-server scrapes only CPU and memory from each kubelet, keeps roughly the latest value in memory, and exposes it through the aggregated Metrics API (metrics.k8s.io). It has no history, no query language and no dashboards — it exists so that kubectl top is fast and so the HPA has a low-latency source. The full monitoring pipeline is Prometheus: it scrapes hundreds of metrics from many targets, stores them in a time-series database (TSDB) on disk for weeks, and exposes a powerful query language (PromQL). When people say “monitoring”, they almost always mean Prometheus; metrics-server is a tiny, single-purpose cousin. Confusing them — e.g. expecting kubectl top to show you yesterday’s memory, or expecting Prometheus to drive a basic CPU HPA without an adapter — is the classic beginner error.

A few more terms you will see throughout:

metrics-server: the resource metrics pipeline in full

metrics-server is a cluster add-on that collects resource usage (CPU and memory) from every node’s kubelet and exposes it through the Metrics API (metrics.k8s.io), registered into the API server via the API Aggregation Layer. It is what makes kubectl top nodes and kubectl top pods return numbers, and it is the default source for the HorizontalPodAutoscaler and VerticalPodAutoscaler.

How it works, end to end:

  1. Each kubelet computes CPU and memory usage for the node and its pods (sourced from the container runtime via cAdvisor, exposed at the kubelet’s /metrics/resource endpoint).
  2. metrics-server scrapes every kubelet on a short interval (default 15 seconds) over the kubelet’s secure port (10250).
  3. It keeps the latest readings in memory only (no database, no history) and serves them through the aggregated metrics.k8s.io API.
  4. kubectl top, the HPA controller and the VPA query that API.
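The four steps above can be sketched as a tiny in-memory store. This is an illustration only, not metrics-server's actual code: the names `Sample` and `LatestValueStore` are hypothetical, and the sketch assumes the kubelet reports a cumulative CPU-seconds counter, so usage in cores is the difference between the two most recent scrapes divided by the elapsed time.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float      # seconds
    cpu_seconds: float    # cumulative CPU time consumed by the pod
    memory_bytes: int     # current memory usage


class LatestValueStore:
    """Keeps only the two most recent samples per pod (no history)."""

    def __init__(self):
        self._prev = {}
        self._last = {}

    def record(self, pod: str, sample: Sample) -> None:
        # Shift the newest sample down and overwrite it; older data is discarded.
        if pod in self._last:
            self._prev[pod] = self._last[pod]
        self._last[pod] = sample

    def usage(self, pod: str):
        """Return (cpu_cores, memory_bytes), or None until two scrapes exist."""
        prev, last = self._prev.get(pod), self._last.get(pod)
        if prev is None or last is None or last.timestamp <= prev.timestamp:
            return None
        elapsed = last.timestamp - prev.timestamp
        cores = (last.cpu_seconds - prev.cpu_seconds) / elapsed
        return cores, last.memory_bytes


store = LatestValueStore()
store.record("web-0", Sample(0.0, 100.0, 200_000_000))
first = store.usage("web-0")             # only one scrape so far
store.record("web-0", Sample(15.0, 103.0, 210_000_000))
second = store.usage("web-0")            # 3 CPU-seconds over 15 s
print(first, second)
```

In this sketch the first call returns nothing because a rate needs two samples, which is one plausible reason a freshly installed metrics pipeline reports no data for its first few seconds.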

Key facts and gotchas:

| Aspect | Detail |
|---|---|
| Metrics collected | CPU and memory only; nothing else (no disk, no network, no custom metrics) |
| History | None: roughly the latest scrape; you cannot ask “what was memory an hour ago” |
| Scrape interval | --metric-resolution, default 15s; must be ≥ kubelet housekeeping interval |
| Transport | Scrapes kubelet over TLS on port 10250 |
| HA | Run ≥2 replicas for availability; each replica scrapes every kubelet independently, so readings are not combined across replicas |
| Not for | Monitoring, alerting, dashboards, capacity history (that is Prometheus’s job) |

The most common failure is metrics-server failing to scrape kubelets because of TLS: on kind/minikube and many self-managed clusters the kubelet serving certificate is self-signed and not in metrics-server’s trust store, so scrapes fail with x509 errors and kubectl top returns error: Metrics API not available. The well-known (and lab-only) fix is the flag --kubelet-insecure-tls; in production you instead enable proper kubelet serving certificates signed by the cluster CA (serverTLSBootstrap: true and an approver for the kubernetes.io/kubelet-serving CSRs). Other causes of “Metrics API not available”: metrics-server not installed at all, the aggregation layer not reaching the pod (network policy/firewall on port 4443/10250), or the pod crash-looping.

Because metrics-server feeds the HPA, its health is your autoscaler’s health: if kubectl top pods is broken, a CPU/memory HPA shows <unknown> in its TARGETS column and will not scale. Always verify metrics-server before debugging an HPA.

Prometheus: architecture and the pull model

Prometheus is an open-source monitoring system and time-series database, and the de-facto standard for Kubernetes. Its defining architectural choice is the pull model: Prometheus scrapes (HTTP GETs) a /metrics endpoint on each target on a fixed interval, rather than having targets push metrics to it. This is the opposite of many older systems (and of the StatsD/push style), and the trade-offs are worth understanding because they come up in interviews.

| Aspect | Pull (Prometheus default) | Push (e.g. via Pushgateway) |
|---|---|---|
| Target discovery | Prometheus owns the target list (service discovery) | Targets must know the server address |
| Liveness signal | A failed scrape is itself a signal (up == 0) | A silent target looks identical to a healthy one |
| Short-lived jobs | Awkward (job may exit before a scrape); use Pushgateway | Natural fit |
| Firewalls / NAT | Prometheus must reach targets | Targets reach Prometheus |
| Control | Central control of scrape rate, easy to fan out | Decentralised |
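To make the liveness row concrete, here is a minimal sketch of one pull cycle. The function names (`render_metrics`, `parse_metrics`, `scrape`) are hypothetical, the targets are in-process callables rather than real HTTP endpoints, and the parser handles only simple unlabelled "name value" lines of the text exposition format.

```python
def render_metrics(values: dict) -> str:
    """Render unlabelled metrics as 'name value' lines, one per metric."""
    return "".join(f"{name} {value}\n" for name, value in values.items())


def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines, skipping blank lines and '#' comment lines."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        parsed[name] = float(value)
    return parsed


def scrape(fetch) -> dict:
    """One pull: call the target and record 'up' as 1 on success, 0 on failure."""
    try:
        samples = parse_metrics(fetch())
        samples["up"] = 1.0
    except Exception:
        # The failed scrape is itself the signal: no samples, but up == 0.
        samples = {"up": 0.0}
    return samples


def healthy_target() -> str:
    body = render_metrics({"http_requests_total": 42, "queue_depth": 3})
    return "# TYPE http_requests_total counter\n" + body


def dead_target() -> str:
    raise ConnectionError("connection refused")


healthy = scrape(healthy_target)
dead = scrape(dead_target)
print(healthy, dead)
```

Because the scraper, not the target, records `up`, a target that has crashed still produces a sample the server can alert on.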

The major components of the Prometheus ecosystem:

Storage. Prometheus’s local TSDB writes incoming samples to a write-ahead log (WAL) and periodically compacts them into immutable two-hour blocks on disk, later merged into larger blocks. Retention is by time (--storage.tsdb.retention.time, default 15 days) and/or size. Local storage is not clustered or replicated — for long-term storage and global query you use remote write to a system like Thanos, Cortex, Mimir or VictoriaMetrics. For this lesson, single-server local storage is exactly right.

A scrape config tells Prometheus what to scrape and how. The classic static example:

global:
  scrape_interval: 15s        # how often to pull each target
  evaluation_interval: 15s    # how often to evaluate alerting/recording rules
scrape_configs:
  - job_name: "prometheus"    # the 'job' label applied to these targets
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.1:9100", "10.0.0.2:9100"]

On Kubernetes you almost never use static_configs; you use kubernetes_sd_configs (service discovery), which queries the API server for node, pod, service, endpoints, endpointslice or ingress objects and turns them into targets. Relabeling (relabel_configs) then filters and rewrites those targets — keep only pods with a certain annotation, set the scrape port from a label, drop noisy targets, and so on. In modern setups the Prometheus Operator generates all of this for you from ServiceMonitor/PodMonitor objects (covered below), so you rarely hand-write kubernetes_sd_configs — but you should understand that it is what runs underneath.
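The discover-then-relabel flow can be sketched in a few lines. This is not Prometheus code: `relabel_keep` is a hypothetical helper that mimics a relabel rule with the keep action, which retains a target only when the source label fully matches the regex. The meta label name follows Prometheus's naming scheme for pod annotations; the prometheus.io/scrape annotation is a common convention assumed here, and the pod data is invented.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def relabel_keep(targets, source_label, regex):
    """Keep only targets whose source label fully matches the regex."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical output of pod discovery: one label set per candidate target.
discovered = [
    {"__address__": "10.244.0.5:8080", "pod": "api-0", SCRAPE_LABEL: "true"},
    {"__address__": "10.244.0.6:8080", "pod": "api-1", SCRAPE_LABEL: "true"},
    {"__address__": "10.244.0.7:5432", "pod": "db-0"},   # no annotation
]

kept = relabel_keep(discovered, SCRAPE_LABEL, "true")
print([t["pod"] for t in kept])
```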

The four metric types

Every Prometheus metric is one of four types. Picking the right type is fundamental — using a gauge where you need a counter (or vice versa) produces nonsense graphs.

| Type | What it represents | Can it decrease? | Typical query | Example |
|---|---|---|---|---|
| Counter | A cumulative total that only ever increases (resets to 0 on restart) | No (except reset) | rate(x[5m]) | http_requests_total, container_cpu_usage_seconds_total |
| Gauge | A value that can go up or down | Yes | use directly, avg, max | node_memory_MemAvailable_bytes, kube_pod_status_phase |
| Histogram | Observations bucketed into configurable ranges, plus _sum and _count | No (buckets are counters) | histogram_quantile() | http_request_duration_seconds_bucket |
| Summary | Client-side computed quantiles, plus _sum and _count | _sum and _count: no (counters) | read the quantile series directly | rpc_duration_seconds{quantile="0.99"} |

Counter. The workhorse. Counters only go up (a process restart resets to zero — Prometheus’s rate() detects and handles the reset). You almost never look at a counter’s raw value; you look at its rate of change. rate(http_requests_total[5m]) gives requests per second averaged over 5 minutes. By convention counters end in _total.
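A small sketch shows why reset handling matters. It is a simplification, not Prometheus's implementation: the real rate() also extrapolates to the edges of the window, which is omitted here, and the function names and sample data are made up for illustration.

```python
def increase(samples):
    """Total increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a reset to zero, so the post-reset value
    counts as new increase instead of producing a negative delta.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# (seconds, counter value); the process restarts between t=30 and t=45.
window = [(0, 1000), (15, 1030), (30, 1060), (45, 20), (60, 50)]
naive_delta = window[-1][1] - window[0][1]    # ignores the reset
print(naive_delta, increase(window), rate(window))
```

Subtracting the first raw value from the last gives a large negative number here, while the reset-aware version recovers the real increase of 110 requests over 60 seconds.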

Gauge. A snapshot value that can rise and fall — current memory in use, current temperature, number of items in a queue, number of pods in Running. You graph gauges directly and aggregate them with avg/sum/max.

Histogram. The right tool for latency and response sizes. The application pre-defines buckets (e.g. ≤0.1s, ≤0.5s, ≤1s, …); each observation increments the counter for every bucket whose upper bound it does not exceed (buckets are cumulative; le = “less than or equal”). A histogram exposes three series families: <name>_bucket{le="..."}, <name>_sum and <name>_count. Crucially, quantiles are computed server-side at query time with histogram_quantile(), which means histograms are aggregatable across instances: you can compute a cluster-wide p99 by summing buckets across pods. The cost is choosing buckets up front, and bucket cardinality. (Newer native histograms remove the fixed-bucket limitation and are far more efficient, but classic bucketed histograms remain the common case.)
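The following sketch illustrates cumulative buckets and a quantile estimate over buckets summed across two pods. It is a simplified model under stated assumptions: the bucket bounds and latencies are invented, the function names are hypothetical, and the quantile uses linear interpolation inside the target bucket with 0 as the lower edge of the first bucket, for 0 < q ≤ 1.

```python
import math

BOUNDS = [0.1, 0.5, 1.0, math.inf]   # upper bounds (le); +Inf catches everything


def new_histogram() -> dict:
    return {le: 0 for le in BOUNDS}


def observe(buckets: dict, value: float) -> None:
    """Increment every cumulative bucket whose upper bound is >= value."""
    for le in buckets:
        if value <= le:
            buckets[le] += 1


def histogram_quantile(q: float, buckets: dict) -> float:
    """Estimate the q-quantile (0 < q <= 1) by interpolating within a bucket."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]           # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                return lower_bound        # cannot interpolate into +Inf
            width = le - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = le, count
    return lower_bound


pod_a, pod_b = new_histogram(), new_histogram()
for latency in (0.05, 0.08, 0.3, 0.4):
    observe(pod_a, latency)
for latency in (0.2, 0.7, 0.8, 0.9):
    observe(pod_b, latency)

# Aggregating across pods is just adding the bucket counters per le.
merged = {le: pod_a[le] + pod_b[le] for le in BOUNDS}
p50 = histogram_quantile(0.5, merged)
print(merged, round(p50, 3))
```

The estimate is only as precise as the bucket layout: every observation inside a bucket is assumed to be spread evenly across it.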

Summary. Also for latency/sizes, but quantiles are computed client-side over a sliding window and exposed directly as {quantile="0.5"}, {quantile="0.99"} series, alongside _sum and _count. The advantage is accurate per-instance quantiles with no bucket choice; the fatal limitation is that summary quantiles cannot be aggregated — you cannot average two pods’ p99s to get a meaningful cluster p99. For that reason, prefer histograms for anything you will aggregate across replicas (which on Kubernetes is almost everything). Use summaries only when you need an exact single-instance quantile and will never aggregate.
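A worked example makes the aggregation problem visible. The data is invented: one pod serves 1000 fast requests, another serves 100 slow ones. `exact_quantile` is a hypothetical helper using the nearest-rank method on raw observations, standing in for the per-instance quantile a summary would report.

```python
import math


def exact_quantile(percent: int, values: list) -> float:
    """Nearest-rank percentile over raw observations (percent is 1..100)."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


pod_fast = [0.01] * 990 + [0.05] * 10    # 1000 requests, almost all 10 ms
pod_slow = [2.0] * 100                   # 100 requests, all 2 s

p99_fast = exact_quantile(99, pod_fast)
p99_slow = exact_quantile(99, pod_slow)
averaged = (p99_fast + p99_slow) / 2                 # what averaging two p99s gives
true_p99 = exact_quantile(99, pod_fast + pod_slow)   # p99 over all requests

print(p99_fast, p99_slow, averaged, true_p99)
```

The averaged figure (about 1 s) matches neither pod and understates the real fleet-wide p99 of 2 s, because averaging ignores how many requests each pod served and where they fall in the combined distribution.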

Interview-grade one-liner: Use a histogram when you need to aggregate latency percentiles across instances (the Kubernetes default); use a summary only for accurate single-instance quantiles you will never aggregate.

PromQL: querying the time series

PromQL (Prometheus Query Language) is how you slice the data. Master a handful of constructs and you can answer almost anything.

Selectors and matchers. The simplest query is a metric name, which returns an instant vector (one sample per matching series at the evaluation time):

http_requests_total

Filter with label matchers in braces — = equals, != not-equals, =~ regex-match, !~ regex-not-match:

http_requests_total{job="api", status=~"5.."}      # all 5xx from the api job

Append a range in square brackets to get a range vector (a window of samples per series), which is what rate-style functions consume:

http_requests_total{job="api"}[5m]

Rate functions turn counters into per-second rates:

| Function | Use |
|---|---|
| rate(c[5m]) | Per-second average over the window; smooth; the default for alerting/dashboards |
| irate(c[5m]) | Instantaneous rate from the last two samples; spiky; for fast-moving graphs only |
| increase(c[1h]) | Total increase over the window (= rate * window); good for “how many in the last hour” |

Rule of thumb: use rate() for almost everything. Reach for irate() only for high-resolution graphs, never for alerts. Make the range at least 4× the scrape interval so a window always contains several samples.

Aggregation operators collapse many series into fewer, with by (keep these labels) or without (drop these labels):

sum(rate(http_requests_total[5m])) by (status)       # total RPS grouped by status code
avg(node_memory_MemAvailable_bytes) by (instance)    # avg available memory per node
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))   # 5 hottest pods
count(kube_pod_status_phase{phase="Running"} == 1)   # number of running pods

Common aggregators: sum, avg, min, max, count, count_values, stddev, topk, bottomk, quantile, group.

Percentiles from histograms — the canonical latency query:

histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

Note the pattern: rate() the buckets, sum ... by (le) to aggregate across instances while preserving the bucket boundary label, then histogram_quantile(). This is the most important PromQL snippet to memorise.

Binary operators and vector matching. Arithmetic (+ - * /), comparison (> < ==), and logical (and or unless) operators work between vectors, matching on identical label sets (use on(...)/ignoring(...) and group_left/group_right for many-to-one joins). An error-ratio SLI:

sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

Recording rules precompute expensive or frequently-used expressions on the evaluation_interval and save them as new series, so dashboards and alerts read a cheap pre-aggregated metric:

groups:
  - name: api-slo
    interval: 30s
    rules:
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))

The naming convention level:metric:operation (e.g. job:http_request_errors:ratio_rate5m) signals the aggregation level. Alerting rules are similar but use alert:/expr:/for:/labels:/annotations: and feed Alertmanager (see below).

kube-state-metrics: the state of your cluster objects

By itself, Prometheus + node-exporter tells you about machines. To monitor Kubernetes objects — Deployments, Pods, DaemonSets, Jobs, PVCs, nodes-as-objects — you need kube-state-metrics (KSM). KSM is a service that listens to the Kubernetes API and exposes the current state of objects as Prometheus gauges, without modification, caching or opinion.

KSM answers “is the cluster in the state I declared?” questions:

The critical distinction interviewers probe:

| | metrics-server | kube-state-metrics | node-exporter |
|---|---|---|---|
| Source | kubelet (cAdvisor) | Kubernetes API objects | the host OS (/proc, /sys) |
| Tells you | resource usage (CPU/RAM) | object state (desired vs actual, phases, counts) | machine health (CPU, mem, disk, net, FS) |
| Exposes | Metrics API for HPA | Prometheus metrics (mostly gauges) | Prometheus metrics (counters and gauges) |
| Example question | “how much CPU is this pod using?” | “are all my replicas available?” | “is the node’s disk full?” |

KSM does not report resource usage (that is metrics-server/cAdvisor) and it is not the same as metrics-server — it is a complement. A healthy Prometheus stack runs all three: node-exporter (machines), kube-state-metrics (objects), and metrics-server (HPA), with Prometheus also scraping cAdvisor-style container metrics from the kubelet directly.

node-exporter: machine-level metrics

node-exporter is the canonical Prometheus exporter for *nix machine metrics. It runs as a DaemonSet (one pod per node), reads the host’s /proc and /sys (mounted in), and exposes hardware/OS metrics on :9100/metrics. It is how you see what is happening underneath Kubernetes — the actual Linux box.

Key metric families:

Example: node CPU utilisation as a percentage:

100 * (1 - avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))

Because node-exporter exposes the host, it is the basis of the USE method (Utilisation, Saturation, Errors — see below) for nodes, and it is what tells you the cluster is about to start evicting pods because a node’s disk or memory is exhausted — something Kubernetes object metrics alone cannot reveal.

The Prometheus Operator: ServiceMonitor, PodMonitor and friends

Running Prometheus on Kubernetes by hand-editing prometheus.yml and reloading it does not scale — every new service means editing central config. The Prometheus Operator solves this with the operator pattern: it installs CustomResourceDefinitions and a controller that generates Prometheus’s configuration from Kubernetes objects, so monitoring becomes declarative and namespaced. This is how virtually everyone runs Prometheus on Kubernetes today, most commonly via the kube-prometheus-stack Helm chart (Operator + Prometheus + Alertmanager + Grafana + node-exporter + kube-state-metrics + a library of dashboards and alert rules, in one install).

The Operator’s custom resources:

| CRD | Purpose |
|---|---|
| Prometheus | Declares a Prometheus instance (replicas, retention, storage, resources, which monitors/rules it selects) |
| Alertmanager | Declares an Alertmanager cluster |
| ServiceMonitor | Declares how to scrape a set of Services (selects Services by label, names the port, sets path/interval) |
| PodMonitor | Declares how to scrape Pods directly (no Service needed) |
| Probe | Declares blackbox/synthetic probes of ingresses or static targets |
| PrometheusRule | Declares recording and alerting rules as a Kubernetes object |
| AlertmanagerConfig | Namespaced routing/receivers, so teams own their own alert routing |
| ScrapeConfig | An escape hatch for raw scrape configs (e.g. external targets) the CRDs don’t cover |

The key insight is the two-level selection: the Prometheus resource has a serviceMonitorSelector (a label selector) that picks which ServiceMonitors it honours; each ServiceMonitor in turn has a selector that picks which Services it scrapes. A ServiceMonitor that no Prometheus selects is silently ignored — the most common “my target isn’t showing up” cause.
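The two-level selection can be modelled as two nested label matches. This is a sketch of the idea only, not Operator code: the object shapes are plain dictionaries invented for the example, namespace selection is ignored, and only matchLabels-style equality is modelled.

```python
def matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector pair must be present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def scraped_services(prometheus_selector, service_monitors, services):
    """Return the names of Services that end up as scrape targets."""
    selected = []
    for monitor in service_monitors:
        # Level 1: Prometheus only honours ServiceMonitors its selector matches.
        if not matches(prometheus_selector, monitor["labels"]):
            continue
        # Level 2: each honoured ServiceMonitor picks Services by label.
        for svc in services:
            if matches(monitor["selector"], svc["labels"]) and svc["name"] not in selected:
                selected.append(svc["name"])
    return selected


prometheus_selector = {"release": "kube-prometheus-stack"}
service_monitors = [
    {"name": "payments", "labels": {"release": "kube-prometheus-stack"},
     "selector": {"app": "payments"}},
    {"name": "orders", "labels": {},              # missing the release label
     "selector": {"app": "orders"}},
]
services = [
    {"name": "payments", "labels": {"app": "payments"}},
    {"name": "orders", "labels": {"app": "orders"}},
]

targets = scraped_services(prometheus_selector, service_monitors, services)
print(targets)
```

The orders Service is perfectly healthy and its ServiceMonitor is valid, yet it never becomes a target because the first-level match fails, and nothing reports an error.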

A typical ServiceMonitor for an app whose Service exposes a metrics port:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments
  namespace: shop
  labels:
    release: kube-prometheus-stack   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payments                  # selects Services with this label
  namespaceSelector:
    matchNames: ["shop"]
  endpoints:
    - port: metrics                  # the *named* port on the Service
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

The two most common Operator mistakes: (1) the ServiceMonitor’s labels do not match the Prometheus instance’s serviceMonitorSelector (so Prometheus ignores it), and (2) the endpoint’s port field is given a number when it must be the Service port’s name. Use a PodMonitor when the pods have no Service in front of them (e.g. a DaemonSet or a batch workload). To confirm the Operator did its job, check Status → Targets and Status → Configuration in the Prometheus UI.

A PrometheusRule carries alerts and recording rules as a first-class object the Operator wires into Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo
  namespace: shop
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: payments.rules
      rules:
        - alert: PaymentsHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="payments",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payments"}[5m])) > 0.02
          for: 10m
          labels: { severity: page }
          annotations:
            summary: "Payments 5xx error ratio above 2% for 10m"

Grafana: dashboards and visualisation

Grafana is the visualisation and dashboarding layer. It connects to one or more data sources (Prometheus being the canonical one, but also Loki for logs, Tempo for traces, and many databases), runs PromQL on your behalf, and renders panels (time series graphs, stat/gauge/bar panels, tables, heatmaps) on dashboards. It is technically independent of Prometheus, but the two are almost always deployed together (and kube-prometheus-stack bundles Grafana pre-wired with a large set of cluster dashboards).

What you should know:

A dashboard is for exploration and situational awareness; it should not be your alerting mechanism — nobody is staring at a screen at 3am. Alerts come from Prometheus rules.

Alertmanager: routing, grouping, inhibition and silences

Prometheus fires alerts (a rule’s expr is true for its for duration), but Prometheus does not send notifications. It pushes firing alerts to Alertmanager, whose entire job is to turn a stream of alerts into the right notifications to the right people, without spamming them. The split is deliberate: Prometheus decides what is wrong; Alertmanager decides who to tell and how.

The alert lifecycle: a PrometheusRule’s expr becomes true → it is pending for the for duration (debounce) → it becomes firing and is pushed to Alertmanager → Alertmanager groups, dedups, routes, applies inhibition and silences, then notifies a receiver.
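The pending-to-firing debounce can be sketched as a small state machine. The function name and the evaluation results below are invented for illustration; the sketch assumes the rule is evaluated at a fixed interval and that any false evaluation resets the timer.

```python
def alert_states(expr_results, for_seconds, eval_interval):
    """Return the alert state after each evaluation cycle.

    expr_results holds one boolean per cycle: whether the rule's expr
    returned a result at that evaluation.
    """
    states = []
    active_since = None
    for cycle, is_true in enumerate(expr_results):
        now = cycle * eval_interval
        if not is_true:
            active_since = None          # condition cleared: the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now           # first evaluation where expr is true
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


results = [False, True, True, False, True, True, True, True]
states = alert_states(results, for_seconds=60, eval_interval=30)
print(states)
```

The brief recovery in the fourth cycle resets the timer, so the alert needs another full 60 seconds of continuous failure before it fires. That reset is what stops short blips from paging anyone.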

Alertmanager’s core features:

| Feature | What it does |
|---|---|
| Routing tree | A route with nested routes matches alerts by label (match/matchers) and sends them to a receiver; supports per-route timing |
| Grouping | group_by bundles related alerts into one notification (e.g. all alerts for a cluster, or all instances of one alert) so a node failure is one page, not 50 |
| Timing | group_wait (wait before first send, to batch), group_interval (wait before sending updates to a group), repeat_interval (how often to re-notify a still-firing alert) |
| Receivers | Integrations: email, Slack, PagerDuty, Opsgenie, webhook, etc. |
| Inhibition | One alert suppresses others (e.g. a ClusterDown alert inhibits all the per-service alerts it would obviously cause) |
| Silences | Time-bounded, label-matched mutes you create in the UI/API (e.g. during a planned maintenance window); alerts still fire but are not notified |
| HA | Run Alertmanager as a gossiping cluster; it dedups so multiple Prometheis don’t double-page |

A representative configuration:

route:
  receiver: "slack-default"
  group_by: ["alertname", "namespace"]   # batch by alert + namespace
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="page"']       # page-severity goes to PagerDuty
      receiver: "pagerduty"
    - matchers: ['namespace="dev"']        # dev alerts go to a low-noise channel
      receiver: "slack-dev"

inhibit_rules:
  - source_matchers: ['severity="page"', 'alertname="ClusterDown"']
    target_matchers: ['severity=~"warning|page"']
    equal: ["cluster"]                     # cluster-down hides everything in that cluster

receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#alerts"
        api_url_file: /etc/alertmanager/secrets/slack-url   # never inline the webhook
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pd-key
  - name: "slack-dev"
    slack_configs:
      - channel: "#alerts-dev"
        api_url_file: /etc/alertmanager/secrets/slack-url   # needed unless a global slack_api_url is set

Grouping is what keeps on-call humane: when a node dies and 40 pods go unready, group_by collapses that into a single notification. Inhibition is the next level — the NodeDown page suppresses the 40 downstream PodNotReady warnings entirely. Both should be configured before you enable real paging.
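The node-failure scenario can be simulated to show both effects. This is a toy model, not Alertmanager's implementation: `group` and `inhibit` are hypothetical helpers, the alerts are invented, and the `node` label used to tie the NodeDown alert to its pods is an assumption about how the alerts are labelled.

```python
from collections import defaultdict


def group(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibit(alerts, source, target_severities, equal):
    """Drop target alerts that share the 'equal' labels with a firing source alert."""
    sources = [a for a in alerts if all(a.get(k) == v for k, v in source.items())]
    kept = []
    for alert in alerts:
        suppressed = (
            alert not in sources         # a source alert never inhibits itself here
            and alert.get("severity") in target_severities
            and any(all(alert.get(l) == s.get(l) for l in equal) for s in sources)
        )
        if not suppressed:
            kept.append(alert)
    return kept


node_down = {"alertname": "NodeDown", "severity": "page", "node": "node-1"}
pod_alerts = [
    {"alertname": "PodNotReady", "severity": "warning", "namespace": "shop",
     "node": "node-1", "pod": f"web-{i}"}
    for i in range(40)
]
alerts = pod_alerts + [node_down]

groups = group(alerts, ["alertname", "namespace"])
kept = inhibit(alerts, {"alertname": "NodeDown"}, {"warning"}, equal=["node"])
print(len(alerts), len(groups), len(kept))
```

Grouping alone turns 41 alerts into 2 notifications; adding the inhibition rule leaves only the NodeDown page, which is the one a responder can act on.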

Figure: Kubernetes monitoring stack

The diagram above shows the full pipeline: node-exporter (per-node DaemonSet) and kube-state-metrics (cluster object state) and your instrumented apps expose /metrics; the Prometheus Operator turns ServiceMonitor/PodMonitor objects into scrape config; Prometheus pulls those targets into its TSDB; Grafana queries Prometheus for dashboards; Prometheus evaluates PrometheusRule alerts and pushes them to Alertmanager, which routes, groups and silences before notifying Slack/PagerDuty — while the separate, lightweight metrics-server pipeline feeds kubectl top and the HPA.

The four golden signals, USE and RED

Tools collect metrics; methodology decides which metrics matter and when to page. Three frameworks dominate, and they are complementary.

The four golden signals (from Google’s SRE book) — the signals to monitor for any user-facing service:

| Signal | Question | Typical metric |
|---|---|---|
| Latency | How long do requests take? (split success vs error latency) | histogram_quantile(0.99, ...) on request duration |
| Traffic | How much demand? | sum(rate(http_requests_total[5m])) |
| Errors | What fraction is failing? | 5xx ratio, exception rate |
| Saturation | How “full” is the service? (the constrained resource) | queue depth, CPU/mem near limit, thread-pool usage |

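The RED method (Tom Wilkie) is a request-centric specialisation for services, easy to remember: Rate (requests per second), Errors (failed requests per second) and Duration (the distribution of request latency).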

RED is essentially the golden signals minus saturation; it is the right default dashboard for every microservice.

The USE method (Brendan Gregg) is resource-centric, for machines and components: for every resource, check Utilisation, Saturation and Errors.

The clean division: RED for services (request-driven, from app instrumentation), USE for resources (from node-exporter/cAdvisor). A complete cluster dashboard set has RED panels per service and USE panels per node. Saturation is the signal beginners most often omit and the one that gives you lead time: a resource can run at 100% utilisation and still keep up, so it is queuing (saturation) that gives the early warning of impending failure.

SLIs, SLOs and error budgets

The final and most important step is to stop alerting on causes (CPU is high) and start alerting on symptoms users feel, expressed as service-level objectives.

Burn-rate alerting is the modern best practice and the reason SLOs beat threshold alerts. Instead of paging on “errors > 2%”, you page on how fast you are consuming the error budget. A burn rate of 1× consumes exactly the whole budget over the SLO window. A multi-window, multi-burn-rate alert combines a fast, high-rate signal (e.g. 14.4× burn over 1h, which would exhaust a 30-day budget in ~2 days: page now) with a slower, lower-rate signal (e.g. 1× burn over 3 days: ticket), each gated by a short and a long window so you alert quickly on big breakages but don’t flap on transient blips. The payoff: you page when users are actually being harmed at a rate that threatens the SLO, not every time a graph twitches, which means far less alert fatigue and far fewer false pages. This is the destination the whole stack exists to reach.
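The arithmetic behind the 14.4× figure is worth working through once. The sketch below assumes a 99.9% availability SLO over a 30-day window; the SLO value and the helper names are illustrative choices, not something the stack prescribes.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole error budget is gone at a constant burn rate."""
    return window_days / rate


def budget_fraction_consumed(rate: float, hours: float, window_days: float = 30.0) -> float:
    """Share of the total error budget used after burning at 'rate' for 'hours'."""
    return rate * hours / (window_days * 24.0)


slo = 0.999                                    # 99.9% of requests must succeed
fast_threshold = 14.4 * (1.0 - slo)            # error ratio that equals a 14.4x burn
observed = burn_rate(0.0144, slo)              # 1.44% of requests failing
days_left = days_to_exhaustion(14.4)
used_in_one_hour = budget_fraction_consumed(14.4, hours=1)

print(round(fast_threshold, 4), round(observed, 1),
      round(days_left, 2), round(used_in_one_hour, 3))
```

Under these assumptions a 14.4× burn corresponds to a 1.44% error ratio, uses 2% of the month's budget in a single hour, and empties the budget in roughly two days, which is why it is treated as page-worthy.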

Hands-on lab: stand up the full stack on kind

We will create a local cluster, install kube-prometheus-stack (Operator + Prometheus + Grafana + Alertmanager + node-exporter + kube-state-metrics), deploy a sample app, scrape it with a ServiceMonitor, query it in Prometheus and Grafana, and add a PrometheusRule. Everything runs locally and is free.

0. Prerequisites. Install kind, kubectl and helm (v3). Then create a cluster:

kind create cluster --name monitoring-lab
kubectl cluster-info --context kind-monitoring-lab

1. Install metrics-server (kind ships without it) and prove the resource pipeline.

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo update
helm install metrics-server metrics-server/metrics-server -n kube-system \
  --set "args={--kubelet-insecure-tls}"          # lab-only: kind kubelet uses a self-signed cert

kubectl -n kube-system rollout status deploy/metrics-server
sleep 30
kubectl top nodes        # expect CPU/MEM columns with numbers, not an error
kubectl top pods -A      # per-pod usage

If kubectl top returns error: Metrics API not available, wait another 20–30s for the first scrape, then re-check; a persistent failure usually means the --kubelet-insecure-tls flag was not applied (check the metrics-server pod’s args and logs).

2. Install the full Prometheus stack via Helm.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install kps prometheus-community/kube-prometheus-stack -n monitoring \
  --set grafana.adminPassword='lab-password'

kubectl -n monitoring rollout status deploy/kps-grafana
kubectl -n monitoring get pods
# Expect: prometheus-kps-... , alertmanager-kps-... , kps-grafana-... ,
#         kps-kube-state-metrics-... , and a kps-prometheus-node-exporter-... per node.
kubectl -n monitoring get servicemonitors    # the bundled monitors (kubelet, apiserver, etc.)

3. Open the Prometheus UI and explore targets and types.

kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090
# (service name may be 'kps-prometheus' — check: kubectl -n monitoring get svc | grep prometheus)

Browse to http://localhost:9090. Under Status → Targets every target should be UP. In the expression bar try:

up                                                  # 1 per healthy target
kube_pod_status_phase{phase="Running"}              # kube-state-metrics: running pods (gauge)
rate(node_cpu_seconds_total{mode="idle"}[5m])       # node-exporter counter -> rate
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)   # per-pod CPU

4. Deploy a sample app and scrape it with a ServiceMonitor. We use an app that natively exposes Prometheus metrics on a metrics port.

kubectl create namespace demo
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata: { name: hello, namespace: demo }
spec:
  replicas: 2
  selector: { matchLabels: { app: hello } }
  template:
    metadata: { labels: { app: hello } }
    spec:
      containers:
        - name: hello
          image: quay.io/prometheus/prometheus:v2.53.0   # exposes /metrics on 9090
          args: ["--config.file=/etc/prometheus/prometheus.yml"]
          ports: [{ name: metrics, containerPort: 9090 }]
---
apiVersion: v1
kind: Service
metadata: { name: hello, namespace: demo, labels: { app: hello } }
spec:
  selector: { app: hello }
  ports: [{ name: metrics, port: 9090, targetPort: metrics }]
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hello
  namespace: demo
  labels: { release: kps }          # must match the Prometheus serviceMonitorSelector
spec:
  selector: { matchLabels: { app: hello } }
  endpoints:
    - port: metrics                  # the *named* Service port, not a number
      interval: 15s
EOF

Wait ~30s, then in the Prometheus UI under Status → Targets you should see a new serviceMonitor/demo/hello/0 job with both pods UP. If it does not appear, check the ServiceMonitor release label matches (kubectl -n monitoring get prometheus -o yaml | grep -A3 serviceMonitorSelector) and that the endpoint port name matches the Service.
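The checks in that paragraph can be gathered into one helper that prints everything side by side. This is only a sketch: the function name is hypothetical, and the namespaces and object names (monitoring, demo, hello) are the ones used in this lab.

```shell
# Hypothetical helper: prints the selector, the ServiceMonitor labels and the Service wiring.
debug_servicemonitor() {
  echo "# Prometheus serviceMonitorSelector:"
  kubectl -n monitoring get prometheus \
    -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
  echo
  echo "# ServiceMonitor labels (must satisfy the selector above):"
  kubectl -n demo get servicemonitor hello --show-labels
  echo "# Service port names (must include the endpoint port name):"
  kubectl -n demo get service hello -o jsonpath='{.spec.ports[*].name}'
  echo
  echo "# Endpoints behind the Service (empty means the selector matches no pods):"
  kubectl -n demo get endpoints hello
}
```

Calling debug_servicemonitor should make a label or port-name mismatch visible at a glance.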

5. Add a PrometheusRule (alerting rule).

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-rules
  namespace: demo
  labels: { release: kps }
spec:
  groups:
    - name: hello.rules
      rules:
        - alert: HelloTargetDown
          expr: up{job="hello"} == 0
          for: 1m
          labels: { severity: warning }
          annotations: { summary: "A hello replica has been down for 1m" }
EOF

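In the Prometheus UI under Alerts you should see HelloTargetDown in the inactive state. It goes pending and then firing only while a target is still discovered but its scrape fails, because that is when up reads 0. Deleting a pod (kubectl -n demo delete pod -l app=hello --field-selector ...) or scaling the Deployment down will usually remove the target from service discovery instead, so its up series simply disappears and the rule stays inactive. To exercise the rule, break the scrape without removing the pod, for example by temporarily setting the ServiceMonitor endpoint path to one that does not serve metrics.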
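A complementary rule that counts healthy targets also catches replicas that vanish from service discovery, which up == 0 cannot see. A minimal sketch, assuming the lab's two replicas and job="hello"; the alert name and file name are hypothetical, and the block only writes a local file:

```shell
# Writes an alternative rule to a local file; nothing is applied to the cluster here.
cat > hello-replicas-rule.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-replicas
  namespace: demo
  labels: { release: kps }
spec:
  groups:
    - name: hello.replicas
      rules:
        - alert: HelloReplicasMissing
          # Fewer than 2 healthy targets, or no healthy target at all.
          expr: count(up{job="hello"} == 1) < 2 or absent(up{job="hello"} == 1)
          for: 1m
          labels: { severity: warning }
          annotations: { summary: "Fewer than 2 hello replicas are being scraped successfully" }
EOF
```

After kubectl apply -f hello-replicas-rule.yaml, scaling down with kubectl -n demo scale deploy/hello --replicas=1 should move this alert to pending and, a minute later, to firing.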

6. Open Grafana and view a dashboard.

kubectl -n monitoring port-forward svc/kps-grafana 3000:80
# login: admin / lab-password

Browse to http://localhost:3000, open Dashboards and explore the bundled “Kubernetes / Compute Resources / Namespace (Pods)” dashboard. Then create a panel with the query sum(rate(container_cpu_usage_seconds_total{namespace="demo"}[5m])) by (pod) to see your app’s CPU.

7. Validation checklist.

kubectl top nodes                                  # metrics-server pipeline works
kubectl -n monitoring get prometheus,alertmanager  # Operator-managed instances exist
kubectl get servicemonitor -A                      # your 'hello' monitor is listed
# In Prometheus UI: Status->Targets all UP; Alerts shows HelloTargetDown
# In Grafana: a cluster dashboard renders data

8. Cleanup.

kubectl delete namespace demo
helm uninstall kps -n monitoring
kubectl delete namespace monitoring
helm uninstall metrics-server -n kube-system
kind delete cluster --name monitoring-lab

Cost note. Everything here runs in a local kind cluster on your laptop, so there is no cloud cost. In a managed cluster, the levers that move the monitoring bill are Prometheus retention (disk), scrape interval, and cardinality: RAM and storage scale with the number of active time series, which makes cardinality the dominant cost of Prometheus itself. Grafana and Alertmanager are negligible by comparison. If you remote-write to a long-term backend (Thanos, Mimir or a managed Prometheus service), that is usually the largest line item at scale.
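Since cardinality drives cost, it is worth knowing which metric names own the most series. The sketch below asks the Prometheus HTTP query API for the top ten; it assumes the port-forward from step 3 is still running, and both the helper name and the PROM_URL variable are hypothetical. The match-everything selector can be expensive on a large server, so treat it as a lab query.

```shell
# Hypothetical helper: returns (as JSON) the 10 metric names with the most time series.
# PROM_URL is an assumed override; it defaults to the lab's port-forward address.
top_series() {
  curl -s --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'
}
```

The same PromQL expression can be pasted into the expression bar if you prefer the UI.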

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| kubectl top → “Metrics API not available” | metrics-server not installed, or kubelet TLS (x509) scrape failures | Install metrics-server; on self-signed kubelets use --kubelet-insecure-tls (lab) or enable signed kubelet serving certs (prod) |
| HPA TARGETS shows <unknown> | metrics-server broken, or requests not set on pods | Fix metrics-server first; set CPU/memory requests (utilisation targets are % of request) |
| ServiceMonitor created but no target appears | Its labels don’t match the Prometheus serviceMonitorSelector, or wrong namespace selection | Add the matching label (e.g. release: kps); check Status → Configuration/Targets in the UI |
| ServiceMonitor endpoint scrapes nothing | port set to a number instead of the Service port name | Use the named port; ensure the Service actually exposes /metrics |
| Latency p99 graph looks wrong / impossible to aggregate | Using a summary and trying to average quantiles across pods | Switch to a histogram and use histogram_quantile(0.99, sum(rate(..._bucket[5m])) by (le)) |
| Counter graph shows huge negative spikes | Graphing a counter’s raw value across a restart (reset to 0) | Use rate()/increase(), which handle resets |
| Alertmanager floods on a node failure | No group_by/inhibition configured | Group by alertname/node; add inhibition so NodeDown suppresses downstream alerts |
| Prometheus OOMs or disk fills | High cardinality (per-request/per-user labels) or long retention | Drop high-cardinality labels via relabeling; tune retention; pre-aggregate with recording rules |
| Alerts never fire despite the condition being true | Rule has a long for: and the condition flaps below it; or PrometheusRule label not selected by the Prometheus | Lower/verify for:; ensure the rule’s labels match ruleSelector |
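For the Alertmanager flood row, grouping and inhibition might look like the fragment below. It is a sketch only: the NodeDown alert name comes from the table, the node label and the warning severity are assumptions about your own rules, and how the file reaches Alertmanager depends on your installation. The block just writes a local file.

```shell
# Writes an illustrative Alertmanager config fragment to a local file.
cat > alertmanager-inhibit.yaml <<'EOF'
route:
  receiver: default
  group_by: [alertname, node]        # one notification per alert name and node
inhibit_rules:
  - source_matchers: ['alertname="NodeDown"']
    target_matchers: ['severity="warning"']
    equal: [node]                    # only mute alerts that carry the same node label
receivers:
  - name: default                    # no integrations configured in this sketch
EOF
```

With a rule like this, a firing NodeDown for a node mutes warning-level alerts that share its node label.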

Best practices

Security notes

Interview & exam questions

  1. What is the difference between metrics-server and Prometheus? metrics-server is a tiny resource-metrics pipeline that scrapes only CPU/memory from kubelets, keeps the latest value in memory (no history), and serves the aggregated Metrics API for kubectl top and the HPA. Prometheus is a full monitoring system: it scrapes many targets, stores time series on disk for weeks, and offers PromQL for dashboards and alerting. They are complementary; metrics-server is not a monitoring system.

  2. Why does Prometheus pull instead of push, and what are the trade-offs? Pull lets Prometheus own service discovery and gives a free liveness signal (a failed scrape sets up to 0); it’s awkward for short-lived jobs (use Pushgateway) and requires network reachability to targets. Push fits ephemeral jobs and firewalled targets but loses the “silent target = down” signal.

  3. Explain the four metric types and when to use each. Counter — monotonically increasing total, always rate()d (requests, bytes). Gauge — value up/down (memory, queue length). Histogram — bucketed observations, quantiles computed server-side with histogram_quantile(), aggregatable across instances — the default for latency. Summary — quantiles computed client-side, accurate per-instance but not aggregatable.

  4. Histogram vs summary — which for cluster-wide p99 latency, and why? Histogram. Summary quantiles are pre-computed per instance and cannot be meaningfully averaged across pods. Histograms expose raw le buckets, so you sum(rate(..._bucket[5m])) by (le) across instances and then histogram_quantile(0.99, ...).

  5. Write a PromQL query for the p99 request latency across all replicas of a service. histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)).

  6. rate() vs irate() vs increase()? rate() = smooth per-second average over the window (use for alerts/dashboards). irate() = instantaneous rate from the last two samples (spiky high-res graphs only). increase() = total rise over the window (rate × window), good for “how many in the last hour”. All three correctly handle counter resets.

  7. What do kube-state-metrics and node-exporter each provide, and how do they differ from metrics-server? kube-state-metrics exposes the state of Kubernetes objects (replica counts, pod phases, restart counts, node conditions) from the API. node-exporter exposes host OS metrics (CPU, memory, disk, network) per node via a DaemonSet. metrics-server exposes live resource usage (CPU/RAM) for the HPA. State vs machine vs usage — three different questions.

  8. What problem does the Prometheus Operator solve, and what do ServiceMonitor and PodMonitor do? It makes Prometheus configuration declarative and Kubernetes-native: instead of editing prometheus.yml, you create CRDs and a controller generates the config. ServiceMonitor says how to scrape a set of Services; PodMonitor scrapes Pods directly (no Service). A Prometheus resource selects which monitors it honours via serviceMonitorSelector.

  9. My ServiceMonitor exists but the target never appears in Prometheus. Why? Most often its labels don’t match the Prometheus instance’s serviceMonitorSelector (so it’s ignored), the port is a number instead of the Service port name, the namespaceSelector excludes it, or the Service doesn’t actually expose /metrics. Check Status → Targets/Configuration in the UI.

  10. What is the difference between Prometheus alerting rules and Alertmanager? Prometheus evaluates alerting rules and fires alerts when an expr is true for its for duration. Alertmanager receives firing alerts and handles routing, grouping, deduplication, inhibition, silences and delivery to receivers. Prometheus decides what is wrong; Alertmanager decides who to tell and how.

  11. What are the four golden signals, and how do USE and RED relate? Golden signals: latency, traffic, errors, saturation. RED (Rate, Errors, Duration) is the request-centric subset for services. USE (Utilisation, Saturation, Errors) is resource-centric for machines/components. Use RED for services, USE for resources; saturation is the early-warning signal beginners omit.

  12. Define SLI, SLO, SLA and error budget, and explain burn-rate alerting. SLI = a measured good/total ratio reflecting user happiness. SLO = a target for that SLI over a window (e.g. 99.9%/30d). SLA = a contractual promise with penalties (looser than the SLO). Error budget = 100% − SLO (99.9% ≈ 43 min/month). Burn-rate alerting pages on how fast you’re consuming the budget (multi-window/multi-burn-rate), so you alert on user-impacting harm rather than raw thresholds — far less alert fatigue.

  13. What is cardinality and why does it matter? Cardinality is the number of distinct time series (each unique metric-name + label-set combination). High-cardinality labels (user/request IDs, raw URLs) multiply series, ballooning Prometheus memory and storage — the top cause of OOMs. Keep label values bounded.

  14. Where does Prometheus store data and how do you keep it long-term / highly available? A local TSDB (WAL → 2h blocks → compaction), retained by time/size, not replicated. For long-term retention and a global, highly available query view, ship the data to Thanos (sidecar upload or remote-write to Thanos Receive), Cortex/Mimir or VictoriaMetrics (remote-write). For HA, run two or more Prometheus replicas scraping the same targets; Thanos deduplicates the series and Alertmanager deduplicates the alerts.
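Questions 4 and 5 rest on one mechanical fact: bucket counts can be added across replicas before the quantile is estimated. The sketch below mimics, in plain awk, the linear interpolation that histogram_quantile() applies to classic buckets. The function name and the sample counts are made up for illustration, and it assumes a non-empty histogram with ascending bounds ending in +Inf.

```shell
# bucket_quantile Q: reads "upper_bound cumulative_count" lines (ascending, last bound +Inf)
# and prints the estimated Q-quantile by interpolating linearly inside the target bucket.
bucket_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; cnt[NR] = $2 }
    END {
      rank = q * cnt[NR]                       # position of the quantile among all observations
      for (i = 1; i < NR; i++) if (cnt[i] >= rank) break
      if (i == NR) { print le[NR - 1]; exit }  # lands in +Inf: report the highest finite bound
      lo = (i > 1) ? le[i - 1] : 0             # lower edge of the bucket (0 for the first one)
      prev = (i > 1) ? cnt[i - 1] : 0          # observations below this bucket
      printf "%.3f\n", lo + (le[i] - lo) * (rank - prev) / (cnt[i] - prev)
    }'
}

# Sample cumulative counts, as if already summed by le across all pods.
printf '%s\n' '0.1 50' '0.5 90' '1 99' '+Inf 100' | bucket_quantile 0.95   # prints 0.778
```

A summary offers no equivalent step, because each pod exports only its finished quantile and the underlying counts are gone.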

Quick check

  1. Which component powers kubectl top and is the default source for the HPA?
  2. A counter resets to 0 when its process restarts. Which PromQL function should you use so this reset doesn’t produce a misleading graph?
  3. You need a single cluster-wide p99 latency aggregated across 10 pods. Do you instrument with a histogram or a summary, and why?
  4. What does kube-state-metrics expose that node-exporter does not?
  5. Your ServiceMonitor exists but no target shows up in Prometheus. Name the two most likely causes.

Answers

  1. metrics-server (serving the metrics.k8s.io Metrics API).
  2. rate() (or increase()) — both detect and correct counter resets; never graph a raw counter.
  3. A histogram — summary quantiles are computed client-side and cannot be aggregated across instances, whereas histogram buckets can be summed and fed to histogram_quantile().
  4. The state of Kubernetes objects (replica counts, pod phases, restart counts, node conditions). node-exporter exposes host OS metrics (CPU/mem/disk/net), not object state.
  5. (a) The ServiceMonitor’s labels don’t match the Prometheus instance’s serviceMonitorSelector (so it’s ignored); (b) the endpoint port is a number instead of the Service port name (or the Service doesn’t expose /metrics).
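The reset handling in answer 2 can be sketched in a few lines of awk. This is a simplification under stated assumptions: it treats any drop in value as a restart from 0 and ignores the window-edge extrapolation that the real rate() and increase() perform. The function name and sample values are hypothetical.

```shell
# counter_increase: reads one counter sample per line and prints the total increase,
# counting the post-reset value itself as growth whenever the counter drops.
counter_increase() {
  awk 'NR == 1 { prev = $1; next }
       { inc += ($1 >= prev) ? $1 - prev : $1; prev = $1 }
       END { print inc + 0 }'
}

# The counter restarts between 160 and 5; the raw difference 35 - 100 would be negative.
printf '%s\n' 100 130 160 5 35 | counter_increase   # prints 95
```

Dividing that increase by the window length in seconds gives the per-second figure that rate() reports, up to extrapolation.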

Exercise

On a fresh kind cluster:

  1. Install metrics-server and confirm kubectl top nodes and kubectl top pods -A return data.
  2. Install kube-prometheus-stack via Helm and confirm every target under Status → Targets is UP.
  3. Deploy any app that exposes /metrics, expose it with a Service, and add a ServiceMonitor so Prometheus scrapes it. Prove the target appears.
  4. In Prometheus, write three queries: (a) the per-second request rate of your app’s HTTP counter using rate(); (b) the p95 latency using histogram_quantile() over the bucket metric; (c) the number of Running pods in your namespace using a kube-state-metrics gauge such as kube_pod_status_phase.
  5. Create a PrometheusRule that fires when your app’s 5xx error ratio exceeds 1% for 5 minutes. Trigger it (e.g. by stopping the backend) and watch it move pending → firing in the Alerts tab.
  6. In Grafana, build a small RED dashboard (Rate, Errors, Duration) for your app using template variable $namespace.
  7. Configure Alertmanager to group_by: [alertname, namespace] and route severity="page" to one receiver and everything else to another. Add a silence for your test alert and confirm it stops notifying while still firing.
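As a starting point for step 5, an error-ratio rule could look like the sketch below. The metric name http_requests_total, its code label and job="my-app" are assumptions about how your app is instrumented, and the rule, alert and file names are hypothetical; adjust them to what your /metrics endpoint actually exposes. The block only writes a local file.

```shell
# Writes an illustrative 5xx error-ratio rule to a local file.
cat > error-ratio-rule.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-error-ratio
  namespace: demo
  labels: { release: kps }
spec:
  groups:
    - name: my-app.errors
      rules:
        - alert: MyAppHighErrorRatio
          # Share of requests answered with a 5xx status over the last 5 minutes.
          expr: |
            sum(rate(http_requests_total{job="my-app", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.01
          for: 5m
          labels: { severity: page }
          annotations: { summary: "More than 1% of my-app requests are failing with 5xx" }
EOF
```

Apply it with kubectl apply -f error-ratio-rule.yaml. The severity: page label ties in with the routing you set up in step 7.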

Write down, in two sentences, the SLI and SLO you would set for this app, and the error budget (in minutes/month) it implies.
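The error-budget arithmetic is small enough to check on the command line. A minimal sketch, assuming the budget is expressed as minutes of total outage over the window (the function name is hypothetical):

```shell
# error_budget_minutes SLO_PERCENT WINDOW_DAYS:
# minutes of full outage the SLO tolerates over the window.
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.9 30   # prints 43.2
error_budget_minutes 99.5 30   # prints 216.0
```

Going from 99.5% to 99.9% cuts the budget by a factor of five, which is why the SLO is worth choosing deliberately.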

Certification mapping

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
