
Advanced Google Cloud Troubleshooting: Complex Multi-Service Incidents & RCA

Single-service troubleshooting — the kind in the previous lesson, where a VM won’t boot or a firewall rule blocks a port — is the easy half of the job. The hard half is the 02:00 page that says checkout latency is up 400% and the error rate is climbing, where nothing is obviously broken and the cause turns out to be a quota you didn’t know you were near, a zone Google is draining for maintenance, or an Org Policy a platform team applied at the folder level six hours ago. Complex incidents are systemic: a small change or a partial failure in one service propagates through dependencies and shows up as a symptom three layers away from the cause. You cannot grep your way out of these. You need a repeatable incident-response lifecycle, the discipline to correlate signals across Cloud Monitoring, Cloud Logging, Cloud Trace and Error Reporting, and — afterwards — a blameless postmortem that turns the pain into permanent prevention.

This lesson is the senior on-call’s playbook. It builds the incident lifecycle, shows you how to correlate across the four pillars of Google Cloud observability plus Personalised Service Health, then works five genuinely complex multi-service scenarios as symptom → hypotheses → cross-service diagnosis → fix. It closes with how to write a blameless postmortem and drive corrective and preventive actions (CAPA) so the same incident never recurs. The goal is that you can walk into a war room, take the role of Incident Commander, and run it calmly.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites

You should be comfortable with everything in the previous lesson, Google Cloud Troubleshooting Playbooks: IAM, VPC, Compute, Cloud SQL & GKE — the reproduce → isolate-the-layer → inspect-logs method, and the per-service diagnostic steps. You should know the resource hierarchy (Organization → Folders → Projects), IAM allow/deny policies, VPC and load-balancing basics, and how to read Cloud Logging. This is the Advanced rung of the Troubleshooting module in the Google Cloud Zero-to-Hero course: where the previous lesson localises a fault within a service, this one localises a fault across services and wraps it in incident management. A $300 free-trial or any project with Owner/Editor and the Monitoring and Logging APIs enabled is enough for the lab.

Core concepts: the incident-response lifecycle

An incident is an unplanned disruption or degradation that needs an urgent, coordinated response — distinct from a routine bug. The single most important mindset shift from single-service debugging is this: in an incident, mitigation comes before root cause. Stop the bleeding for users first; understand exactly why later. A junior engineer wants to find the bug; the Incident Commander wants the error rate back to baseline, by rollback or failover or feature-flag, even if the why is still unknown.

The lifecycle has six phases. They overlap in practice — you communicate continuously, and you start gathering RCA evidence the moment you detect — but naming them keeps a chaotic war room organised.

| Phase | Goal | Key activities | Google Cloud tooling |
|---|---|---|---|
| Detect | Know something is wrong, fast | Alerting policies on SLO burn rate, uptime checks, anomaly detection, user reports | Cloud Monitoring alerting, uptime checks, SLO burn-rate alerts |
| Triage | Assess severity & scope; assign roles | Declare incident, set severity (SEV), name an Incident Commander (IC), scribe, comms lead | Incident doc/template; severity matrix |
| Communicate | Keep stakeholders informed | Status updates on a cadence; single source of truth; internal + customer comms | Status page, chat channel, incident doc |
| Mitigate | Restore service (not necessarily fix the cause) | Rollback, failover, scale up, raise quota, disable a feature flag, drain a zone | Cloud Deploy rollback, traffic splitting, MIG/Cloud Run revisions |
| RCA | Understand why it happened | Correlate signals, build a timeline, identify contributing factors | Logs Explorer, Cloud Trace, Error Reporting, Cloud Monitoring, Audit Logs |
| Prevent | Stop recurrence | Blameless postmortem, CAPA with owners/dates, alerting & guardrail improvements | Postmortem doc, issue tracker, SLOs, Org Policy |
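
The SLO burn-rate alerts in the Detect row rest on a small piece of arithmetic: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A minimal sketch in POSIX shell, assuming a 99.9% availability SLO; the function name and the request counts are hypothetical and not taken from any Google Cloud tool:

```shell
# burn_rate BAD_REQUESTS TOTAL_REQUESTS SLO_TARGET
# Prints how many times faster than "exactly on budget" the error budget is burning.
burn_rate() {
  awk -v bad="$1" -v total="$2" -v slo="$3" 'BEGIN {
    if (total == 0) { print "0.00"; exit }   # no traffic, nothing burned
    budget = 1 - slo                         # allowed error ratio, e.g. 0.001 for 99.9%
    printf "%.2f\n", (bad / total) / budget
  }'
}

burn_rate 180 1000 0.999   # 18% errors against a 99.9% SLO
burn_rate 1 1000 0.999     # errors exactly at the budget
```

A burn rate of 1 spends the budget exactly over the SLO window; sustained values far above 1 are what a fast-burn alert is meant to catch.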

Severity (SEV) levels drive how loudly you respond. A workable default:

| Severity | Meaning | Example | Response |
|---|---|---|---|
| SEV1 | Critical, business-wide, customer-facing outage | Checkout down globally; data-loss risk | All-hands war room, IC + comms, exec notified, customer status page |
| SEV2 | Major degradation, significant subset of users | One region down; checkout slow but working | War room, IC, status updates every 30 min |
| SEV3 | Minor / partial, workaround exists | One non-critical feature failing | On-call handles; no war room; tracked |
| SEV4 | Negligible user impact | Elevated error rate within SLO budget | Ticket; fix in business hours |

Roles matter more than people. The Incident Commander owns the response (not the fix — they coordinate, decide, and delegate); the Operations/Subject lead drives the technical investigation; the Communications lead owns stakeholder and customer updates; the Scribe keeps a timestamped log. In a small team one person may wear several hats, but the IC role should always be explicitly held by someone, and that someone should not also be heads-down in a terminal.

The IC’s job is to be slightly bored. If the Incident Commander is frantically typing gcloud commands, nobody is steering. The IC asks “what’s our current hypothesis, what’s our mitigation, and who owns it?” — and keeps the room from chasing five theories at once.

A few load-bearing principles:

Correlating signals across services

The skill that separates senior on-call from junior is correlation: taking a symptom in one service and walking the dependency graph through telemetry to the cause in another. Google Cloud gives you four primary observability pillars plus two incident-specific tools. Know what each is for — using the wrong one wastes precious minutes.

| Tool | Answers the question | Use it when | Notes |
|---|---|---|---|
| Cloud Monitoring (metrics, dashboards, alerting, SLOs) | What is wrong and how bad: the shape of the problem over time | Always start here: latency, error rate, saturation, traffic (the four golden signals) | MQL/PromQL; build a war-room dashboard ahead of time |
| Logs Explorer / Log Analytics | Why a specific request or component failed; who did what | After metrics localise the layer; for exact error strings and Audit Logs | Log Analytics runs SQL (BigQuery) over logs for ad-hoc joins |
| Cloud Trace | Where latency is spent across a distributed request | Latency incidents; “the API is slow but which hop?” | Distributed tracing; shows the span that blew the budget |
| Error Reporting | Which error is new or spiking, grouped & deduplicated | A flood of exceptions; “is this error new since the deploy?” | Auto-groups stack traces; links to first/last seen |
| Personalised Service Health (PSH) | Is Google having an incident affecting my projects? | Before you blame your own code, to rule out the platform | Project-scoped; relevance-filtered; has an API & alerts |
| Active Assist / Recommender | Are there pre-incident risks (quota, reliability, security)? | Proactively, and in RCA to spot the warning you missed | Quota recommendations, unattended-project, IAM insights |

The correlation workflow in practice:

  1. Anchor on the timeline. Note the exact start time from the alert or the first metric inflection. Almost every incident correlates to something that changed — a deploy, a config push, a traffic spike, or a Google maintenance event. Hold that timestamp in your head; everything you look at, you line up against it.
  2. Start with the golden signals in Cloud Monitoring: latency, traffic, errors, saturation. The pattern tells you the class of problem. Errors up + latency up + traffic flat ⇒ a dependency is failing. Latency up + saturation up + traffic up ⇒ you’re overloaded. Errors up + traffic down ⇒ something upstream (DNS, LB, cert) is stopping requests reaching you.
  3. Rule out Google first with Personalised Service Health. If GCP is having a regional incident, your mitigation is failover, not debug.
  4. Localise the layer, then switch tools: Trace for where the latency is, Error Reporting for which exception, Logs Explorer for the exact failing request and the Audit Logs of who changed what.
  5. Find the change. Query Admin Activity audit logs around the start time — a config change, IAM binding, Org Policy edit, or quota change is the cause more often than a code bug.
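
The pattern-matching in step 2 can be written down as a tiny lookup, which is a handy thing to keep in a runbook. This is only a sketch: the function name and the up/flat/down vocabulary are hypothetical, and real signals are rarely this clean.

```shell
# classify_signals ERRORS LATENCY TRAFFIC SATURATION
# Each argument is one of: up, flat, down (trend since the incident start time).
classify_signals() {
  case "$1:$2:$3:$4" in
    up:*:down:*)   echo "upstream-blocked" ;;       # DNS, LB or cert is stopping requests
    *:up:up:up)    echo "overload" ;;               # more traffic than capacity
    up:up:flat:*)  echo "dependency-or-ceiling" ;;  # failing dependency or a limit was hit
    *)             echo "unclassified" ;;           # no rule fits, triage manually
  esac
}

classify_signals up up flat up      # prints: dependency-or-ceiling
classify_signals flat up up up      # prints: overload
classify_signals up flat down flat  # prints: upstream-blocked
```

The fall-through case matters as much as the matches: when no rule fits, the output says so rather than guessing a class.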

A correlation query you will use constantly — what changed in the minutes before the incident — joins the timeline to the Admin Activity audit log:

# Who changed what in this project in the incident window?
gcloud logging read '
  logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
  AND timestamp>="2026-06-15T01:50:00Z"
  AND timestamp<="2026-06-15T02:10:00Z"
' --project=PROJECT_ID --order=asc \
  --format='table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.methodName, resource.type)'

In Log Analytics the same idea becomes a SQL query you can join, aggregate and pivot — invaluable when you need to count errors by service or correlate two log sources:

SELECT
  timestamp,
  proto_payload.audit_log.authentication_info.principal_email AS who,
  proto_payload.audit_log.method_name AS action,
  resource.type AS resource
FROM `PROJECT_ID.global._Default._AllLogs`
WHERE log_id = "cloudaudit.googleapis.com/activity"
  AND timestamp BETWEEN "2026-06-15 01:50:00 UTC" AND "2026-06-15 02:10:00 UTC"
ORDER BY timestamp;

[Figure: Google Cloud complex incident response & RCA]

The diagram above maps the incident lifecycle across the top and, beneath it, shows how a symptom at the load balancer is traced back through Monitoring, Trace, Error Reporting and Audit Logs to a contributing change — the mental model for every scenario that follows.

Scenario A — Cascading failure from quota exhaustion

Symptom. At 02:00 the checkout SLO burn-rate alert fires. Cloud Monitoring shows backend 5xx climbing from 0.1% to 18% over ten minutes; p99 latency has tripled. Traffic is slightly elevated (a marketing email went out) but not extreme. The frontend Cloud Run service is throwing 503s and the GKE payment service shows pods crash-looping. Nothing was deployed.

Hypotheses.

  1. A downstream dependency (Cloud SQL, an external payment API) is failing and back-pressure is cascading upward.
  2. A quota or limit has been hit — Cloud Run max instances, Cloud SQL connections, a Compute Engine CPU quota blocking autoscale, or an API rate limit (429s).
  3. A noisy-neighbour or memory leak is OOM-killing pods, and retries are amplifying load (a retry storm).
  4. Google is having an incident (rule out with PSH).

Cross-service diagnosis. PSH is green, so it’s us. The shape (errors and latency up, traffic only mildly up) says “we hit a ceiling”, not “we got flooded”. The cascade pattern (frontend 503s and payment pods crashing at the same time) points to a shared resource. Walk the dependency graph downward:

# Connection saturation over the incident window
gcloud monitoring time-series list \
  --project=PROJECT_ID \
  --filter='metric.type="cloudsql.googleapis.com/database/network/connections"
            AND resource.labels.database_id="PROJECT_ID:checkout-db"' \
  --interval-start-time=2026-06-15T01:50:00Z \
  --interval-end-time=2026-06-15T02:15:00Z

Fix (mitigate first, then prevent).

# Throttle the fan-out so pools stop multiplying past the DB limit
gcloud run services update checkout-frontend \
  --region=europe-west1 --max-instances=20 --concurrency=80

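# Give the DB more slots if the tier has the RAM for it.
# Note: --database-flags replaces the instance's whole flag list, so include any
# flags already set; on some engines changing max_connections restarts the instance.
gcloud sql instances patch checkout-db \
  --database-flags=max_connections=400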
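
Capping instances helps because of multiplication: each instance holds its own connection pool, so worst-case demand on the database is instances times pool size. A rough sketch of that check, assuming a pool of 10 connections per instance and an unthrottled ceiling of 100 instances (both hypothetical figures; only the 20-instance cap and the 400-connection limit come from the commands above):

```shell
# pool_headroom MAX_INSTANCES POOL_PER_INSTANCE DB_MAX_CONNECTIONS
# Compares worst-case connection demand with the database connection limit.
pool_headroom() {
  demand=$(( $1 * $2 ))
  if [ "$demand" -le "$3" ]; then
    echo "ok: worst case $demand of $3 connections"
  else
    echo "over: worst case $demand exceeds $3 connections"
  fi
}

pool_headroom 100 10 400   # prints: over: worst case 1000 exceeds 400 connections
pool_headroom 20 10 400    # prints: ok: worst case 200 of 400 connections
```

Running the same arithmetic before an incident, with your real pool size, shows how close normal autoscaling already sits to the limit.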

Scenario B — Zonal/regional outage and failover

Symptom. The uptime check for app.example.com flaps, then goes hard red for one region. p99 latency in europe-west1 spikes; a third of GKE nodes show NotReady; some Compute Engine VMs are unreachable. Your other region, europe-west4, looks healthy. No deploys, no config changes.

Hypotheses.

  1. A zonal failure within the region (one zone’s hardware/power/network) — a subset of nodes and VMs in europe-west1-b.
  2. A regional Google Cloud incident.
  3. A self-inflicted regional problem (a bad config that happens to be regional, a regional quota).

Cross-service diagnosis. This is the scenario where checking Personalised Service Health first and fast matters most: if it’s Google, no amount of debugging your code helps, and the correct action is failover.

# Is Google having an incident affecting MY project right now?
gcloud beta service-health events list \
  --project=PROJECT_ID \
  --filter='state="ACTIVE"' \
  --format='table(title, category, state, relevance, updateTime)'
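
If PSH shows nothing relevant, the zonal and regional hypotheses can be told apart by grouping unhealthy nodes by zone. The sketch below assumes you have already reduced your node or VM listing to "zone state" pairs, one per line; the function name and the sample data are hypothetical.

```shell
# blast_scope: reads "ZONE STATE" lines on stdin and reports how widely the failure spreads.
blast_scope() {
  awk '
    { total[$1] = 1 }                  # every zone that has at least one node
    $2 != "Ready" { bad[$1] = 1 }      # zones with at least one unhealthy node
    END {
      nbad = 0; ntotal = 0
      for (z in total) ntotal++
      for (z in bad) nbad++
      if (nbad == 0)           print "healthy"
      else if (nbad == 1)      { for (z in bad) print "zonal: " z }
      else if (nbad == ntotal) print "regional: all " ntotal " zones affected"
      else                     print "multi-zone: " nbad " of " ntotal " zones affected"
    }'
}

printf '%s\n' \
  "europe-west1-b NotReady" \
  "europe-west1-b NotReady" \
  "europe-west1-c Ready" \
  "europe-west1-d Ready" | blast_scope   # prints: zonal: europe-west1-b
```

For GKE, a listing such as `kubectl get nodes -L topology.kubernetes.io/zone` carries the zone and status columns this expects, once reshaped into pairs.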

Fix.

gcloud compute backend-services get-health checkout-backend --global \
  --format='table(status.healthStatus[].instance, status.healthStatus[].healthState)'
# Failover: send all traffic to the healthy region's backend
gcloud compute backend-services update-backend checkout-backend --global \
  --instance-group=ig-europe-west4 --instance-group-zone=europe-west4-a \
  --capacity-scaler=1.0
gcloud compute backend-services update-backend checkout-backend --global \
  --instance-group=ig-europe-west1 --instance-group-zone=europe-west1-b \
  --capacity-scaler=0.0

Scenario C — Org Policy / VPC Service Controls blast radius

Symptom. Across several projects at once, new deployments fail and some running workloads start throwing permission or connectivity errors. A data pipeline can no longer write to Cloud Storage; a Cloud Function deploy fails with a policy error; a VM creation is rejected. The errors are diverse but they all started at roughly the same minute, and they span project boundaries — which no single app bug ever does.

Hypotheses.

  1. An Org Policy constraint was changed at the organization or folder level (e.g. constraints/gcp.resourceLocations, iam.disableServiceAccountKeyCreation, compute.vmExternalIpAccess) and the constraint inherits down to all child projects.
  2. A VPC Service Controls perimeter change — a project added to a perimeter, a restricted service added, or an ingress/egress rule tightened — is now blocking managed-API calls that used to work.
  3. A broad IAM change at a high level removed a role (covered more in Scenario D, but it can co-occur).

Cross-service diagnosis. The tell is breadth across projects + simultaneity. That can only come from something above the project: the resource hierarchy. Go straight to the Organization/folder Audit Logs, not the project’s app logs.

# Org-level: who changed an Org Policy or a VPC-SC perimeter?
gcloud logging read '
  (protoPayload.methodName="google.cloud.orgpolicy.v2.OrgPolicy.UpdatePolicy"
   OR protoPayload.methodName:"accesscontextmanager")
  AND timestamp>="2026-06-15T08:00:00Z"
' --organization=ORG_ID --order=asc \
  --format='table(timestamp, protoPayload.authenticationInfo.principalEmail,
                  protoPayload.methodName, resource.type)'
gcloud logging read '
  protoPayload.status.details.violations.type="VPC_SERVICE_CONTROLS"
  AND timestamp>="2026-06-15T08:00:00Z"
' --project=PROJECT_ID --order=asc \
  --format='table(timestamp, protoPayload.methodName,
                  protoPayload.metadata.vpcServiceControlsUniqueIdentifier)'
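
The breadth-plus-simultaneity tell can also be checked mechanically before anyone opens the org-level logs. A minimal sketch, assuming you have extracted the minute of each project's first error into "minute project" pairs; the function name, the threshold of three projects, and the sample data are all hypothetical:

```shell
# above_project_suspect [MIN_PROJECTS]
# Reads "MINUTE PROJECT" lines on stdin and flags minutes where many projects first failed.
above_project_suspect() {
  awk -v min_projects="${1:-3}" '
    !seen[$1, $2]++ { projects[$1]++ }   # count each project once per minute
    END {
      found = 0
      for (minute in projects)
        if (projects[minute] >= min_projects) {
          print "look above the project: " projects[minute] " projects first failed at " minute
          found = 1
        }
      if (!found) print "no cross-project cluster found"
    }'
}

printf '%s\n' \
  "08:14 data-pipeline" \
  "08:14 web-frontend" \
  "08:14 batch-jobs" \
  "08:14 web-frontend" \
  "09:02 analytics" | above_project_suspect 3
# prints: look above the project: 3 projects first failed at 08:14
```

The minute it reports is the timestamp to anchor the organization-level audit log query on.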

Fix.

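# Reset the folder's policy to the constraint's default behaviour (un-break children).
# To drop the folder-level policy so it inherits from the parent instead, use `gcloud org-policies delete`.
gcloud org-policies reset constraints/gcp.resourceLocations \
  --folder=FOLDER_ID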

# Or restore from a saved policy file you exported before changes
gcloud org-policies set-policy /path/to/previous-policy.yaml

Scenario D — An IAM change that locks out a workload

Symptom. A batch service that has run for months suddenly fails. Cloud Run / GKE logs show PERMISSION_DENIED: caller does not have permission calling Cloud Storage or Pub/Sub. The code didn’t change; the service account didn’t change. It worked yesterday; it fails today.

Hypotheses.

  1. A role binding was removed from the service account (someone “cleaned up” IAM, or an over-broad binding was tightened at the project/folder level and the SA lost an inherited role).
  2. A deny policy or IAM Condition now matches and blocks the call (a deny overrides any allow; a time- or resource-bound condition stopped matching).
  3. A service account key was disabled/deleted or expired, or the SA itself was disabled.
  4. A higher-level (org/folder) policy change removed the role via inheritance (overlaps with Scenario C).
  5. Workload Identity Federation / Workload Identity mapping broke (wrong iam.serviceAccounts.actAs / roles/iam.workloadIdentityUser binding).

Cross-service diagnosis. The symptom is PERMISSION_DENIED from a previously-working identity, so the cause is an identity or policy change, not code. Two tools do the work: the Policy Troubleshooter (does this principal have this permission on this resource right now, and why/why not) and the Audit Logs (what changed).

gcloud policy-troubleshoot iam \
  //storage.googleapis.com/projects/_/buckets/batch-output \
  --principal-email=batch-runner@PROJECT_ID.iam.gserviceaccount.com \
  --permission=storage.objects.create

It returns GRANTED/NOT_GRANTED and the bindings it evaluated, including which deny rule or missing role is responsible — it walks inheritance for you.
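
The evaluation order behind that answer is worth internalising, because it explains why adding a role does nothing when a deny rule matches. A simplified sketch of the decision, assuming you already know how many deny rules and allow bindings match the call; the function name is hypothetical and this is a mental model, not the Policy Troubleshooter's actual implementation:

```shell
# evaluate_access MATCHING_DENY_RULES MATCHING_ALLOW_BINDINGS
# Deny is checked first; allow only matters if no deny rule matches.
evaluate_access() {
  if [ "$1" -gt 0 ]; then
    echo "DENIED: a deny rule matches and overrides any allow"
  elif [ "$2" -gt 0 ]; then
    echo "GRANTED: an allow binding carries the permission"
  else
    echo "DENIED: no allow binding grants the permission"
  fi
}

evaluate_access 0 0   # hypothesis 1: the role binding was removed
evaluate_access 1 2   # hypothesis 2: a deny policy now matches
evaluate_access 0 1   # healthy state
```

The first two cases produce the same PERMISSION_DENIED for the workload but need different fixes, which is why the diagnosis checks deny policies explicitly.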

gcloud logging read '
  (protoPayload.methodName="SetIamPolicy"
   OR protoPayload.methodName:"DenyPolicy"
   OR protoPayload.methodName:"DisableServiceAccount")
  AND (protoPayload.request.policy.bindings.members:"batch-runner@PROJECT_ID.iam.gserviceaccount.com"
       OR protoPayload.resourceName:"batch-runner")
  AND timestamp>="2026-06-14T00:00:00Z"
' --project=PROJECT_ID --order=asc \
  --format='table(timestamp, protoPayload.authenticationInfo.principalEmail,
                  protoPayload.methodName, protoPayload.resourceName)'
gcloud iam policies list --attachment-point=cloudresourcemanager.googleapis.com/projects/PROJECT_ID \
  --kind=denypolicies
gcloud iam service-accounts describe batch-runner@PROJECT_ID.iam.gserviceaccount.com \
  --format='value(disabled)'

Fix.

# Re-grant the exact role the Troubleshooter flagged as missing
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:batch-runner@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin" --condition=None

Scenario E — A Cloud DNS / load-balancer / certificate failure

Symptom. Users report app.example.com is down or showing a browser TLS warning (NET::ERR_CERT_… / SSL_ERROR). Crucially, your backend metrics look healthy: low error rate, normal latency, but reduced traffic. The uptime check from outside is red; internal health checks are green. Requests aren’t reaching the backends.

Hypotheses.

  1. DNS: the record for app.example.com is wrong, missing, points at the old/changed LB IP, or a DNSSEC/delegation problem; or a Cloud DNS response policy is intercepting it.
  2. Load balancer: the forwarding rule / target proxy / URL map is misconfigured, the LB IP changed, or all backends are marked unhealthy by the health check (so the LB has nothing to send to — returns 502/503).
  3. Certificate: the Google-managed SSL cert is not yet ACTIVE, typically with a domain status of FAILED_NOT_VISIBLE (domain-validation DNS not in place), or a self-managed cert expired, or the cert doesn’t cover the SNI hostname.

Cross-service diagnosis. The signature — users can’t connect but backends are healthy and traffic is down — means the failure is in the front door, above your code. Work the request path from the outside in: DNS → LB front end → certificate → backends.

dig +short app.example.com            # what users get
gcloud compute forwarding-rules list --global \
  --format='table(name, IPAddress, target)'   # what the LB actually is
gcloud dns record-sets list --zone=example-zone --name=app.example.com.

If they differ, DNS is the cause (a record was edited, or the LB IP wasn’t reserved as static and changed on recreate).
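
That comparison is easy to get wrong by eye at 02:00, so it can be worth scripting. A minimal sketch, assuming the load balancer's address and the resolver's answers have already been captured as plain strings; the function name and the documentation-range IP addresses are hypothetical:

```shell
# dns_matches_lb LB_IP RESOLVED_IP...
# Succeeds if any address returned by DNS equals the load balancer's address.
dns_matches_lb() {
  lb_ip="$1"; shift
  for ip in "$@"; do
    [ "$ip" = "$lb_ip" ] && { echo "match: DNS points at the LB ($lb_ip)"; return 0; }
  done
  echo "mismatch: DNS returns '${*:-nothing}', LB is $lb_ip"
  return 1
}

dns_matches_lb 203.0.113.10 203.0.113.10
dns_matches_lb 203.0.113.10 198.51.100.7 || echo "fix the A record or the LB address"
```

In practice the first argument would be the forwarding rule's IPAddress and the rest the unquoted output of `dig +short`; an empty DNS answer is reported as a mismatch too.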

gcloud compute ssl-certificates describe app-cert --global \
  --format='value(managed.status, managed.domainStatus, expireTime)'

FAILED_NOT_VISIBLE almost always means the DNS A/AAAA record for the domain isn’t pointing at this LB yet, so Google can’t complete domain validation — which ties DNS and cert together. A self-managed cert past expireTime is the other classic.

gcloud compute backend-services get-health app-backend --global
gcloud compute url-maps describe app-urlmap --global   # path/host rules correct?

A misconfigured health-check path (checking / when the app only answers /healthz) marks all backends down — the single most common “LB returns 502 but the app is healthy” cause.

Fix.

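# Reserve the LB's current IP as static so it never changes again, then point DNS at it.
# LB_IP is the forwarding rule's in-use address; --addresses promotes it instead of allocating a new one.
gcloud compute addresses create app-lb-ip --global --addresses=LB_IP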
gcloud dns record-sets update app.example.com. --zone=example-zone \
  --type=A --ttl=300 --rrdatas="$(gcloud compute addresses describe app-lb-ip --global --format='value(address)')"

# Fix the health check that was marking everything unhealthy
gcloud compute health-checks update http app-hc --request-path=/healthz --port=8080

Hands-on lab: build the war-room signals and find a planted change

You’ll stand up the observability you need before an incident, then practise the correlation move — using Audit Logs to find a change you plant yourself. Free-tier friendly; Cloud Monitoring and Logging have generous free allotments.

1. Set up.

export PROJECT_ID="$(gcloud config get-value project)"
gcloud services enable monitoring.googleapis.com logging.googleapis.com \
  cloudtrace.googleapis.com clouderrorreporting.googleapis.com \
  --project="$PROJECT_ID"

2. Create an uptime check + alerting policy (the detect phase). Create an uptime check on any public endpoint you own, then an alert policy that pages on failure. Via CLI you can create the alert from a JSON/YAML policy; the console wizard is fine too. Expected: the policy shows in gcloud alpha monitoring policies list.

gcloud alpha monitoring policies list --format='table(displayName, enabled, conditions[].displayName)'

3. Build a log-based metric for errors (a war-room signal). Count severity≥ERROR log entries so you can alert on an error spike:

gcloud logging metrics create app_error_count \
  --description="App errors (severity>=ERROR)" \
  --log-filter='severity>=ERROR'
gcloud logging metrics describe app_error_count   # validation

4. Plant a change and find it (the correlation drill). Make a deliberate, harmless IAM change, note the time, then hunt it in the Audit Logs exactly as you would in an incident:

# The "change": grant a trivial role to a test SA (it is removed again in Cleanup)
TS_BEFORE="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:$(gcloud iam service-accounts list --format='value(email)' --limit=1)" \
  --role="roles/browser" --condition=None >/dev/null

# Now find it — who did what, since TS_BEFORE
gcloud logging read "
  protoPayload.methodName=\"SetIamPolicy\"
  AND timestamp>=\"$TS_BEFORE\"
" --project="$PROJECT_ID" --order=asc \
  --format='table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.methodName)'

Expected output: a row showing your principal email and SetIamPolicy at the time you ran it. That is the exact muscle you use in Scenarios C and D to find the contributing change.

5. Explore the four pillars. Open Logs Explorer (run severity>=ERROR over the last hour), Cloud Trace (if you have a traced service), Error Reporting (it auto-populates from logged exceptions), and Personalised Service Health (Console → Service Health) to see whether any Google incident is active for the project.

Cleanup

# Remove the planted binding and the lab artefacts
gcloud projects remove-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:$(gcloud iam service-accounts list --format='value(email)' --limit=1)" \
  --role="roles/browser" --condition=None
gcloud logging metrics delete app_error_count --quiet
# Delete the uptime check + alert policy from the Console (Monitoring → Uptime / Alerting)

Cost note

Cloud Logging includes the first 50 GiB of ingestion per project per month free; Cloud Monitoring metrics and the first uptime checks are free within generous limits; Cloud Trace and Error Reporting have free monthly allotments. This lab stays comfortably inside the free tier — the only thing that ever surprises people is log ingestion volume at scale, so set a logs exclusion filter for chatty, low-value logs and a budget alert in real projects.

Common mistakes & troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Chasing the root cause while users stay down | Treating an incident like a debugging session | Mitigate first (rollback/failover/scale), RCA after; the IC enforces this |
| Five people, five theories, no progress | No Incident Commander; no single source of truth | Name an IC who coordinates (doesn’t type); one incident doc with the timeline |
| Spending an hour on your code, then finding it was Google | Didn’t check Personalised Service Health early | Rule out the platform first for any broad/regional symptom |
| “Nothing changed” but everything broke at once | A config/Org Policy/IAM change you didn’t make | Query Admin Activity Audit Logs in the incident window to find the change |
| Errors that span multiple projects blamed on one app | App bugs don’t cross project boundaries | Breadth + simultaneity ⇒ look above the project (org/folder policy, VPC-SC) |
| LB returns 502/503 but the app is healthy | Health-check path/port wrong ⇒ all backends marked unhealthy | Fix the health-check probe to hit a real healthy endpoint |
| A retry made the outage worse | Retry storm amplifying load on a failing dependency | Add jittered backoff + circuit breakers; cap concurrency |
| Postmortem names one person as “the cause” | Blameful culture; stops at human error | Blameless framing: ask why the system let the mistake cause an outage |
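
The jittered backoff named in the retry-storm row can be sketched in a few lines. This version uses "full jitter" (a random delay between zero and an exponentially growing, capped ceiling); the function names, the 1-second base and the 30-second cap are assumptions for illustration, and a production client would normally use its library's built-in retry policy instead.

```shell
# backoff_ceiling ATTEMPT BASE_SECONDS CAP_SECONDS
# Upper bound for the delay before retry number ATTEMPT: BASE * 2^ATTEMPT, capped.
backoff_ceiling() {
  ceiling="$2"
  i=0
  while [ "$i" -lt "$1" ] && [ "$ceiling" -lt "$3" ]; do
    ceiling=$(( ceiling * 2 ))
    i=$(( i + 1 ))
  done
  [ "$ceiling" -gt "$3" ] && ceiling="$3"
  echo "$ceiling"
}

# jittered_delay ATTEMPT BASE_SECONDS CAP_SECONDS
# Random whole number of seconds in [0, ceiling], seeded from the PID and the clock.
jittered_delay() {
  max=$(backoff_ceiling "$1" "$2" "$3")
  awk -v max="$max" -v seed="$$$(date +%s)" \
    'BEGIN { srand(seed); print int(rand() * (max + 1)) }'
}

# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping between tries.
retry_with_backoff() {
  attempts="$1"; shift
  n=0
  while :; do
    "$@" && return 0
    n=$(( n + 1 ))
    [ "$n" -ge "$attempts" ] && return 1
    sleep "$(jittered_delay "$n" 1 30)"
  done
}

for attempt in 0 1 2 3 4 5; do
  echo "attempt $attempt: wait up to $(backoff_ceiling "$attempt" 1 30)s"
done
```

The randomness is the point: without it, every client that failed together retries together, and the dependency sees the same spike again at each interval.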

Best practices

Security notes

Interview & exam questions

Q1. In an incident, what do you do first — find the root cause or mitigate? Mitigate. Restore service for users (rollback, failover, scale, raise quota, feature-flag) before fully understanding the cause. Root-cause analysis happens after the bleeding stops. Leading with RCA prolongs the outage.

Q2. What does the Incident Commander actually do? Coordinates the response — decides, delegates, time-boxes hypotheses, maintains the single source of truth, and runs comms cadence. The IC does not do the hands-on fixing; they keep the room organised and the team focused on one mitigation at a time.

Q3. Errors and latency are up but traffic is roughly flat. What class of problem is that, versus errors up with traffic up? Flat traffic + rising errors/latency ⇒ you hit a ceiling (quota/connection/limit) or a dependency is failing. Rising traffic + rising errors/latency/saturation ⇒ overload. The traffic shape distinguishes “we broke” from “we got flooded”.

Q4. Which Google Cloud tool tells you whether Google itself is having an incident affecting your project, and why check it early? Personalised Service Health (PSH) — project-scoped, relevance-filtered, with an API and alerts. Check it early for any broad/regional symptom because if it’s a Google outage, the correct response is failover, not debugging your code.

Q5. Several projects start failing at the same minute with different errors. Where do you look and why? Above the project: the Organization/folder Audit Logs for Org Policy changes (SetOrgPolicy in the legacy API, UpdatePolicy in v2) and VPC Service Controls perimeter changes. Breadth across projects plus simultaneity can’t come from one app bug; it comes from an inherited policy/perimeter change.

Q6. How do you tell an Org Policy denial from a VPC Service Controls denial in the logs? Org Policy reads as Constraint constraints/… violated / FAILED_PRECONDITION on create/modify calls. VPC-SC reads as Request is prohibited by organization's policy with a vpcServiceControlsUniqueIdentifier and a violation of type VPC_SERVICE_CONTROLS. VPC-SC denials are separately queryable by that violation type.

Q7. A service account that worked yesterday now gets PERMISSION_DENIED. Walk through the diagnosis. The cause is an identity/policy change, not code. Use the Policy Troubleshooter to see whether the permission is granted now and which binding/deny is responsible, then query Audit Logs for SetIamPolicy, deny-policy changes, or DisableServiceAccount(Key) in the window. Check for a deny policy and IAM Conditions explicitly, and confirm the SA/key isn’t disabled. Fix with the specific missing role.

Q8. Why roll out VPC Service Controls and Org Policies in dry-run mode first? Dry-run logs what would be blocked without actually blocking it, so you see the full blast radius in the violation logs and fix legitimate flows (ingress/egress rules, access levels) before enforcing. It’s the single best defence against the Scenario-C blast radius.

Q9. Users get a TLS error but your backends are healthy and traffic is down. Where is the fault and how do you confirm? In the front door, above the app: DNS, the load-balancer front end, or the certificate. Confirm by working outside-in — dig the hostname vs the LB’s actual (static) IP, check the managed SSL cert status (ACTIVE vs FAILED_NOT_VISIBLE/expired), then backend health and the URL map. FAILED_NOT_VISIBLE usually means DNS isn’t pointing at the LB so validation can’t complete.

Q10. What is a retry storm and how do you stop one amplifying an incident? When a dependency slows or fails, naive clients retry, multiplying load on the already-struggling dependency and accelerating collapse. Stop it with jittered exponential backoff, retry budgets, and circuit breakers; during the incident, cap concurrency/instances to reduce the fan-out.

Q11. What makes a postmortem “blameless”, and why does it matter? It focuses on why the system allowed a mistake to cause an outage, not who made it — assuming people acted reasonably with the information they had. It matters because blame drives engineers to hide information, which destroys the learning; psychological safety produces honest timelines and real fixes.

Q12. What’s the difference between a contributing factor and a single root cause, and why prefer the former? Complex incidents rarely have one cause; they’re a chain (a latent limit + an amplifier + a trigger, e.g. Scenario A). Listing contributing factors yields multiple independent fixes (alert on saturation and add pooling and add backoff), which is more resilient than fixing one “root cause” and calling it done.

Quick check

  1. True/False: during a SEV1 the Incident Commander should be the one running the most gcloud commands.
  2. You see rising errors and latency with flat traffic. Ceiling/limit problem, or overload?
  3. Which single tool tells you if a Google regional incident is affecting your project?
  4. Errors across five projects starting at the same minute — project app logs, or org/folder Audit Logs?
  5. A managed SSL certificate is stuck at FAILED_NOT_VISIBLE. What’s the most likely cause?

Answers

  1. False. The IC coordinates and decides; they delegate the hands-on work so they can steer.
  2. A ceiling/limit problem (quota, connections, rate limit) or a failing dependency — not overload (which would show rising traffic/saturation).
  3. Personalised Service Health (PSH).
  4. Org/folder Audit Logs — breadth + simultaneity points above the project (Org Policy / VPC-SC).
  5. The domain’s DNS A/AAAA record isn’t pointing at this load balancer, so Google can’t complete domain validation.

Exercise

Take one of your own (or a lab) workloads and write a one-page incident runbook for a single scenario from this lesson — say, regional outage or quota exhaustion. It must contain: (1) the detection signal and the exact alert that should fire; (2) the first three diagnostic commands/queries (including the PSH check and the “what changed” Audit Log query); (3) the mitigation step and how to verify it worked; (4) the rollback/abort criterion (“if X doesn’t improve in N minutes, do Y”). Then run a 30-minute game day: have a colleague inject the failure (drain a zone, lower a quota, point DNS wrong) without telling you the details, and execute your runbook. Afterwards, write a short blameless postmortem — timeline, contributing factors, and three CAPA items with owners and dates. You now have both a tested runbook and a postmortem template.

Certification mapping

This lesson maps primarily to the Professional Cloud DevOps Engineer (PCDE) exam — incident response, SLO/SLI and error budgets, blameless postmortems, monitoring/logging/tracing, and reducing toil are core domains — and to the Professional Cloud Architect (PCA) exam’s reliability, observability and operational-excellence themes (the case studies frequently probe failover, DR and capacity/quota planning). The audit-log forensics and VPC-SC/Org Policy/IAM diagnosis also reinforce material on the Professional Cloud Security Engineer (PCSE) exam.

Glossary

Next steps

You can now run a complex Google Cloud incident end to end and turn it into prevention. Next, step up from operating systems to designing them: The Google Cloud Architecting Ladder: From a Static Site to Multi-Region Global (gcp-architecting-ladder-static-site-to-multi-region) — which teaches the resilient, regional-by-default architectures that make most of the incidents in this lesson impossible in the first place. To go deeper on the security controls that featured here, revisit VPC Service Controls (gcp-vpc-service-controls-perimeters-exfiltration-prevention) and IAM deny policies, conditions & impersonation chains (gcp-iam-deny-policies-conditions-impersonation-chains).
