GCP Landing Zone: Operations & Billing — Cloud Logging Sinks & Buckets, Cloud Monitoring, Billing Export & Budgets, and Org-Wide Observability

Where this fits

By Part 5, the landing zone already has its resource hierarchy (Part 1), identity (Part 2), networking (Part 3), and security and detective controls (Part 4). Operations & Billing is the layer that makes the running estate observable and accountable. Everything underneath generates two firehoses that no enterprise can afford to be blind to: operational telemetry (logs, metrics, traces) and financial telemetry (cost). This phase decides where both firehoses land, who can query them, how long they are retained, and what pages a human when something breaks the budget or the SLO. Because the resource hierarchy lets you aggregate at the Organization node, Operations & Billing is the one phase where you get a genuine single pane of glass (one set of log buckets, one set of monitoring scopes, one BigQuery billing dataset) covering every project the foundation will ever create.

[Figure: Google Cloud Landing Zone Design, animated overview]

Cloud Logging — aggregated sinks and log buckets

What it is

Cloud Logging is GCP’s managed log-management service. Every log entry flows first into the Log Router, which evaluates a set of sinks and decides where each entry goes. A sink is filter → destination: an inclusion filter (and optional exclusion filters) written in the Logging query language, pointing at one of four destination types — a Cloud Logging bucket, a BigQuery dataset, a Cloud Storage bucket, or a Pub/Sub topic. Two sinks exist in every project by default: _Required (admin/audit activity, fixed 400-day retention, cannot be modified or disabled) and _Default (everything else, routed to the _Default log bucket, 30-day retention, editable).

The landing-zone pivot is the aggregated sink: a sink created not on a project but on a folder or the Organization node, with the includeChildren flag set to true, so that it captures logs from every project beneath it and routes them to a central destination. This is the mechanism that turns thousands of per-project log streams into one governed log estate.
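To make the filter-to-destination shape concrete, here is a minimal Python sketch that builds the kind of JSON body a sink definition carries. It only constructs the dictionary and calls no API. The sink, project, and bucket names are hypothetical placeholders, and the audit-log filter is just one plausible inclusion filter.

```python
import json

# Destination URI patterns for the four sink destination types.
DESTINATION_FORMATS = {
    "log_bucket": "logging.googleapis.com/projects/{project}/locations/{location}/buckets/{name}",
    "bigquery": "bigquery.googleapis.com/projects/{project}/datasets/{name}",
    "storage": "storage.googleapis.com/{name}",
    "pubsub": "pubsub.googleapis.com/projects/{project}/topics/{name}",
}


def aggregated_sink_body(sink_name, kind, name, log_filter, project="", location="global"):
    """Build the body of a folder- or organization-level aggregated sink."""
    if kind not in DESTINATION_FORMATS:
        raise ValueError(f"unknown destination type: {kind}")
    destination = DESTINATION_FORMATS[kind].format(
        project=project, location=location, name=name
    )
    return {
        "name": sink_name,
        "destination": destination,
        "filter": log_filter,
        # The flag that makes the sink capture logs from child projects.
        "includeChildren": True,
    }


# Hypothetical names, for illustration only.
audit_sink = aggregated_sink_body(
    sink_name="org-audit-to-central-bucket",
    kind="log_bucket",
    name="audit-bucket",
    project="prj-logging",
    location="asia-south1",
    log_filter='logName:"cloudaudit.googleapis.com"',
)
print(json.dumps(audit_sink, indent=2))
```

In practice the same four fields map onto whatever tool creates the sink (Terraform, gcloud, or the API). Whichever tool you use, the destination then needs an IAM grant for the sink's writer identity before any entries arrive.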

Why it matters

Logs that stay scattered in each team’s _Default bucket are useless for security, compliance, and incident response — an investigator would have to query 70 projects individually, and a team with project-level access could tamper with or delete the evidence of their own actions. Centralizing logs solves three problems at once:

How to do it well

Concrete artifacts, decisions, and tools

| Sink destination | Best for | Retention / query model | Cost note |
|---|---|---|---|
| Cloud Logging bucket (Log Analytics on) | Interactive security search + retention | Custom retention (up to 3,650 days); query via Logs Explorer + BigQuery SQL | Storage billed per GiB beyond the free _Default |
| BigQuery dataset | SQL joins, dashboards, long-term analytics | Table-per-day; standard BQ retention/partition expiry | Pay for streaming insert + BQ storage |
| Cloud Storage bucket | Cheap cold archive, WORM compliance | Lifecycle rules; Bucket Lock for immutability | Cheapest per GB; not queryable directly |
| Pub/Sub topic | Streaming to SIEM (Chronicle/Splunk) | Transient; consumer-defined | Pay per message; real-time |

| Artifact / decision | GCP service or tool | Notes |
|---|---|---|
| Central log project | Cloud Logging in a dedicated common-folder project | Locked down to platform/security |
| Org aggregated sinks | google_logging_organization_sink (include_children = true) | One per destination/purpose |
| Audit log bucket | Logging bucket + retention lock | 400+ days, immutable |
| Log Analytics | Log Analytics-enabled bucket + linked BigQuery dataset | SQL over logs without data duplication |
| Noise control | Exclusion filters, Data Access log config | Drop/sample low-value logs |
| SIEM feed | Pub/Sub → Chronicle / Google SecOps | Real-time detection pipeline |

Cloud Monitoring — metrics, SLOs, and alerting at org scale

What it is

Cloud Monitoring provides metrics collection, uptime checks, dashboards, SLOs, and alerting for GCP (and AWS/on-prem via the Ops Agent). Its central organizing concept is the metrics scope (formerly “Workspace”). A metrics scope lives in a scoping project and lists the projects whose metrics it can see. By default a project’s scope contains only itself; the landing-zone move is to nominate one or a few scoping projects that have many monitored projects added to their scope, giving an SRE team a single project from which to dashboard and alert across dozens of workloads. (A scoping project can hold up to 375 monitored projects, so very large estates use a small number of scoping projects, typically split by environment.)
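The per-scope limit is easy to plan around. As a rough sketch, the function below splits projects by environment and then chunks each environment so that no scoping project exceeds the 375-project figure quoted above. The prj-mon-* naming and the sample project IDs are hypothetical, and the code only produces a plan; it does not modify any metrics scope.

```python
MAX_MONITORED_PROJECTS = 375  # per-scope figure quoted in the text above


def plan_metrics_scopes(projects_by_env, limit=MAX_MONITORED_PROJECTS):
    """Map each scoping-project name to the monitored projects it should hold."""
    plan = {}
    for env, projects in sorted(projects_by_env.items()):
        ordered = sorted(projects)
        needs_split = len(ordered) > limit
        for start in range(0, len(ordered), limit):
            # Only add a numeric suffix when one environment needs several scopes.
            suffix = f"-{start // limit + 1}" if needs_split else ""
            plan[f"prj-mon-{env}{suffix}"] = ordered[start:start + limit]
    return plan


# Hypothetical estate: 400 prod projects and 30 non-prod projects.
projects = {
    "prod": [f"prj-prod-{n:03d}" for n in range(400)],
    "nonprod": [f"prj-nonprod-{n:03d}" for n in range(30)],
}
plan = plan_metrics_scopes(projects)
for scoping_project, monitored in plan.items():
    print(scoping_project, len(monitored))
```

Splitting by environment first keeps prod alerting separate from non-prod noise, and the chunking only matters once a single environment outgrows one scope.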

Why it matters

Without a deliberate scope design, every team can only see its own project’s metrics and there is no org-wide view of “is the platform healthy?” Centralizing monitoring delivers:

How to do it well

Concrete artifacts, decisions, and tools

| Discipline | GCP tool / mechanism | KPI to track |
|---|---|---|
| Org-wide metric visibility | Metrics scope + scoping projects | % of prod projects in a monitored scope (→ 100%) |
| Golden-signal collection | Ops Agent, Managed Service for Prometheus | % of fleet running the Ops Agent |
| Reliability targets | Cloud Monitoring SLOs + error budgets | SLO attainment vs target; error-budget burn |
| Proactive alerting | Alerting policies (multi-burn-rate) | Alert precision (page-to-incident ratio) |
| External availability | Uptime checks | Uptime % per critical endpoint |
| On-call routing | Notification channels (PagerDuty/Slack/Pub/Sub) | MTTA / MTTR |

Billing export and budgets — making cost a first-class signal

What it is

A Cloud Billing account pays for one or more projects and is the root of all cost data. By itself the Console gives you reports, but the landing-zone foundation turns cost into queryable, alertable data through two features:

A budget alert by itself does not cap spend — GCP has no hard spending cap. The Pub/Sub channel is what lets you act (notify Slack, open a ticket, or in non-prod even disable billing on a runaway project via Cloud Functions).
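A minimal sketch of the decision logic such a function might contain is shown below. It assumes the budget notification arrives as base64-encoded JSON carrying costAmount and budgetAmount fields. The 90% and 150% ratios and the is_nonprod flag are illustrative parameters rather than GCP defaults, and the function only returns a decision instead of calling Slack or the Cloud Billing API.

```python
import base64
import json


def decide_action(pubsub_data_b64, is_nonprod, notify_ratio=0.9, disable_ratio=1.5):
    """Return (action, spend_ratio) for one budget notification."""
    payload = json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))
    cost = float(payload["costAmount"])
    budget = float(payload["budgetAmount"])
    ratio = cost / budget if budget > 0 else 0.0
    # Billing is only ever detached in non-prod; prod is notified, never cut off.
    if is_nonprod and ratio >= disable_ratio:
        return "disable_billing", ratio
    if ratio >= notify_ratio:
        return "notify", ratio
    return "ignore", ratio


# Made-up notification for illustration: 80,000 spent against a 50,000 budget.
sample = base64.b64encode(json.dumps({
    "budgetDisplayName": "dev-sandbox",
    "costAmount": 80000.0,
    "budgetAmount": 50000.0,
    "currencyCode": "INR",
}).encode("utf-8"))

print(decide_action(sample, is_nonprod=True))
print(decide_action(sample, is_nonprod=False))
```

Computing the ratio from the two amounts, rather than relying only on the configured alert thresholds, lets the handler enforce a guardrail above 100% without defining a matching threshold rule on the budget itself.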

Why it matters

In an estate of dozens of projects and INR-sensitive budgets, cost surprises come from idle GPUs, forgotten dev environments, egress, and over-provisioned databases. Billing export + budgets make cost a monitorable signal rather than a month-end shock:

How to do it well

Concrete artifacts, decisions, and tools

| Capability | GCP service / mechanism | Output |
|---|---|---|
| Row-level cost data | BigQuery billing export (Standard / Detailed / Pricing) | Per-SKU, per-project, per-label cost tables |
| Cost thresholds | Cloud Billing budgets (amount or % of prior month) | 50/90/100% forecast alerts |
| Programmatic response | Budget → Pub/Sub + Cloud Function | Slack/ticket; non-prod billing disable |
| Attribution | Labels (env, team, cost-center, app) | FinOps slicing in BigQuery/Looker |
| Optimization | Active Assist / Recommender, CUDs, Spot VMs | Idle/rightsize/CUD recommendations |
| Anomaly detection | Cost Anomaly Detection | Alert on unexpected spend spikes |
| Reporting | Looker Studio on the export dataset | Trended FinOps dashboards |
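To illustrate what label-based attribution over the export looks like, the sketch below aggregates net cost per label value over a handful of rows shaped like billing-export records (a cost, a list of credits with negative amounts, and a repeated key/value labels field). The rows are made-up sample data, and in a real estate this aggregation would normally be a SQL query against the export dataset rather than Python.

```python
from collections import defaultdict


def net_cost_by_label(rows, label_key, missing="(unlabeled)"):
    """Sum cost plus credits per value of one label key."""
    totals = defaultdict(float)
    for row in rows:
        labels = {item["key"]: item["value"] for item in row.get("labels", [])}
        # Credits are stored as negative amounts, so adding them yields net cost.
        credits = sum(credit["amount"] for credit in row.get("credits", []))
        totals[labels.get(label_key, missing)] += row["cost"] + credits
    return dict(totals)


# Made-up rows for illustration only.
rows = [
    {"cost": 100.0, "credits": [{"amount": -10.0}],
     "labels": [{"key": "team", "value": "imaging"}, {"key": "env", "value": "dev"}]},
    {"cost": 50.0, "credits": [],
     "labels": [{"key": "team", "value": "imaging"}, {"key": "env", "value": "prod"}]},
    {"cost": 30.0, "credits": [],
     "labels": [{"key": "team", "value": "patient-apps"}]},
    {"cost": 5.0, "credits": [], "labels": []},
]

print(net_cost_by_label(rows, "team"))
```

The "(unlabeled)" bucket is the number to watch: it is the share of spend that cannot be attributed, and it shrinks only when labels are enforced at project creation.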

Centralized observability across the org

What it is

The previous three sub-components each produce a central destination; centralized observability is the discipline of wiring them into one coherent operating model so that logs, metrics, traces, errors, and cost are all aggregated, correlated, and access-controlled from the top of the hierarchy. The GCP building blocks beyond Logging and Monitoring are Cloud Trace (distributed tracing), Error Reporting (aggregated exceptions), Cloud Profiler (continuous CPU/heap profiling), and the Google Cloud Observability dashboards that stitch them together — all feeding, or fed by, the central logging project, the monitoring scoping projects, and the billing/FinOps project.

Why it matters

Centralization is what converts data you happen to collect into an operating capability:

How to do it well

| Observability plane | Central destination (in common) | Primary owner | Key GCP tools |
|---|---|---|---|
| Logs / audit | Logging-sink project + SIEM project | Security | Aggregated org sinks, Log Analytics, Pub/Sub→Chronicle |
| Metrics / SLOs | Monitoring scoping projects | SRE | Metrics scope, SLOs, alerting, Ops Agent, Prometheus |
| Traces / errors / profiles | Per-app, surfaced centrally | Service teams + SRE | Cloud Trace, Error Reporting, Cloud Profiler |
| Cost | Billing/FinOps project | FinOps | BigQuery billing export, budgets, Recommender, Looker |

Real-world enterprise scenario

Sahyadri Health Networks is a fictional ₹3,000-crore hospital and diagnostics group (≈US$360M revenue, ~600 engineers) running patient-facing apps, a diagnostics-imaging pipeline, and a clinical data lake on GCP. They are bound by Indian data-residency rules and DPDP, must keep audit trails for clinical systems for 7 years, and have been burned twice by runaway dev spend on GPU notebooks for imaging ML. Their hierarchy (from Part 1) is the hybrid pattern with bootstrap and common folders and three product domains — Patient Apps, Imaging, and Clinical Data — across prod / nonprod / dev, ~55 projects total.

Cloud Logging. In the common folder they stand up prj-logging, owned by the security team. Two organization-level aggregated sinks (include_children = true) route everything: one to a Log Analytics-enabled Logging bucket (audit-bucket, pinned to asia-south1, retention 2,555 days = 7 years, retention lock applied) for tamper-proof clinical audit trails, and one to a BigQuery dataset for analytics. A third sink streams to a Pub/Sub topic feeding Google SecOps (Chronicle) for real-time detection. Data Access audit logs are enabled on the Clinical Data projects and routed separately; exclusion filters drop chatty health-check and load-balancer logs to control cost. Workload teams retain only their _Default 30-day buckets; they cannot touch the central audit bucket.
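The retention figure is simple arithmetic, but it is worth encoding so that a policy change cannot silently exceed what a log bucket allows. The sketch below is one hypothetical way to derive the audit bucket's settings from the 7-year requirement. The field names follow the log-bucket resource as I understand it (retentionDays, locked, analyticsEnabled), and the parent path format is an assumption to verify against your tooling.

```python
MAX_RETENTION_DAYS = 3650  # upper bound on custom bucket retention


def audit_bucket_settings(years, project, location, bucket_id):
    """Derive desired settings for a retention-locked, analytics-enabled bucket."""
    retention_days = years * 365
    if not 1 <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError(f"{retention_days} days is outside 1..{MAX_RETENTION_DAYS}")
    return {
        # The region is part of the bucket's resource path, not its body.
        "parent": f"projects/{project}/locations/{location}",
        "bucketId": bucket_id,
        "body": {
            "retentionDays": retention_days,
            "locked": True,            # retention can no longer be changed once locked
            "analyticsEnabled": True,  # enables Log Analytics on the bucket
        },
    }


settings = audit_bucket_settings(7, "prj-logging", "asia-south1", "audit-bucket")
print(settings["body"]["retentionDays"])
```

A requirement longer than ten years would fail this check. In that case the Cloud Storage archive destination with Bucket Lock is the more natural home for the oldest data.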

Cloud Monitoring. Two scoping projects — prj-mon-prod and prj-mon-nonprod — sit in common; all 55 workload projects are added to the appropriate scope. The Ops Agent is baked into the golden VM image and GKE node pools. SRE defines SLOs on the patient-booking and imaging-upload services (99.9% availability, p95 latency < 400 ms) with multi-burn-rate error-budget alerting routed through PagerDuty and a #sre-oncall Slack channel via centralized notification channels. Uptime checks watch the public booking and portal endpoints; the imaging team’s existing Prometheus exporters feed Managed Service for Prometheus so they keep PromQL.
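For readers new to burn-rate alerting, the arithmetic behind it is short. The sketch below computes the burn rate for a 99.9% SLO and applies a multi-window rule in which an alert fires only when both a long and a short window exceed the threshold. The window pairs and the 14.4 / 6 / 1 thresholds are the commonly cited values from Google's SRE Workbook, used here as an assumption and not as Sahyadri's actual policy. The error ratios are made-up inputs.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the errors are arriving."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# (long window, short window, burn-rate threshold, severity); assumed values.
POLICIES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_ratios, slo_target=0.999):
    """error_ratios maps a window label to the observed error ratio in that window."""
    fired = []
    for long_w, short_w, threshold, severity in POLICIES:
        long_burn = burn_rate(error_ratios[long_w], slo_target)
        short_burn = burn_rate(error_ratios[short_w], slo_target)
        # Both windows must agree, so a problem that has already recovered stops alerting.
        if long_burn >= threshold and short_burn >= threshold:
            fired.append((long_w, severity))
    return fired


fast_burn = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
slow_burn = {"5m": 0.0015, "30m": 0.0015, "1h": 0.0015, "6h": 0.0015, "3d": 0.0015}
print(evaluate(fast_burn))
print(evaluate(slow_burn))
```

With a 0.1% error budget, a sustained 2% error ratio burns about 20 times faster than budgeted and pages quickly. A 0.15% error ratio burns only 1.5 times too fast, so it surfaces as a ticket instead of waking anyone up.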

Billing export & budgets. prj-finops in common hosts the Detailed BigQuery billing export (resource-level, so they can see per-GPU-VM cost). Budgets are layered: one org-wide budget, three per-domain (label-scoped) budgets, and tight per-project budgets on every development project at ₹50,000/month forecast. Every budget publishes to Pub/Sub; a Cloud Function routes 90% alerts to Slack and, in dev sandboxes only, disables billing on a project that blows past 150% — directly killing the runaway-GPU pattern. A Looker Studio dashboard slices spend by env, team, cost-center, with month-over-month trend; Recommender CUD and idle-VM suggestions feed a monthly FinOps review.

Centralized observability. All teams emit structured JSON logs that carry the trace and spanId fields, so the console auto-links logs to Cloud Trace spans and Error Reporting groups. Read access is group-based: gcp-security-auditors on prj-logging, gcp-sre-oncall on the monitoring scopes, and gcp-finops on prj-finops, with none of them needing access to workload projects. The whole wiring is in the foundation Terraform so every newly factory-created project inherits the org sink, joins a monitoring scope, and carries cost labels automatically.
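As an illustration of that log shape, the helper below emits one JSON line using the special keys that Cloud Logging's structured-logging convention recognizes for trace correlation (logging.googleapis.com/trace and logging.googleapis.com/spanId). The project ID, IDs, and extra fields are hypothetical. A real service would take the trace and span IDs from its incoming request context rather than hard-coding them.

```python
import json


def structured_log(message, severity, project_id, trace_id, span_id, **fields):
    """Return one JSON log line that the logging agent can link to a trace span."""
    entry = {
        "severity": severity,
        "message": message,
        # The trace must be the full resource name, not just the hex ID.
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
        "logging.googleapis.com/spanId": span_id,
    }
    entry.update(fields)
    return json.dumps(entry)


# Hypothetical values for illustration.
line = structured_log(
    "booking confirmed",
    "INFO",
    project_id="prj-patient-apps-prod",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    bookingChannel="web",
)
print(line)
```

Writing this line to stdout is typically enough on managed runtimes and GKE, where the platform's logging agent parses JSON output into structured entries.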

Artifacts produced. A dedicated prj-logging, prj-mon-{prod,nonprod}, and prj-finops set of projects; two org aggregated sinks + a SIEM Pub/Sub sink as google_logging_organization_sink; a retention-locked 7-year audit bucket; metrics-scope membership for all 55 projects; SLOs and multi-burn-rate alert policies as code; a Detailed BigQuery billing export; layered budgets with a Pub/Sub-triggered guardrail Cloud Function; and a Looker Studio FinOps dashboard.

Measurable outcome (6 months): audit-log coverage went from per-project and mutable to 100% centralized, immutable for 7 years — accepted by their clinical-systems auditor as the trail of record. Mean time to acknowledge production incidents dropped from ~25 minutes to under 4 minutes with centralized golden-signal dashboards and burn-rate paging. FinOps now attributes 100% of spend by team and environment, and the dev per-project budgets + auto-disable guardrail cut idle dev/GPU spend by ~44%, eliminating the runaway-notebook surprises entirely.

Deliverables & checklist

Common pitfalls

  1. Leaving logs in per-project _Default buckets. Investigators must hunt across dozens of projects, and a team can delete the evidence of its own actions. Avoid: org-level aggregated sinks into a locked-down central logging project, with a retention lock on the audit bucket.
  2. Turning on Data Access audit logs everywhere without routing them. They are enormous and can dominate the Logging bill. Avoid: enable Data Access logs only where compliance requires, route them to their own (cheaper) destination, and use exclusion filters to sample.
  3. Forgetting that billing export only populates forward. Teams enable export months into the build and have no historical cost data to baseline against. Avoid: enable BigQuery billing export on day one, before workloads land.
  4. Assuming a budget caps spend. A budget alert is informational; GCP has no hard cap, so a runaway job keeps spending after the 100% email. Avoid: wire budgets to Pub/Sub and a response function (Slack/ticket, and billing-disable in non-prod) so alerts trigger action.
  5. Alerting on raw thresholds instead of error budgets. Static thresholds either page constantly (fatigue) or miss slow burns. Avoid: define SLOs and use multi-window, multi-burn-rate alerting that catches fast and slow burns with far fewer false pages.
  6. Not adding new projects to the monitoring scope / sink. A project created after the foundation was built silently has no central metrics, alerts, or labels. Avoid: bake sink inheritance, scope membership, and cost labels into the project-factory so observability is automatic by construction.
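Pitfall 6 in particular lends itself to an automated check. The sketch below is a hypothetical drift report. Given each project's labels and the set of project IDs already in a metrics scope, it lists what is missing and computes the scope-coverage KPI. The inputs are plain dictionaries here; in practice they would come from an inventory source such as Cloud Asset Inventory, which this code does not call.

```python
REQUIRED_LABELS = {"env", "team", "cost-center", "app"}


def observability_gaps(projects, scoped_project_ids):
    """projects maps project ID to its label dict; returns problems per project."""
    gaps = {}
    for project_id, labels in sorted(projects.items()):
        problems = []
        if project_id not in scoped_project_ids:
            problems.append("not in a metrics scope")
        missing = sorted(REQUIRED_LABELS - set(labels))
        if missing:
            problems.append("missing labels: " + ", ".join(missing))
        if problems:
            gaps[project_id] = problems
    return gaps


def scope_coverage_pct(projects, scoped_project_ids):
    """Percentage of known projects that belong to some metrics scope."""
    if not projects:
        return 100.0
    covered = sum(1 for project_id in projects if project_id in scoped_project_ids)
    return 100.0 * covered / len(projects)


# Made-up inventory for illustration.
inventory = {
    "prj-imaging-prod": {"env": "prod", "team": "imaging", "cost-center": "cc-101", "app": "pacs"},
    "prj-imaging-dev": {"env": "dev", "team": "imaging"},
    "prj-booking-prod": {"env": "prod", "team": "patient-apps", "cost-center": "cc-202", "app": "booking"},
    "prj-new-sandbox": {},
}
scoped = {"prj-imaging-prod", "prj-imaging-dev", "prj-booking-prod"}

for project_id, problems in observability_gaps(inventory, scoped).items():
    print(project_id, problems)
print(scope_coverage_pct(inventory, scoped))
```

Running a report like this on a schedule catches projects created outside the factory. The project factory remains the primary control, and the report is only a backstop.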

What’s next

Part 6 of Google Cloud Landing Zone Design turns to Platform Automation & the Foundation Pipeline — the Terraform foundation, Cloud Build/Infrastructure Manager CI/CD, the project factory, and the policy-as-code that provisions and governs every layer you have built across this series.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.