GCP Lesson 28 of 98

Google Cloud Operations Suite, In Depth: Cloud Monitoring, Logging, Trace & Error Reporting

You cannot operate what you cannot see. Every other lesson in this course builds something — a VM, a Cloud Run service, a BigQuery dataset, a GKE cluster — and the moment that thing is in production, the questions change from “how do I create it?” to “is it healthy, is it fast, is it about to fall over, and when it broke at 03:00 what actually happened?”. Answering those questions is the job of the Google Cloud Operations suite, the umbrella name for Google’s built-in observability stack: Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, and Error Reporting, fed by the Ops Agent on VMs and by automatic instrumentation everywhere else.

If you have been around Google Cloud for a while you will know this suite by its old name, Stackdriver — Google acquired the company Stackdriver in 2014, rebranded the products to “Cloud Monitoring”, “Cloud Logging” and so on around 2020, and grouped them under “Google Cloud’s operations suite”. The Stackdriver name is gone from the console but lingers in old blog posts, in some API names, and in the muscle memory of every engineer who has been doing this longer than five years. When you see “Stackdriver Monitoring” anywhere, read it as “Cloud Monitoring”.

This lesson is deliberately exhaustive. The goal is that after reading it once you understand every important option in the suite — every metric type, the difference between MQL and PromQL, every field on an alerting policy, what a metrics scope actually is, the full log-routing model from log entry to sink to bucket, log-based metrics, Log Analytics, the three application-performance tools, and the audit-log taxonomy that auditors will ask you about. That is enough to operate these services in production, answer an interviewer’s probing follow-ups, and pass the observability sections of the Associate Cloud Engineer (ACE) and Professional Cloud DevOps Engineer (PCDE) exams.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites

You should be comfortable with the Google Cloud resource hierarchy (organisation → folders → projects), with IAM roles and service accounts, and with the gcloud CLI from the first-steps lesson. A throwaway project with the $300 free-trial credit (or any project you can spend a few rupees in) is enough for the lab. In the Zero-to-Hero course this is the Operations lesson of the Intermediate tier — it is the capability that makes every prior compute, storage, data and networking lesson operable, and it is assumed by the troubleshooting playbooks that follow it.

Core concept: the suite and where each piece fits

The Operations suite is five products that share one mental model: signals about your workloads flow into Google-managed backends, where you can query, visualise and alert on them. It is useful to anchor each product to the classic “three pillars of observability” plus a couple of extras:

Product | Pillar / job | What it stores | Free allotment (monthly)
Cloud Monitoring | Metrics | Time series (numbers over time) | All Google Cloud metrics free; first 150 MiB of chargeable (custom/agent/Prometheus) metrics free per billing account
Cloud Logging | Logs | Log entries (structured events) | First 50 GiB of log ingestion free per project
Cloud Trace | Traces | Latency spans of distributed requests | First 2.5 million spans ingested free per billing account
Cloud Profiler | Continuous profiling | CPU/heap flame graphs from production | Free (no charge for Profiler)
Error Reporting | Error aggregation | Grouped, deduplicated exceptions | Free (built on Logging)

Two architectural facts shape everything else and are worth fixing in your mind now:

  1. Most Google Cloud signals are collected automatically. Every Google Cloud service emits platform metrics (CPU, request count, queue depth) to Cloud Monitoring and many emit logs to Cloud Logging without you installing anything. You install the Ops Agent only to get guest-level signals from inside a VM — memory, disk, process, and application logs — which the hypervisor cannot see. Serverless products (Cloud Run, Cloud Functions, App Engine, GKE Autopilot) need no agent at all.
  2. Monitoring is organised around a “scope”; logging is organised around “routing”. These are the two ideas people get wrong. A metrics scope is a list of projects whose metrics a single “scoping project” can see — that is how you build a single pane of glass across many projects. Log routing is the pipeline log entry → Log Router → sink → destination that decides where each log line is stored, copied, or dropped. Get those two right and the rest is detail.

Core concept: metric types, resources, and metrics scopes

A metric in Cloud Monitoring is a named measurement with a defined kind and value type — for example compute.googleapis.com/instance/cpu/utilization. A time series is the stream of timestamped values for one metric for one specific resource (this VM, that bucket), distinguished by labels (key/value pairs such as instance_name, zone).

There are several families of metrics, and knowing which is which tells you what is free and what you must instrument:

Metric family | Source | Examples | Cost
Google Cloud (platform) metrics | Emitted by Google services automatically | compute…/cpu/utilization, loadbalancing…/https/request_count, pubsub…/subscription/num_undelivered_messages | Free
Agent metrics | Ops Agent inside a VM | agent.googleapis.com/memory/percent_used, …/disk/percent_used | Chargeable (counts toward the 150 MiB free tier)
Custom metrics | Your code via the API/OpenTelemetry, prefix custom.googleapis.com/ or workload.googleapis.com/ | custom…/orders_processed | Chargeable
Prometheus metrics | Google Cloud Managed Service for Prometheus | prometheus.googleapis.com/… | Chargeable, priced per sample
External / BindPlane metrics | Third-party integrations | varies | Chargeable

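Every metric also has a metric kind and a value type, and you must understand the kinds because they change how you query and chart them. The kinds are GAUGE (the value at an instant, such as CPU utilisation), DELTA (the change over an interval, such as a request count), and CUMULATIVE (a running total since a start time, which you usually chart as a rate).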

Value types are BOOL, INT64, DOUBLE, STRING, DISTRIBUTION (a histogram — latency metrics are distributions, which is what lets you compute the 95th/99th percentile), and MONEY.
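To see why a DISTRIBUTION value makes percentiles possible, here is a minimal Python sketch that estimates a percentile from histogram bucket counts by linear interpolation. The function name, bucket bounds and counts are made up for illustration, and this approximates the idea rather than reproducing the exact algorithm Cloud Monitoring uses.

```python
def percentile_from_buckets(upper_bounds, counts, q):
    """Estimate the q-quantile (0 < q <= 1) from histogram buckets.

    upper_bounds: ascending upper bound of each bucket; the first bucket starts at 0.
    counts: number of samples that fell into each bucket.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    rank = q * total   # how many samples lie at or below the answer
    seen = 0           # samples counted in earlier buckets
    lower = 0.0        # lower bound of the current bucket
    for upper, count in zip(upper_bounds, counts):
        if count and seen + count >= rank:
            # Assume samples are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
        lower = upper
    return float(upper_bounds[-1])


# Hypothetical latency histogram: 50 requests under 100 ms,
# 40 between 100 and 200 ms, 10 between 200 and 400 ms.
bounds_ms = [100, 200, 400]
counts = [50, 40, 10]
p50 = percentile_from_buckets(bounds_ms, counts, 0.50)
p95 = percentile_from_buckets(bounds_ms, counts, 0.95)
print(p50, p95)
```

The estimate is only as fine-grained as the bucket boundaries, which is why bucket layout matters when you define a distribution metric.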

Monitored resources

Every time series is attached to a monitored resource — a typed object such as gce_instance, gae_app, k8s_container, cloud_run_revision, gcs_bucket, or the catch-all global. The resource type plus its labels (project, location, instance id) is how Monitoring knows what a number describes. When you write queries you filter and group by these resource labels constantly.

Metrics scopes (the multi-project single pane of glass)

By default a project’s Monitoring console shows only that project’s metrics. A metrics scope changes this: it is a configuration object owned by one scoping project that lists the projects (and AWS accounts, historically via connected accounts) whose metrics are visible together. The classic pattern is to create a dedicated monitoring/observability project, make it the scoping project, and add all your workload projects to its scope. One dashboard or one alerting policy can then span the whole estate.

Key rules an interviewer will check:

# Add a workload project into a monitoring project's metrics scope
gcloud beta monitoring metrics-scopes create \
  projects/WORKLOAD_PROJECT_ID \
  --project=OBSERVABILITY_PROJECT_ID

Cloud Monitoring querying: Console, MQL and PromQL

There are three ways to ask Cloud Monitoring a question, and a serious engineer knows all three.

1. The Metrics Explorer (visual query builder). In the console you pick a metric, choose how to align each series into a regular grid (the aligner: rate, mean, max, percentile_99, etc., over an alignment period like 60s), and then reduce/group across series (the reducer: sum, mean, count, grouped by a label such as zone). Alignment then reduction is the universal two-step of time-series math: every backend forces a series onto a regular time grid (alignment) before combining series (reduction).
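The align-then-reduce two-step can be sketched in a few lines of Python. This is an illustrative model only: the function names and the sample points are invented, and the real service offers many more aligners and reducers than the mean and sum shown here.

```python
from collections import defaultdict
from statistics import mean


def align_mean(points, period_s):
    """Alignment: put one series onto a regular grid by averaging each window."""
    windows = defaultdict(list)
    for timestamp_s, value in points:
        window_start = timestamp_s - timestamp_s % period_s
        windows[window_start].append(value)
    return {start: mean(values) for start, values in sorted(windows.items())}


def reduce_sum(aligned_series):
    """Reduction: combine several aligned series by summing values per window."""
    totals = defaultdict(float)
    for series in aligned_series:
        for window_start, value in series.items():
            totals[window_start] += value
    return dict(sorted(totals.items()))


# Two hypothetical VMs reporting CPU percent at irregular timestamps (seconds).
vm_a = [(0, 20), (30, 40), (60, 60)]
vm_b = [(10, 10), (70, 30), (100, 50)]

aligned = [align_mean(series, 60) for series in (vm_a, vm_b)]
combined = reduce_sum(aligned)
print(aligned)
print(combined)
```

Without the alignment step the two VMs have no timestamps in common, so there would be nothing sensible to sum.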

2. Monitoring Query Language (MQL). MQL is Google’s own pipe-based query language, powerful for ratios, joins across metrics, and time-shifted comparisons that the visual builder cannot express. It reads left to right with | stages:

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
# cpu/utilization is a gauge, so align with a mean rather than rate()
| align mean_aligner(1m)
| every 1m
| group_by [resource.zone], [mean: mean(value.utilization)]

A ratio — error rate as a fraction of total requests — shows why MQL exists:

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| align rate(1m)
| every 1m
# Split the stream into two tables (5xx only, and everything), then join them
| { filter metric.response_code_class = 500
    | group_by [resource.url_map_name], [errs: sum(value.request_count)]
  ; group_by [resource.url_map_name], [total: sum(value.request_count)] }
| join
| value [ratio: errs / total]

3. PromQL (via Managed Service for Prometheus). If your team comes from Kubernetes/Prometheus, you can query Cloud Monitoring with PromQL — both your Prometheus-format metrics and Google Cloud metrics (which are mapped into the Prometheus data model). This is enormously valuable because you can reuse existing Grafana dashboards and Prometheus alerting rules unchanged:

# 99th percentile request latency over 5 minutes
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

When to choose which:

Need | Use
Quick ad-hoc chart, no syntax | Metrics Explorer (visual)
Ratios, joins, time-shift, complex Google-metric math | MQL
Reuse Prometheus/Grafana, Kubernetes-native team, counters with rate()/histogram_quantile() | PromQL

Note the direction of history: Google is investing heavily in Managed Service for Prometheus and PromQL as the strategic query path, especially for GKE. MQL is still the most expressive option for cross-metric Google-Cloud math, but Google has announced that it is no longer a recommended query language, so check its current support status before building new dashboards or alerts on it. If you are starting fresh, especially on Kubernetes, lead with PromQL.

Cloud Monitoring: dashboards

A dashboard is a saved collection of widgets. There are two kinds:

Each chart can be backed by the visual builder, MQL, or PromQL. Dashboards support dashboard-level filters and template variables (e.g. a zone or namespace picker that re-scopes every widget at once), and a time-range control shared across widgets. Crucially, dashboards are resources you can define as code — export any dashboard to JSON and manage it in Terraform via google_monitoring_dashboard, which is the right way to keep dashboards reproducible across environments.

gcloud monitoring dashboards create --config-from-file=dashboard.json
gcloud monitoring dashboards list

Cloud Monitoring: alerting policies

An alerting policy is the rule that decides when something is wrong and who finds out. Master its anatomy because every field is fair game in an exam and in production.

A policy contains one or more conditions, combined with a policy-level combiner (OR triggers when any condition is met, AND triggers when all conditions are met). Each condition has a type:

Condition type | What it watches | Typical use
Metric threshold | A metric crosses above/below a value | CPU > 80%, error rate > 1%, queue depth > 1000
Metric absence | A metric stops reporting for a duration | A job that should emit a heartbeat went silent
Forecast (predictive) | Projected to cross a threshold within a window | Disk will fill within 4 hours
MQL / PromQL condition | An arbitrary query result crosses a threshold | Ratios, multi-metric logic, Prometheus rules
Log-based (log match) | A matching log entry appears | “FATAL” log line, a specific audit event
SLO burn-rate | An SLO’s error budget is burning too fast | SRE-style alerting (covered below)
Uptime-check failure | An uptime check fails from N locations | Public endpoint is down

For a metric-threshold condition the fields you set are:

Policy-level settings:

Notification channels

A notification channel is a destination configured once and reused by many policies:

Channel | Notes
Email | Simplest; good for low-volume
SMS | Verified numbers
Slack | Via the Monitoring Slack integration
PagerDuty / other on-call | The production answer for paging
Pub/Sub | Programmatic: fan out to anything (Cloud Functions, ServiceNow, custom)
Webhook | POST to an HTTP endpoint (optionally with basic auth)
Google Chat | Webhook into a space
Mobile app | The Google Cloud mobile app push

# Create an email channel, then a CPU-threshold policy that uses it
gcloud beta monitoring channels create \
  --display-name="SRE email" --type=email \
  --channel-labels=email_address=sre@example.com

gcloud alpha monitoring policies create --policy-from-file=cpu-policy.yaml

Best practice: define channels and policies as code (Terraform google_monitoring_notification_channel, google_monitoring_alert_policy). Alert config drifting between environments is a classic cause of “prod didn’t page”.

Cloud Monitoring: uptime checks

An uptime check actively probes an endpoint from Google’s global infrastructure and reports availability and latency. You configure:

An uptime check does nothing on its own — pair it with an uptime-check-failure alerting condition so a failure pages someone. Uptime checks can also reach private endpoints when configured with the appropriate connectivity.

gcloud monitoring uptime create "homepage-https" \
  --resource-type=uptime-url \
  --resource-labels=host=example.com \
  --protocol=https --path=/ --port=443 --period=5

Cloud Monitoring: SLOs, SLIs and error budgets

This is the SRE heart of the suite and a guaranteed PCDE exam topic. The vocabulary:

In Cloud Monitoring you create a Service (which can be auto-detected for Cloud Run, GKE, App Engine, or custom), then attach SLOs to it. For each SLO you choose:

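Then you create burn-rate alerting policies. The recommended pattern is multi-window, multi-burn-rate alerts: a fast-burn alert (e.g. burn rate ≥ 14.4 over a 1-hour window → page immediately, because at that rate you would exhaust a 30-day budget in about 2 days) and a slow-burn alert (e.g. burn rate ≥ 1 over a multi-day window such as 3 days → ticket, not a page). A burn rate of exactly 1 means the budget runs out precisely at the end of the period, so the slow-burn window has to be long enough to ignore short blips. This gives you urgency without flapping.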

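# List the services Monitoring knows about, then the SLOs on one of them (Service Monitoring REST API)
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/services"
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives"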

Interview gold: “What’s the difference between an SLA, an SLO and an SLI?” — An SLI is the measurement, an SLO is your internal target, an SLA is the contractual promise to a customer (with penalties) — and your SLO should be stricter than your SLA so you have warning before you breach the contract.
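The burn-rate numbers quoted for fast-burn alerts follow from simple arithmetic, sketched here in Python. The function names and the 99.9% target are assumptions chosen for illustration; the formulas are the standard SRE definitions, not anything specific to the Cloud Monitoring API.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / error_budget(slo_target)


def days_until_exhausted(rate, period_days=30):
    """Days until the whole budget is gone if this burn rate is sustained."""
    return period_days / rate


def budget_spent_in_window(rate, window_hours, period_days=30):
    """Fraction of the period's budget consumed during one alert window."""
    return rate * window_hours / (period_days * 24)


slo = 0.999                     # hypothetical 99.9% availability target
rate = burn_rate(0.0144, slo)   # 1.44% of requests are failing right now
print(round(rate, 1))
print(round(days_until_exhausted(14.4), 2))
print(round(budget_spent_in_window(14.4, 1), 3))
```

Read this way, a burn rate of 14.4 sustained for one hour spends about 2% of a 30-day budget, which is why it is treated as page-worthy.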

Cloud Logging: the routing model

Cloud Logging’s architecture confuses people until they see the pipeline drawn once. Every log entry, the moment it is ingested by the Log Router, is compared against the inclusion/exclusion filters of every sink in the resource. Sinks decide where entries go.

log entry  →  Log Router  →  [each sink's filter]  →  destination(s)
                                                      ├─ Logging bucket (default)
                                                      ├─ BigQuery dataset
                                                      ├─ Cloud Storage bucket
                                                      ├─ Pub/Sub topic
                                                      └─ another GCP project / Logging bucket
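The pipeline above can be modelled as a toy router in Python. Everything here (the Sink class, the lambda filters, the destination strings) is hypothetical and heavily simplified; real sinks use Logging-query-language filters, and the real _Default sink also leaves out the entries that belong to _Required.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Entry = Dict[str, str]
Filter = Callable[[Entry], bool]


@dataclass
class Sink:
    name: str
    destination: str
    inclusion: Filter
    exclusions: List[Filter] = field(default_factory=list)

    def accepts(self, entry: Entry) -> bool:
        # An entry is routed when it matches the inclusion filter
        # and matches none of the exclusion filters.
        return self.inclusion(entry) and not any(ex(entry) for ex in self.exclusions)


def route(entry: Entry, sinks: List[Sink]) -> List[str]:
    """Each sink is evaluated independently of the others."""
    return [sink.destination for sink in sinks if sink.accepts(entry)]


sinks = [
    Sink("_Default", "logging-bucket:_Default",
         inclusion=lambda e: True,
         exclusions=[lambda e: e.get("url") == "/healthz"]),
    Sink("errors-to-bq", "bigquery:logs_dataset",
         inclusion=lambda e: e["severity"] in {"ERROR", "CRITICAL"}),
]

health_check = {"severity": "INFO", "url": "/healthz"}
app_error = {"severity": "ERROR", "url": "/checkout"}
print(route(health_check, sinks))
print(route(app_error, sinks))
```

The point to take away is that sinks do not compete: one entry can land in several destinations, or in none if every sink filters it out.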

Log buckets

Logs are stored in log buckets — Logging-specific storage, not to be confused with Cloud Storage buckets. Every project has two created automatically:

Bucket | Contents | Retention | Can you delete/change it?
_Required | Admin Activity, System Event and Access Transparency audit logs | 400 days, fixed | No: cannot be modified or deleted, and is free
_Default | Everything else that isn’t routed elsewhere | 30 days by default (configurable 1–3650 days) | Retention is editable; the bucket itself can be disabled via the _Default sink

You can create your own user-defined buckets with custom retention (1 to 3650 days), choose their region (data residency), and enable two important options:

Sinks

A sink has an inclusion filter (a Logging-query-language expression selecting which entries it captures), optional exclusion filters, and a destination. Two automatic sinks exist: _Required (routes the mandated audit logs to the _Required bucket; immutable) and _Default (routes everything else to the _Default bucket; you can edit or disable it — e.g. add an exclusion to stop ingesting noisy logs and save money).

Destinations:

Destination | Why | Watch out for
Logging bucket | Keep in Logging, custom retention/region, enable Log Analytics | The default home
BigQuery dataset | SQL analytics, joins with business data, long retention | Streaming-insert/storage cost; schema per log type
Cloud Storage bucket | Cheap long-term/compliance archive | Hourly batched objects; not query-friendly raw
Pub/Sub topic | Stream to Splunk/Datadog/SIEM or custom processing | You build the consumer
Another project / log bucket | Centralisation | Needs cross-project IAM (grant the sink’s writer identity roles/logging.bucketWriter on the destination)

Every sink runs under a writer identity (a service account, often p<project-number>-…@gcp-sa-logging.iam.gserviceaccount.com or a per-sink service account). You must grant that identity write permission on the destination (e.g. roles/bigquery.dataEditor, roles/storage.objectCreator, roles/pubsub.publisher). The gcloud command prints the writer identity on creation — copy it and grant the role, or the sink silently drops everything.

# Route WARNING-and-above logs from Compute Engine instances to a BigQuery dataset
gcloud logging sinks create app-to-bq \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/logs_dataset \
  --log-filter='resource.type="gce_instance" AND severity>=WARNING'

# gcloud prints the writerIdentity, which already starts with "serviceAccount:".
# Pass it unchanged as the member and grant it BigQuery write:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='<writerIdentity-from-output>' \
  --role='roles/bigquery.dataEditor'

Aggregated sinks (org- and folder-level centralisation)

The single most important production pattern is the aggregated sink at the organisation or folder level. With --include-children, one sink at the org node captures logs from every project under it and routes them to a central destination — a security/logging project’s BigQuery dataset or a dedicated log bucket. This is the foundation of a governed logging lake: every project’s logs land in one tamper-resistant place the SOC can query, regardless of who created the project. This pattern is the entire subject of Centralized Logging Lake on GCP for Security and Compliance — read it next if you are responsible for an estate rather than a single project.

gcloud logging sinks create org-audit-lake \
  bigquery.googleapis.com/projects/SECURITY_PROJECT/datasets/org_logs \
  --organization=ORG_ID --include-children \
  --log-filter='logName:"cloudaudit.googleapis.com"'

Cloud Logging: querying, exclusions and retention

Logs Explorer and the Logging query language

The Logs Explorer is where you read logs interactively. It uses the Logging query language (LQL) — a filter syntax over the structured fields of a LogEntry:

resource.type="cloud_run_revision"
severity>=ERROR
jsonPayload.message=~"timeout|deadline"     -- regex match
timestamp>="2026-06-14T00:00:00Z"
labels."run.googleapis.com/execution_name"="job-x-abc"

Key fields you filter on constantly: resource.type, resource.labels.*, severity (DEFAULT < DEBUG < INFO < NOTICE < WARNING < ERROR < CRITICAL < ALERT < EMERGENCY), logName, textPayload, jsonPayload.*, protoPayload.* (audit logs live here), labels.*, httpRequest.*, and trace. The Explorer gives you a histogram of volume over time, a field-summary side panel for fast facet filtering, and saved/recent queries.

Exclusions (cost control)

Not every log is worth storing. An exclusion is a rule on the _Default sink (or any sink) that drops matching entries before ingestion, so you are not billed for them. Classic targets: health-check requests to a load balancer, verbose INFO/DEBUG from a chatty service, Dataflow shuffler logs. You can exclude all or a percentage (sample, e.g. keep 10%). Exclusions are the first lever to pull when your Logging bill surprises you.

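# ':' is a substring match; requestUrl often holds the full URL, so '=' on the bare path may not match
gcloud logging sinks update _Default \
  --add-exclusion=name=drop-lb-healthchecks,filter='httpRequest.requestUrl:"/healthz"'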

Retention

Retention is set per bucket (1–3650 days). _Required is fixed at 400 days and free. The _Default bucket defaults to 30 days. Beyond your chosen retention you keep logs cheaply by routing a copy to Cloud Storage (with object lifecycle and Bucket Lock for WORM) or BigQuery for queryable history. Ingestion is billed once at intake; storing beyond the first 30 days incurs a per-GiB retention charge, which is why archiving to GCS for multi-year compliance is usually cheaper than long bucket retention.

Cloud Logging: log-based metrics

A log-based metric turns log volume or extracted values into a Cloud Monitoring metric so you can chart and alert on it. Two kinds:

Type | What it produces | Example
Counter | A count of log entries matching a filter | Number of severity=ERROR lines per minute → alert when it spikes
Distribution | A histogram of a numeric value extracted from logs | Latency parsed out of a log field → percentile charts

You can attach labels to a log-based metric (extracted from log fields with a regex or field path) so the resulting metric is dimensioned — e.g. count of 5xx broken down by status_code. There are system-defined log-based metrics (e.g. logging.googleapis.com/byte_count) and user-defined ones you create. This is the bridge from the logging world into the alerting world: when the only signal you have is a log line, a counter log-based metric + a threshold alert is how you page on it.

gcloud logging metrics create error_count \
  --description="App error log lines" \
  --log-filter='severity>=ERROR AND resource.type="cloud_run_revision"'
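Conceptually, a counter log-based metric with a label is a filtered count per time window per label value. The Python below is a minimal sketch of that idea with invented entries and helper names; the real metric is computed by Cloud Logging, not by your code.

```python
from collections import Counter


def counter_metric(entries, matches, label_of):
    """Count matching log entries per (minute, label value)."""
    counts = Counter()
    for entry in entries:
        if matches(entry):
            minute = entry["timestamp"][:16]   # "YYYY-MM-DDTHH:MM"
            counts[(minute, label_of(entry))] += 1
    return counts


# Hypothetical log entries, reduced to the fields the metric cares about.
entries = [
    {"timestamp": "2026-06-14T03:00:05Z", "severity": "ERROR", "status": "500"},
    {"timestamp": "2026-06-14T03:00:40Z", "severity": "ERROR", "status": "503"},
    {"timestamp": "2026-06-14T03:00:59Z", "severity": "INFO", "status": "200"},
    {"timestamp": "2026-06-14T03:01:10Z", "severity": "ERROR", "status": "500"},
]

errors_by_status = counter_metric(
    entries,
    matches=lambda e: e["severity"] == "ERROR",   # plays the role of the log filter
    label_of=lambda e: e["status"],               # plays the role of a label extractor
)
print(errors_by_status)
```

Each distinct label value creates its own time series, so labels with many possible values (user ids, request ids) multiply the number of series you pay for.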

Cloud Logging: Log Analytics

Log Analytics lets you run standard SQL (BigQuery’s engine) directly over a log bucket, instead of LQL filters. Enable it per-bucket; you then query the logs as a table — JOIN, GROUP BY, window functions, the lot — which is far more powerful than the Explorer for aggregations like “top 20 user agents hitting 404 this week” or “p99 latency per endpoint per hour”. You can optionally create a linked BigQuery dataset so the same data is reachable from BigQuery and joinable with your business tables, at no additional storage cost (the data still lives in the log bucket). Log Analytics is the modern replacement for “route everything to BigQuery just so I can SQL it” — for analysis you often no longer need the BigQuery export at all.

-- In Log Analytics: 5xx count per Cloud Run service, last 24h
-- resource.labels is a JSON column, so extract the label with JSON_VALUE
SELECT JSON_VALUE(resource.labels.service_name) AS svc,
       COUNT(*) AS errors
FROM   `PROJECT_ID.global._Default._AllLogs`
WHERE  resource.type = 'cloud_run_revision'
  AND  http_request.status >= 500
  AND  timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY svc
ORDER BY errors DESC;

Cloud Trace, Cloud Profiler and Error Reporting

The application-performance trio answers “why is this request slow / crashing / hot?”.

Cloud Trace

Cloud Trace is distributed tracing: it records the spans of a request as it fans out across services and shows you a waterfall of where the latency went (which downstream call, which database query). It is fed by OpenTelemetry (the strategic, vendor-neutral standard) or the older Cloud Trace client libraries, and it is automatic for App Engine, Cloud Run, Cloud Functions and load balancers, which propagate the traceparent/X-Cloud-Trace-Context header for you. Trace gives you per-endpoint latency distributions, the ability to drill from a slow latency bucket into an exemplar trace, and trace ↔ log correlation (a log entry carrying a trace field links straight to its trace). It samples (you tune the rate) to keep cost and overhead low.
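Trace and log correlation comes down to writing the trace id into each log entry. The sketch below parses a W3C traceparent header and builds a structured log line that carries the special logging.googleapis.com/trace fields. The helper names and the project id are hypothetical, and it assumes your platform turns JSON lines on stdout into structured log entries, as Cloud Run does.

```python
import json
import re

# W3C traceparent: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace id, span id and sampled flag, or None if the header is malformed."""
    match = TRACEPARENT.match(header.strip().lower())
    if not match:
        return None
    _version, trace_id, span_id, flags = match.groups()
    return {"trace_id": trace_id, "span_id": span_id, "sampled": bool(int(flags, 16) & 1)}


def correlated_log(project_id, message, severity, ctx):
    """Build one JSON log line that Logging can link to the trace."""
    return json.dumps({
        "severity": severity,
        "message": message,
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{ctx['trace_id']}",
        "logging.googleapis.com/spanId": ctx["span_id"],
        "logging.googleapis.com/trace_sampled": ctx["sampled"],
    })


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
line = correlated_log("my-project", "payment call timed out", "WARNING", ctx)
print(line)
```

In practice an OpenTelemetry or Cloud Logging client library usually does this for you; the sketch only shows what ends up in the entry.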

Cloud Profiler

Cloud Profiler is continuous, low-overhead production profiling. A small library in your app (Go, Java, Python, Node.js) periodically samples CPU time, heap allocation, contention and wall-clock, and renders flame graphs in the console so you can see which functions burn CPU or allocate memory in production, over time, comparable across versions. Overhead is a few percent, it is free, and it is the tool that finds the hot function you would never reproduce in a load test. There is nothing to query — you read flame graphs and filter by version/zone/profile-type.

Error Reporting

Error Reporting automatically aggregates and deduplicates exceptions and stack traces from your logs into error groups — so a million instances of the same NullPointerException become one entry with a count, a first/last-seen timestamp, a sparkline, and a sample stack trace. It works out of the box for crashes logged in the right structured format from App Engine, Cloud Run, Cloud Functions, GKE and Compute Engine, and lets you set notifications on new error types, mark issues resolved/muted, and link to the offending source. It is the fastest way to answer “what’s broken right now and is it new?”.
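As a rough illustration of "logged in the right structured format", the Python sketch below emits an exception as a single JSON log line with severity ERROR and the full stack trace in the message field. The helper name is hypothetical, and it assumes a runtime that parses JSON lines from stdout or stderr into structured log entries; whether a given entry is grouped also depends on the service and its configuration.

```python
import json
import traceback


def error_log_line(exc: BaseException) -> str:
    """Serialise an exception as one JSON log line with its stack trace."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return json.dumps({"severity": "ERROR", "message": stack})


try:
    1 / 0
except ZeroDivisionError as exc:
    line = error_log_line(exc)

print(line)
```

Keeping the whole trace in one entry matters: a stack trace split across many single-line log entries is much harder to group.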

The Ops Agent

For Compute Engine VMs (and on-prem/other-cloud hosts via BindPlane), the Ops Agent is the single agent that collects both guest metrics and logs and ships them to Cloud Monitoring and Cloud Logging. It replaces the two legacy agents (the old Monitoring Agent based on collectd and the Logging Agent based on Fluentd), which are deprecated — on any new VM, install the Ops Agent, not the legacy pair. The Ops Agent is built on OpenTelemetry Collector (metrics) and Fluent Bit (logs).

What it gives you that the platform cannot see from outside the VM:

Configuration lives in /etc/google-cloud-ops-agent/config.yaml with two stanzas: logging (receivers → processors → service pipelines) and metrics (receivers → processors → service pipelines) — the OpenTelemetry shape. You can install it interactively, or at scale via the Ops Agent policy (a VM Manager / OS Config feature that auto-installs and keeps it updated across a fleet by label).

# Install the Ops Agent on a Linux VM
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent.sh
sudo bash add-google-cloud-ops-agent.sh --also-install
sudo systemctl status google-cloud-ops-agent"*"

The VM’s service account needs roles/monitoring.metricWriter and roles/logging.logWriter; the default Compute service account has these, but locked-down VMs often do not — a top cause of “agent installed but no data”.

Cloud Audit Logs

Audit logs are a distinct, security-critical class of logs that record who did what, where and when in your Google Cloud. There are four types and you must know which are on by default and which cost money:

Audit log type | Records | Default | Configurable? | Cost
Admin Activity | API calls that modify config/metadata (create VM, change IAM) | Always on, cannot be disabled | No | Free
System Event | Google-system actions (live migration, automatic actions) | Always on | No | Free
Data Access | API calls that read config or read/write data (read a GCS object, query BigQuery) | Off by default (except BigQuery) | Yes: enable per service, per type (ADMIN_READ / DATA_READ / DATA_WRITE), with exemptions | Chargeable (high volume)
Policy Denied | When a request is denied by a security policy (VPC Service Controls, org policy) | On when applicable | Exemptable | Chargeable

Admin Activity and System Event logs route to the immutable _Required bucket (400 days, free) — you cannot turn them off, which is exactly what you want for forensics. Data Access logging is where the judgement call lives: it is invaluable for compliance and breach investigation but can be enormous and expensive, so enable it selectively (the services that touch sensitive data) via the project/org IAM audit config, and consider routing it straight to cheap Cloud Storage. Access to audit logs is gated by separate roles — roles/logging.privateLogViewer is required to read Data Access logs, separating “can see app logs” from “can see who-touched-what”.

# Read recent Admin Activity audit logs
gcloud logging read \
  'logName:"cloudaudit.googleapis.com%2Factivity"' \
  --limit=10 --format='table(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)'

[Figure: Google Cloud Operations suite]

The diagram above ties the suite together: workloads emit metrics, logs and traces; the Ops Agent adds guest signals from VMs; logs flow through the Log Router to buckets, BigQuery, Cloud Storage or Pub/Sub while feeding log-based metrics; and Monitoring layers dashboards, alerting policies, uptime checks and SLOs on top — all viewable across projects through a metrics scope.

Hands-on lab

We will install the Ops Agent on a tiny VM, generate a log, build a log-based metric, create an alerting policy with an email channel, add an uptime check, and clean everything up. Costs are pennies and fit inside the free trial; an e2-micro in an eligible region is in the always-free tier.

0. Set your project and enable the APIs.

export PROJECT_ID=$(gcloud config get-value project)
gcloud services enable \
  compute.googleapis.com monitoring.googleapis.com logging.googleapis.com

1. Create a small VM (the Ops Agent target).

gcloud compute instances create ops-lab \
  --zone=us-central1-a --machine-type=e2-micro \
  --image-family=debian-12 --image-project=debian-cloud

2. Install the Ops Agent.

gcloud compute ssh ops-lab --zone=us-central1-a --command='
  curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent.sh &&
  sudo bash add-google-cloud-ops-agent.sh --also-install &&
  systemctl is-active google-cloud-ops-agent'

Expected: active. Within a couple of minutes, agent metrics like agent.googleapis.com/memory/percent_used appear for this instance in Metrics Explorer.

3. Generate a log line we can meter on.

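gcloud compute ssh ops-lab --zone=us-central1-a --command='
  logger -p user.err "KLOUDVIN_LAB synthetic error event"'
# If the line never appears in Logs Explorer, the image may not write /var/log/syslog
# (rsyslog missing); installing rsyslog on the VM and re-running logger is one likely fix.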

4. Create a counter log-based metric for that event.

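# The Ops Agent usually delivers syslog lines as jsonPayload.message; textPayload is matched too in case yours differs
gcloud logging metrics create kloudvin_lab_errors \
  --description="Lab synthetic errors" \
  --log-filter='resource.type="gce_instance" AND (textPayload:"KLOUDVIN_LAB" OR jsonPayload.message:"KLOUDVIN_LAB")'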

5. Create an email notification channel and capture its id.

CH=$(gcloud beta monitoring channels create \
  --display-name="Lab email" --type=email \
  --channel-labels=email_address="$(gcloud config get-value account)" \
  --format='value(name)')
echo "$CH"

6. Create an alerting policy on the log-based metric. Write the policy file, then create it:

cat > /tmp/lab-policy.json <<EOF
{
  "displayName": "Lab error spike",
  "combiner": "OR",
  "conditions": [{
    "displayName": "errors > 0",
    "conditionThreshold": {
      "filter": "metric.type=\"logging.googleapis.com/user/kloudvin_lab_errors\" AND resource.type=\"gce_instance\"",
      "comparison": "COMPARISON_GT",
      "thresholdValue": 0,
      "duration": "0s",
      "aggregations": [{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_SUM"}]
    }
  }],
  "notificationChannels": ["$CH"],
  "alertStrategy": {"autoClose": "1800s"}
}
EOF
gcloud alpha monitoring policies create --policy-from-file=/tmp/lab-policy.json

Generate the log line again (step 3) and within a few minutes you should receive an email and see an open incident in Monitoring → Alerting.

7. Add an uptime check (optional, public endpoint).

gcloud monitoring uptime create "lab-uptime" \
  --resource-type=uptime-url \
  --resource-labels=host=cloud.google.com \
  --protocol=https --path=/ --port=443 --period=5

Validation. In the console: Monitoring → Metrics Explorer shows agent.googleapis.com/* for ops-lab; Logging → Logs Explorer filtered to textPayload:"KLOUDVIN_LAB" OR jsonPayload.message:"KLOUDVIN_LAB" shows your line; Monitoring → Alerting lists the policy and (after re-triggering) an incident; Monitoring → Uptime shows the check passing.

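Cleanup. The VM is the only costly resource, so delete it first. Delete the alerting policy before the notification channel, because a channel that a policy still references cannot be deleted without forcing it.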

gcloud compute instances delete ops-lab --zone=us-central1-a --quiet
gcloud logging metrics delete kloudvin_lab_errors --quiet
# delete the policy and uptime check by id from the list commands:
gcloud alpha monitoring policies list --format='value(name,displayName)'
gcloud alpha monitoring policies delete POLICY_ID --quiet
gcloud monitoring uptime list-configs --format='value(name,displayName)'
gcloud monitoring uptime delete UPTIME_ID --quiet
gcloud beta monitoring channels delete "$CH" --quiet
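
As an alternative to copying ids out of the list commands, the lookups and deletes can be chained. This is a sketch under two assumptions: the display names are the ones used in this lab ("Lab error spike", "lab-uptime"), and `--filter` on `displayName` behaves as gcloud's usual list filtering. The function name `cleanup_by_name` is hypothetical.

```shell
# Sketch: delete the lab policy and uptime check without copying ids by hand.
# Assumes the display names used in this lab.
cleanup_by_name() {
  # Policies first, so the notification channel is no longer referenced.
  for policy in $(gcloud alpha monitoring policies list \
      --filter='displayName="Lab error spike"' --format='value(name)'); do
    gcloud alpha monitoring policies delete "$policy" --quiet
  done

  for check in $(gcloud monitoring uptime list-configs \
      --filter='displayName="lab-uptime"' --format='value(name)'); do
    gcloud monitoring uptime delete "$check" --quiet
  done
}
```

Run it before deleting the notification channel; both delete commands accept the full resource name that the list commands return.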

Cost note. The e2-micro is covered by the free tier in us-west1, us-central1 and us-east1 (this lab uses us-central1); elsewhere it is roughly ₹500–₹700/month if left running, so delete it. Agent metrics and the few log lines are well within the monthly free allotments of 150 MiB of metrics and 50 GiB of logs. Uptime checks at this volume are free. Total lab spend if you clean up the same day: effectively zero.

Common mistakes & troubleshooting

Symptom | Likely cause | Fix
Sink “created” but destination is empty | The sink’s writer identity lacks write permission on the destination | Grant the printed writer identity roles/bigquery.dataEditor / storage.objectCreator / pubsub.publisher
Ops Agent installed, no memory/disk metrics | VM service account missing monitoring.metricWriter / logging.logWriter | Add the roles, or use a SA that has them; restart the agent
Alerting policy never fires | duration/retest window too long, wrong aligner, or grouping hides the breach | Lower the duration, verify the metric in Metrics Explorer first, check group-by
Alert storm / flapping | No retest window and no rate-limit; alerting on raw noisy points | Add a “for N minutes” duration, a notification rate limit, sensible auto-close
Logging bill spikes | Verbose/health-check logs ingested; Data Access logs enabled broadly | Add exclusions on _Default; scope Data Access logging to sensitive services only
“No data” on a custom/Prometheus metric chart | Metric not yet written, wrong metric kind in query (counter not wrapped in rate()) | Confirm ingestion; use rate()/ALIGN_RATE on cumulative metrics
Can’t see another project’s metrics | Project not in this project’s metrics scope | Add it to the scope from the scoping/observability project
Audit log “missing” for a data read | Data Access logging is off by default | Enable DATA_READ for that service in the IAM audit config
Uptime check flaps as “down” | Failing from a single region / strict content match | Require failure from multiple locations; relax/verify the content matcher
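
For the empty-destination symptom, the fix can be sketched as two commands: read the sink's writer identity, then bind a role to it. The function name `grant_sink_writer` and its arguments are hypothetical, and the grant is made at project level for brevity; a dataset-level grant is tighter and closer to what the fix column describes.

```shell
# Sketch: look up a sink's writer identity and grant it write access for a
# BigQuery destination. Both arguments are placeholders supplied by the caller.
grant_sink_writer() {
  sink="$1"      # name of the existing sink
  project="$2"   # project that owns the destination dataset

  # writerIdentity is already in member form, e.g. serviceAccount:...
  writer=$(gcloud logging sinks describe "$sink" --format='value(writerIdentity)')
  if [ -z "$writer" ]; then
    echo "error: sink $sink has no writer identity" >&2
    return 1
  fi

  gcloud projects add-iam-policy-binding "$project" \
    --member="$writer" --role='roles/bigquery.dataEditor'
}
```

For Cloud Storage or Pub/Sub destinations the same pattern would apply with roles/storage.objectCreator or roles/pubsub.publisher.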

Best practices

Security notes

Interview & exam questions

  1. What is a metrics scope and why would you create a separate monitoring project? A metrics scope is a list of projects whose metrics a single scoping project can view, chart and alert on together. A dedicated monitoring project as the scoping project gives you one pane of glass over many workload projects; alerting policies and dashboards in the scoping project can reference any project in its scope (up to ~375).

  2. Explain the Cloud Logging routing model end to end. Every ingested entry hits the Log Router and is matched against each sink’s inclusion/exclusion filter. Matching entries go to the sink’s destination — a log bucket, BigQuery, Cloud Storage, Pub/Sub, or another project. Two sinks exist automatically: _Required (mandated audit logs → immutable _Required bucket) and _Default (everything else → _Default bucket, editable). Each sink runs as a writer identity that needs IAM on the destination.

  3. _Required vs _Default bucket? _Required stores Admin Activity, System Event and Access Transparency logs for a fixed 400 days, is free, and cannot be modified or deleted. _Default stores everything else at 30 days by default (configurable 1–3650), and can be edited or disabled.

  4. What’s an aggregated sink and when do you use it? A sink at the organisation or folder level with --include-children that captures logs from all descendant projects into one central destination — the basis of a centralised logging lake for security and compliance.

  5. MQL vs PromQL vs the visual builder — when each? Visual builder for quick ad-hoc charts; MQL for ratios, cross-metric joins and time-shifts on Google Cloud metrics; PromQL (via Managed Service for Prometheus) to reuse Prometheus/Grafana assets and for Kubernetes-native teams.

  6. Define SLI, SLO, error budget and burn rate. SLI = measured ratio of good events to valid events; SLO = target for that SLI over a window; error budget = 1 − SLO (allowed unreliability); burn rate = how fast you’re spending the budget relative to the steady pace that would use it up exactly at window end (1 = exactly on track to exhaust it by then; 10 = gone in a tenth of the window).

  7. How do you alert on SLOs without flapping? Multi-window, multi-burn-rate alerts: a fast-burn condition (high burn rate over a short window) pages immediately; a slow-burn condition (low burn rate over a long window) opens a ticket. This balances urgency against noise.

  8. Difference between an SLA and an SLO? An SLO is your internal reliability target; an SLA is the external, contractual promise (with penalties). Your SLO should be stricter than your SLA so you get warning before breaching the contract.

  9. What are the four audit-log types and which cost money? Admin Activity (writes, always-on, free), System Event (Google actions, always-on, free), Data Access (reads/data, off by default except BigQuery, chargeable, configurable), Policy Denied (denied requests, chargeable). Admin Activity and System Event are immutable in _Required.

  10. Ops Agent vs the legacy agents, and what does it add? The Ops Agent is the single modern agent (OpenTelemetry + Fluent Bit) replacing the deprecated collectd Monitoring Agent and Fluentd Logging Agent. It adds guest-OS signals the platform can’t see — memory, disk, process metrics and system/app logs — plus built-in integrations for NGINX, MySQL, etc.

  11. A sink shows no data in BigQuery — what’s the first thing you check? The sink’s writer identity’s IAM on the dataset. gcloud prints the writer identity at creation; it needs roles/bigquery.dataEditor on the destination or all entries are silently dropped.

  12. What is a log-based metric and why use one? It converts log matches into a Cloud Monitoring metric — a counter (count of matching entries) or a distribution (histogram of an extracted value). It is how you alert/chart when your only signal is a log line, bridging Logging into Monitoring’s alerting.

  13. Trace vs Profiler vs Error Reporting? Trace shows per-request latency waterfalls across services (distributed tracing). Profiler shows continuous production flame graphs of CPU/heap/contention to find hot code. Error Reporting aggregates and deduplicates exceptions into counted error groups with sample stack traces.

  14. How would you cut a surprise Cloud Logging bill? Add exclusion filters on _Default for health checks and chatty INFO/DEBUG (optionally sampling a percentage), scope Data Access logging to sensitive services only, shorten non-compliance retention, and archive long-term data to Cloud Storage rather than paying for long bucket retention.
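
The error-budget and burn-rate definitions in questions 6 and 7 reduce to a few lines of arithmetic. A minimal sketch using awk; the function name `burn_rate` and the sample numbers are illustrative only, and it assumes an SLO target below 1:

```shell
# Sketch: error budget and burn rate arithmetic with awk.
# Arguments: SLO target (fraction < 1), observed error ratio, window in days.
burn_rate() {
  awk -v slo="$1" -v err="$2" -v days="$3" 'BEGIN {
    budget = 1 - slo                  # allowed unreliability
    rate = err / budget               # 1 = on track to spend it all by window end
    exhaust = (rate > 0) ? sprintf("%.1f", days / rate) : "never"
    printf "budget=%.4f burn_rate=%.1f days_to_exhaustion=%s\n", budget, rate, exhaust
  }'
}

# A 99.9% SLO over 30 days while 1% of requests are failing:
burn_rate 0.999 0.01 30
# prints: budget=0.0010 burn_rate=10.0 days_to_exhaustion=3.0
```

A burn rate of 10 sustained over a short window is the kind of fast-burn condition that would page, while a rate just above 1 over a long window is the slow-burn case that opens a ticket.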

Quick check

  1. Which bucket is immutable, free, and retains Admin Activity logs for 400 days?
  2. You need to view CPU metrics from 30 projects on one dashboard. What configuration makes that possible?
  3. Which query language would you use to reuse existing Prometheus alerting rules against Cloud Monitoring?
  4. What permission must a log sink’s writer identity hold to deliver to a Pub/Sub topic?
  5. In SLO terms, what does an error budget of 0.1% correspond to as an SLO target?

Answers

  1. The _Required log bucket.
  2. Add all 30 projects to the metrics scope of a single scoping (observability) project, then build the dashboard there.
  3. PromQL, via Google Cloud Managed Service for Prometheus.
  4. roles/pubsub.publisher on the destination topic.
  5. A 99.9% SLO (error budget = 1 − SLO).

Exercise

In a sandbox project, build a miniature SRE setup for a Cloud Run service (deploy gcr.io/cloudrun/hello if you don’t have one):

  1. Create a Service in Monitoring for the Cloud Run service and attach a latency SLO (e.g. 99% of requests under 500 ms over a rolling 7-day window).
  2. Add a fast-burn and a slow-burn burn-rate alerting policy, each wired to an email notification channel, with a runbook link in the documentation field.
  3. Create an uptime check against the service’s URL with content matching, and an uptime-failure alert.
  4. Create a counter log-based metric on severity>=ERROR for the service and a threshold alert on it.
  5. Route the service’s logs via a sink into a new Log Analytics-enabled bucket and write one SQL query computing p95 latency per hour.
  6. Export your dashboard and one alerting policy to JSON and re-create them with gcloud — proving you can manage them as code. Then delete everything.

Write down: which alert fired first when you sent error traffic, and whether the SLO or the log-based metric gave earlier warning.
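
As a possible starting point for items 4 and 5, the sketch below creates the counter metric, a Log Analytics-enabled bucket, and a sink into it. Every resource name in it (`run_service_errors`, `run-analytics`, `run-to-analytics`) and the function name `exercise_scaffold` are placeholders, and the filter assumes the service logs under the `cloud_run_revision` resource type.

```shell
# Sketch for exercise items 4 and 5. All resource names below are placeholders.
exercise_scaffold() {
  service="$1"   # Cloud Run service name
  project="$2"   # sandbox project id
  filter="resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"$service\""

  # Item 4: counter metric counting ERROR-and-above entries for the service.
  gcloud logging metrics create run_service_errors \
    --description="Errors from $service" \
    --log-filter="$filter AND severity>=ERROR"

  # Item 5: a Log Analytics-enabled bucket, then a sink that routes into it.
  gcloud logging buckets create run-analytics \
    --location=global --enable-analytics
  gcloud logging sinks create run-to-analytics \
    "logging.googleapis.com/projects/$project/locations/global/buckets/run-analytics" \
    --log-filter="$filter"
}
```

The threshold alert, the SLO, and the SQL query are left for you to build, since working them out is the point of the exercise.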

Certification mapping

Glossary

Next steps
