GCP Lesson 57 of 98

Google BigQuery, In Depth: Datasets, Tables, Partitioning, Slots & Pricing

BigQuery is Google Cloud’s serverless, fully managed data warehouse — a system you can load with a few rows or a few petabytes and query with standard SQL, without ever provisioning, patching, or sizing a server. There is no cluster to spin up, no storage volume to attach, no vacuum job to schedule. You create a dataset, load a table, and run SELECT; Google supplies the storage, the compute, and the elasticity behind a single endpoint. That serverless promise is exactly what makes BigQuery feel magical the first week and expensive the first month: because the engine will happily throw thousands of CPUs at a query that scans a terabyte, a learner who does not understand how BigQuery stores data and how BigQuery charges for compute can run up a real bill with a single careless SELECT *.

This lesson is deliberately exhaustive. We start from the object model — projects → datasets → tables and views — and the columnar storage that underpins everything. We cover every table type (native, external, and BigLake), the two physical-layout levers that decide your bill and your latency (partitioning and clustering), and then the part interviewers and the certification exams probe hardest: the two pricing models — on-demand bytes-scanned versus capacity (slots) with editions and reservations — and the full set of cost controls that stop a single query from paging you about the invoice. From there we go through loading and streaming (including the Storage Write API), BigQuery ML, federation and external queries, the three layers of access control (dataset, column, and row) with CMEK, materialised views, and INFORMATION_SCHEMA. Every core operation comes with real bq CLI and SQL so you can do this by hand or wire it into code. By the end you should be able to whiteboard BigQuery from memory and answer the follow-ups.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You need a Google Cloud project with billing enabled, basic SQL, and the gcloud/bq CLI from the earlier tooling lesson (the bq tool ships with the Cloud SDK). This is the Data deep-dive of the GCP Zero-to-Hero course and the analytics counterpart to the storage and database deep-dives: where Cloud SQL and Spanner serve operational (OLTP) traffic, BigQuery is the analytical (OLAP) warehouse that the rest of your data ecosystem — Pub/Sub, Dataflow, Looker, Vertex AI — reads from and writes to. If your bill or query latency is already a problem, treat the companion optimisation playbook Taming BigQuery Cost and Performance as the applied sequel to this lesson, and BigQuery Fine-Grained Security as the deep dive on column- and row-level access.

Core concepts

Serverless, with storage and compute fully decoupled. This is the single most important mental model in BigQuery. Your data sits in BigQuery’s managed columnar storage (replicated, encrypted, durable) and is billed by the byte-month, independently of any compute. When you run a query, BigQuery’s distributed execution engine — internally Dremel, running on a shared substrate called Borg, shuffling through an in-memory service and reading over the Jupiter petabit network — allocates compute on demand, runs your query, and releases the compute. You are billed for that compute separately from storage, on whichever pricing model you have chosen. Decoupling is why you can have a 5 PB table costing storage but $0 compute until someone queries it, and why ten analysts can hammer the same table without “sizing a cluster” for the busiest one.

The unit of compute is the slot. A slot is a virtual CPU-plus-memory unit of BigQuery processing power. Every query is broken into a tree of execution stages, and stages are run by slots in parallel. On on-demand pricing you do not see slots directly — you are billed by bytes scanned and BigQuery gives you a large shared pool (historically up to ~2,000 slots per project, best-effort). On capacity pricing you buy slots (in increments of 100) and your queries run within that allocation. Understanding slots is the key to understanding performance: a query that is “slow” is usually either scanning too many bytes (a storage-layout problem) or starved of slots (a capacity problem).

The object hierarchy. Resources nest as organisation → folder → project → dataset → table/view/routine/model. The project is the billing and quota boundary and the top-level container for BigQuery. A dataset is a logical container of tables and views that lives in one location (a region or multi-region) and is the unit at which you set a default location, default table expiration, default CMEK, and the primary IAM access grants. A table holds your data; a view is a saved query; a routine is a stored procedure, UDF, or table function; a model is a BigQuery ML model. A subtle but vital rule: a single query cannot join tables across two different locations — datasets in EU and us-central1 are physically separate, so plan locations up front.

Columnar storage (Capacitor) and why SELECT * is the cardinal sin. BigQuery stores each column of a table separately, in a compressed format called Capacitor. When you query, BigQuery reads only the columns your query references — so SELECT user_id FROM events on a 200-column table reads roughly 1/200th of the data of SELECT *. On the on-demand model you are billed for the bytes in the columns you read, so naming columns explicitly (and partition/cluster pruning, below) is not a style preference — it is the bill. Columnar storage is also why BigQuery is fantastic at aggregations over a few columns of enormous tables and comparatively poor at single-row point lookups (use Bigtable or Cloud SQL for those).

Editions. Capacity pricing is packaged as editionsStandard, Enterprise, and Enterprise Plus — each unlocking progressively more features (e.g., column-level security, CMEK, multi-statement transactions, and longer time-travel windows appear at higher tiers) at a higher per-slot-hour price. You attach an edition to a reservation of slots. We cover the matrix in the pricing section.

Projects, datasets, and tables: the object model

A dataset is created inside a project and pinned to a location that you cannot change afterwards (you can only copy/recreate elsewhere). Get the location right the first time. Datasets carry defaults that every table inherits unless overridden.

Dataset setting What it is Choices / default When to set it Gotcha
Location The region or multi-region where data physically lives A region (us-central1, europe-west2, asia-south1…) or multi-region (US, EU); no default — you must choose Always, at creation; align with data-residency and with the apps that read it Immutable; cross-location joins are impossible — you must copy data first
Default table expiration TTL applied to new tables in the dataset Off by default; e.g. 3600s, 30d Scratch/staging datasets so tables self-clean Applies only to new tables created after you set it; existing tables keep their own expiry
Default partition expiration TTL applied to partitions of new partitioned tables Off by default Rolling-window fact tables (keep 90 days, drop older partitions) Per-partition, not per-table — a cheap way to bound storage
Default rounding mode How NUMERIC/BIGNUMERIC values round ROUND_HALF_AWAY_FROM_ZERO (default) or ROUND_HALF_EVEN Financial data needing banker’s rounding Set at dataset level so all tables are consistent
Default CMEK Customer-managed encryption key for new tables Google-managed by default; or a Cloud KMS key Compliance requiring you to hold the key The key must be in the same region as the dataset; grant the BigQuery service account cryptoKeyEncrypterDecrypter
Description / labels Free-text + key:value metadata Empty by default Cost attribution and discovery Labels propagate to billing export — use them for FinOps
Time travel window How far back you can query historical data (FOR SYSTEM_TIME AS OF) 7 days default; configurable 2–7 days Shorten to cut storage cost; lengthen for safety Shorter window slightly reduces storage charges

Create a dataset and a table with bq:

# Create a dataset in the EU multi-region with a 7-day default table expiry
bq --location=EU mk \
  --dataset \
  --default_table_expiration 604800 \
  --description "Analytics staging" \
  my_project:analytics

# Create an empty native table with an explicit schema
bq mk --table \
  my_project:analytics.events \
  event_id:STRING,user_id:STRING,event_ts:TIMESTAMP,country:STRING,revenue:NUMERIC

Or with SQL (DDL), which is the modern, more expressive path:

CREATE SCHEMA IF NOT EXISTS `my_project.analytics`
  OPTIONS (location = 'EU', default_table_expiration_days = 7);

CREATE TABLE `my_project.analytics.events` (
  event_id  STRING,
  user_id   STRING,
  event_ts  TIMESTAMP,
  country    STRING,
  revenue   NUMERIC
);

Schema, modes, and nested data. Each column has a type (STRING, INT64, FLOAT64, NUMERIC/BIGNUMERIC, BOOL, BYTES, DATE, DATETIME, TIME, TIMESTAMP, GEOGRAPHY, JSON, INTERVAL, RANGE) and a mode: NULLABLE (default), REQUIRED, or REPEATED. A REPEATED column is an array; a RECORD/STRUCT column is a nested object. The combination — REPEATED RECORD — lets BigQuery model a one-to-many relationship inside a single row (e.g. an order with line items), which you query with UNNEST. Denormalising with nested/repeated fields is idiomatic BigQuery and usually beats classic star-schema joins at warehouse scale. Schema evolution is limited: you can add columns and relax a REQUIRED column to NULLABLE; you generally cannot rename, drop (except via DDL DROP COLUMN, which is supported), reorder, or change a column’s type in place — for the latter you recreate or use a CREATE TABLE AS SELECT with a cast.

Table types: native, external, and BigLake

Not every table physically lives in BigQuery storage. There are three families, and the distinction matters for performance, cost, and governance.

Table type Where data lives Format Performance Cost model Governance Use when
Native (managed) BigQuery’s own columnar storage (Capacitor) Internal Fastest; full feature set (partitioning, clustering, DML, time travel) Storage billed by BigQuery; query billed by bytes/slots Full (column/row security, CMEK) The default for anything you query often
External (federated) Outside BigQuery — Cloud Storage, Bigtable, Google Sheets, Cloud SQL (via connection) Parquet, ORC, Avro, CSV, JSON, etc. Slower (reads source each time; no caching of layout) No BigQuery storage charge; query billed by bytes read Limited — no fine-grained security on most Ad-hoc query of data you don’t want to load; data already in a lake
BigLake Cloud Storage (or other clouds via BigQuery Omni) Open formats (Parquet, ORC, Avro) + Apache Iceberg Near-native via metadata caching and manifest support No BigQuery storage charge; query billed by bytes; metadata cache cost Full fine-grained security (row/column) on lake data A governed lakehouse — open files with BigQuery-grade access control

Native tables are the workhorse: load once, then partitioning, clustering, DML (INSERT/UPDATE/MERGE/DELETE), time travel, and all security features apply.

External tables point BigQuery at data it does not own. You define a schema (or let BigQuery auto-detect) and BigQuery reads the source at query time. They are perfect for “query this CSV in a bucket once” or for joining a small reference Google Sheet into a report, but they pay for it: every query re-reads and re-parses the source, you cannot partition/cluster the underlying files the way you can a native table (though you can use Hive-style partitioning layouts in the path), and most fine-grained security does not apply. A useful middle ground is a partitioned external table over a Hive-partitioned bucket layout (/dt=2026-06-15/...), which lets BigQuery prune by the partition column without loading the data.

BigLake tables are the strategic answer to “I have a data lake of open-format files but I want BigQuery’s governance.” A BigLake table is an external table accessed through a connection (a service-account identity managed by BigQuery), which means BigQuery — not the caller — reads the files. That indirection is what unlocks row-level and column-level security on data that physically sits as Parquet in a bucket, plus metadata caching for near-native performance and support for Apache Iceberg (including, with BigLake managed tables, DML and streaming into Iceberg). With BigQuery Omni, BigLake can even query data sitting in Amazon S3 or Azure Blob Storage in place. If you are building a lakehouse on GCP, BigLake is the default; plain external tables are for the quick, ungoverned case.

Partitioning: time, ingestion-time, and integer-range

Partitioning physically divides a table into segments so that a query with a filter on the partition column reads only the relevant segments — partition pruning. This is the single biggest lever on on-demand cost and a major lever on slot consumption. A table can have one partitioning scheme.

Partition type Partition column Granularity Pruned by Typical use
Time-unit column A DATE, TIMESTAMP, or DATETIME column you choose Hour, day (default), month, or year WHERE event_ts BETWEEN … Fact/event tables with a natural event time
Ingestion time The pseudo-column _PARTITIONTIME (and _PARTITIONDATE) Hour, day, month, year WHERE _PARTITIONTIME >= … Streaming/append loads where you partition by when data arrived
Integer range An INT64 column, with start/end/interval Buckets of the range WHERE customer_id BETWEEN … Sharding by a numeric key (customer_id, region_id)

Key facts an exam will test:

CREATE TABLE `my_project.analytics.events_part`
PARTITION BY DATE(event_ts)
OPTIONS (
  require_partition_filter = TRUE,
  partition_expiration_days = 90
) AS
SELECT * FROM `my_project.analytics.events`;

-- This query prunes to a single day's partition:
SELECT country, SUM(revenue)
FROM `my_project.analytics.events_part`
WHERE DATE(event_ts) = '2026-06-15'   -- partition filter (required)
GROUP BY country;

Use the Query validator (the green tick in the console, or --dry_run) to confirm pruning is working — it shows “This query will process N” before you run it. A dry run that still reports the full table size means your filter is not pruning (a common cause: wrapping the partition column in a non-trivial function so BigQuery cannot match it to partition boundaries).

Clustering: when, how, and how it differs from partitioning

Clustering sorts the data within each partition (or within the whole table if it is unpartitioned) by up to four columns, in priority order. When a query filters or aggregates on the leading clustered column(s), BigQuery reads only the relevant sorted blocks — block pruning — and can also speed up GROUP BY/ORDER BY on those columns.

The distinction the certification loves:

Rules of thumb:

CREATE TABLE `my_project.analytics.events_pc`
PARTITION BY DATE(event_ts)
CLUSTER BY country, user_id AS
SELECT * FROM `my_project.analytics.events`;

Pricing models: on-demand vs capacity (slots & editions)

This is the section that decides your bill, so treat it as load-bearing. BigQuery bills storage and compute separately, and compute has two models you choose between (you can even mix them across projects/reservations).

Storage pricing (same regardless of compute model):

Storage type What it is Roughly Notes
Active storage Data in tables/partitions modified in the last 90 days ~$0.02 / GB-month (region-dependent) The default
Long-term storage A table/partition not modified for 90 days ~50% off automatically (~$0.01 / GB-month) No action needed; same performance; the discount is per-partition
Physical (compressed) storage billing Bill on compressed bytes instead of logical (uncompressed) bytes Higher per-GB rate but on far fewer bytes Opt in per dataset; usually cheaper for well-compressing data — but it then bills time-travel + fail-safe storage too

Compute — Model 1: On-demand (bytes scanned). You pay per byte your queries read (~$6.25 per TiB scanned, region-dependent), with the first 1 TiB per month free. No commitment, no slot management; BigQuery gives you a large shared, best-effort slot pool. This is ideal for spiky, unpredictable, or low-volume usage. The risks: cost is proportional to bytes scanned (so layout discipline is everything), and you have no isolation — a noisy project shares the pool.

Compute — Model 2: Capacity (slots via editions + reservations). You buy slots and pay per slot-hour regardless of bytes scanned. You pick an edition (Standard/Enterprise/Enterprise Plus), create a reservation of a baseline number of slots, and optionally enable autoscaling up to a maximum. Slots can be purchased pay-as-you-go (per second, no commitment) or with 1-year or 3-year commitments for a discount. This model is ideal for steady, high-volume workloads, for cost predictability (you cap compute spend at the reservation ceiling), and for workload isolation (give ETL and BI their own reservations so neither starves the other).

The edition feature matrix (representative — confirm current details against pricing docs):

Capability Standard Enterprise Enterprise Plus
Pay-as-you-go slots Yes Yes Yes
1yr / 3yr commitments Yes Yes
Autoscaling reservations Yes Yes Yes
Max time-travel window 7 days 7 days longer
Column-level & row-level security Yes Yes
CMEK Yes Yes
Multi-statement transactions, BigQuery ML (advanced), VPC-SC integration limited Yes Yes (+ compliance: assured workloads)

How to choose — the decision table:

Situation Pick Why
New project, unpredictable/low volume On-demand No commitment; first 1 TiB/month free; pay only for what you scan
Steady multi-TB/day, want a predictable bill Capacity (commitment) Cap compute spend; discount on committed slots
Need workload isolation (ETL vs BI vs ad-hoc) Capacity with multiple reservations Each workload gets guaranteed slots
Need column/row security or CMEK Capacity, Enterprise+ Those features require an edition
Bursty batch jobs a few hours a day Capacity autoscaling, PAYG Baseline 0, scale up only when jobs run, pay per second

A common mature pattern is hybrid: keep ad-hoc analyst queries on on-demand (cheap when idle) while pinning the predictable ETL pipeline to a committed reservation.

Manage reservations with bq:

# Create an Enterprise edition reservation: 100 baseline slots, autoscale to 500
bq mk --reservation \
  --project_id=admin_project --location=US \
  --edition=ENTERPRISE \
  --slots=100 --autoscale_max_slots=500 \
  prod_reservation

# Assign a project's queries to that reservation
bq mk --reservation_assignment \
  --reservation_id=admin_project:US.prod_reservation \
  --job_type=QUERY \
  --assignee_id=my_data_project --assignee_type=PROJECT

Cost control: the guardrails that cap the bill

On the on-demand model, a single query can cost real money. BigQuery gives you layered guardrails — set them as defaults, not afterthoughts.

Control Scope What it does
Maximum bytes billed Per query / per job Hard ceiling — the query fails instead of running if it would scan more than the limit (--maximum_bytes_billed or OPTIONS)
require_partition_filter Per table No query may scan the table without a partition filter
Custom quotas Per project / per user / per day Cap total bytes a project or a user can scan in a day
Reservations Org-wide compute On capacity, your bill cannot exceed the reservation ceiling
Dry run / query validator Per query Shows bytes-to-be-scanned before you run; budget gate in CI
Materialised views & BI Engine Per query pattern Serve repeated queries from precomputed/in-memory results, scanning far less
Long-term storage & partition expiry Per table/partition Automatically shrink storage cost
Budgets & alerts Billing account Email/Pub/Sub alert at thresholds (does not stop spend, but warns)
# Fail any query that would scan more than 10 GB
bq query --use_legacy_sql=false \
  --maximum_bytes_billed=10000000000 \
  'SELECT country, SUM(revenue) FROM `my_project.analytics.events_part`
   WHERE DATE(event_ts) = "2026-06-15" GROUP BY country'
# Dry run: how many bytes will this scan? (no compute charged)
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM `my_project.analytics.events_part`'

The query results cache is a free, automatic win: identical queries (same SQL, same underlying data, deterministic) are served from cache at no compute cost. Caching is disabled by non-deterministic functions (CURRENT_TIMESTAMP(), RAND()), by querying tables that have changed, and when you opt out. Writing cache-friendly SQL is a legitimate cost technique.

Loading and streaming data

There are many ways to get data into a native table; the right one depends on latency, volume, and cost.

Method Latency Cost Use when
Batch load job (from Cloud Storage or local file) Minutes Free (no compute charge for the load itself) Bulk import of files; the default for ETL
LOAD DATA / bq load Minutes Free Same as above, via SQL or CLI
INSERT / MERGE (DML) Seconds–minutes Billed as a query (bytes/slots) Transforming and upserting within BigQuery
Legacy streaming (tabledata.insertAll) Seconds Per-GB streaming charge Older real-time ingest (being superseded)
Storage Write API Seconds Cheaper per-GB than legacy streaming; some free allowance The modern real-time + high-throughput path
Query results to table (CREATE TABLE AS SELECT) Query time Billed as a query Materialising a transformation
BigQuery Data Transfer Service Scheduled Free transfer, query billed Recurring loads from SaaS/other clouds/GCS

Batch loads are free and the workhorse. Supported formats: Avro, Parquet, ORC (self-describing and fastest to load), plus CSV, JSON (newline-delimited), Datastore/Firestore exports. You control --source_format, --autodetect (schema inference), write disposition (WRITE_APPEND/WRITE_TRUNCATE/WRITE_EMPTY), --max_bad_records, and delimiters/skip-leading-rows for CSV.

# Batch load newline-delimited JSON from a bucket, append, auto-detect schema
bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect \
  my_project:analytics.events \
  gs://my-bucket/events/2026-06-15/*.json

The Storage Write API is the modern, unified ingestion path and the one to learn. It is a high-throughput gRPC streaming API that supports exactly-once delivery (via stream offsets), at-least-once for maximum throughput, and buffered/pending streams for transactional commits. It is cheaper than the legacy insertAll streaming API, integrates natively with Dataflow, and is what you should reach for whenever you need rows queryable within seconds of arrival. The older tabledata.insertAll still works but new designs should default to the Storage Write API.

Time travel and the fail-safe window. BigQuery keeps changed/deleted data for the dataset’s time-travel window (2–7 days) so you can query or restore a table as it was (FOR SYSTEM_TIME AS OF), and beyond that a non-configurable 7-day fail-safe that Cloud Support can recover from. This is your safety net against a bad MERGE or an accidental DELETE.

BigQuery ML: models in SQL

BigQuery ML (BQML) lets you create and run machine-learning models using only SQL, inside the warehouse, with no data movement. You CREATE MODEL, BigQuery trains it (consuming slots/bytes like a query), and you score with ML.PREDICT. It democratises ML for analysts and is ideal when the data already lives in BigQuery.

Supported model families include linear & logistic regression, k-means clustering, matrix factorisation (recommenders), time-series forecasting (ARIMA_PLUS), boosted trees and random forests (via XGBoost), deep neural networks and wide-and-deep, AutoML Tables, PCA, and imported TensorFlow/ONNX models. Crucially, BQML also offers remote models over Vertex AI, including calls to generative/LLM endpoints (ML.GENERATE_TEXT, embeddings via ML.GENERATE_EMBEDDING) — so you can run generative AI over warehouse rows in SQL.

-- Train a logistic-regression churn model
CREATE OR REPLACE MODEL `my_project.analytics.churn_model`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `my_project.analytics.customer_features`;

-- Evaluate, then predict
SELECT * FROM ML.EVALUATE(MODEL `my_project.analytics.churn_model`);

SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(MODEL `my_project.analytics.churn_model`,
                (SELECT * FROM `my_project.analytics.scoring_set`));

Federation and external queries

Beyond external/BigLake tables, BigQuery can run federated queries against operational databases via the EXTERNAL_QUERY function and a BigQuery connection to Cloud SQL or AlloyDB (and Spanner). The query is pushed to the source database and results streamed back, letting you join live operational data with warehouse data without an ETL pipeline:

SELECT w.country, SUM(w.revenue) AS warehouse_rev, live.open_tickets
FROM `my_project.analytics.events_part` AS w
JOIN EXTERNAL_QUERY(
  'my_project.us.cloudsql_conn',
  'SELECT country, COUNT(*) AS open_tickets FROM tickets WHERE status = ''open'' GROUP BY country'
) AS live
USING (country)
WHERE DATE(w.event_ts) = '2026-06-15'
GROUP BY w.country, live.open_tickets;

Use federation for freshness and to avoid copying small reference data; for large analytical scans, load into native tables.

Security: dataset, column, and row access (with CMEK)

BigQuery secures data in layers; an exam expects you to know all three plus encryption.

IAM at project/dataset/table level (coarse-grained). Predefined roles map to intent: roles/bigquery.dataViewer (read data + metadata), dataEditor (read/write data), dataOwner (full control of a dataset), bigquery.jobUser (run jobs — needed in addition to data access to actually query), bigquery.user (run jobs + create datasets), and bigquery.admin. The classic gotcha: a user with dataViewer on a dataset but no jobUser in the project can see tables but cannot run a query — both are required. You can grant access at the dataset level (the primary boundary) and, increasingly, down to table/view level with IAM conditions.

Column-level security (CLS). Tag sensitive columns with policy tags from a Data Catalog taxonomy; only principals granted the Fine-Grained Reader role on a tag can read those columns. Layer dynamic data masking on top to return a masked value (hash, default, nullify, last-4) to lower-privileged readers instead of denying the column entirely.

Row-level security (RLS). Create row access policies so each principal sees only the rows they are entitled to — e.g. an EMEA analyst sees only region = 'EMEA':

CREATE ROW ACCESS POLICY emea_only
ON `my_project.analytics.events_part`
GRANT TO ('group:emea-analysts@example.com')
FILTER USING (country IN ('FR','DE','GB','ES','IT'));

Both CLS and RLS are Enterprise-edition features (or require the equivalent on on-demand). For the full treatment — taxonomies, masking routines, and authorised views — see BigQuery Fine-Grained Security.

Authorised views, datasets, and routines. An authorised view lets users query a curated view in one dataset without granting them access to the underlying tables in another — you author the view’s logic (selecting/aggregating only what they should see) and authorise the view (not the user) onto the source dataset. Authorised datasets and routines extend the same idea.

Encryption. Data is always encrypted at rest with Google-managed keys by default. For control, set a CMEK (Cloud KMS key) at the dataset (default) or table level; the key must be regionally co-located and the BigQuery service agent needs encrypt/decrypt rights. CMEK is an Enterprise-edition capability. Encryption in transit is automatic. For defence-in-depth around the perimeter, wrap BigQuery in VPC Service Controls to stop data exfiltration to projects outside your perimeter.

Materialised views and BI Engine

A materialised view (MV) is a precomputed, incrementally maintained result of a query over a base table. Unlike a logical view (which re-runs every time), an MV stores the aggregated result and BigQuery refreshes it automatically as the base table changes — and, critically, the optimiser will silently rewrite a query against the base table to read the MV when it can, so existing queries get faster without being changed (smart tuning). MVs are ideal for expensive, repeated aggregations (daily revenue by country). Constraints: they support a subset of SQL (aggregations, GROUP BY; limited joins), incur storage and refresh cost, and may serve slightly stale results between refreshes (you control max staleness). Compare with a plain scheduled query writing to a table (full control, manual freshness) and with BI Engine, an in-memory analysis layer you allocate (in GB) that caches hot data to serve dashboards (Looker Studio, Looker) at sub-second latency, scanning far fewer billed bytes.

CREATE MATERIALIZED VIEW `my_project.analytics.daily_rev`
PARTITION BY day AS
SELECT DATE(event_ts) AS day, country, SUM(revenue) AS revenue
FROM `my_project.analytics.events_part`
GROUP BY day, country;

INFORMATION_SCHEMA: observing the warehouse

BigQuery exposes its own metadata and telemetry through INFORMATION_SCHEMA views — your built-in observability and FinOps toolkit. Useful views:

-- Top 10 most expensive queries in the last 7 days
SELECT user_email,
       job_id,
       ROUND(total_bytes_billed / POW(1024,4), 3) AS tib_billed,
       query
FROM `region-EU`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY' AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 10;

Google BigQuery deep dive

The diagram traces a query from the project and dataset boundary, through a partitioned-and-clustered native table in columnar storage, into the slot-based execution engine, and shows where each pricing model and each access-control layer applies.

Hands-on lab

This lab uses the public dataset and your free tier (1 TiB scanned/month + 10 GB storage free) — total cost effectively $0 if you stay within the free allowance. You need a project with billing enabled and the Cloud SDK (bq).

1. Create a dataset (EU multi-region):

bq --location=EU mk --dataset --description "BigQuery lab" "$(gcloud config get-value project):bqlab"

2. Dry-run a query against a public dataset to see bytes-to-scan (no charge):

bq query --use_legacy_sql=false --dry_run \
  'SELECT name, SUM(number) AS births
   FROM `bigquery-public-data.usa_names.usa_1910_2013`
   WHERE state = "CA" GROUP BY name ORDER BY births DESC LIMIT 10'

Expected: a line like “Query successfully validated. … will process N bytes” — note N is small because only two columns are read (columnar storage in action).

3. Materialise a small, partitioned + clustered table into your dataset:

bq query --use_legacy_sql=false \
'CREATE TABLE `bqlab.names_by_state`
 PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(1910, 2014, 10))
 CLUSTER BY state, name AS
 SELECT state, name, year, number
 FROM `bigquery-public-data.usa_names.usa_1910_2013`'

4. Confirm pruning works — run with and without a cluster-key filter and compare bytes billed via --dry_run:

bq query --use_legacy_sql=false --dry_run \
  'SELECT SUM(number) FROM `bqlab.names_by_state` WHERE state = "NY"'

5. Set a cost guardrail and prove it fails a runaway query:

bq query --use_legacy_sql=false --maximum_bytes_billed=1000000 \
  'SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013`'
# Expect: an error — "Query exceeded limit for bytes billed" — the guardrail worked.

6. Inspect your spend with INFORMATION_SCHEMA:

bq query --use_legacy_sql=false \
 'SELECT job_id, total_bytes_billed
  FROM `region-EU`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE job_type="QUERY" ORDER BY creation_time DESC LIMIT 5'

Validation: the dry run in step 2 reports a small byte count; the table in step 3 is created; step 5 fails by design; step 6 lists your recent jobs with bytes billed.

Cleanup: delete the whole dataset (and its tables) to leave nothing behind.

bq rm -r -f -d "$(gcloud config get-value project):bqlab"

Cost note: storage of the small lab table is a few KB (free tier covers 10 GB), and every query above was either a free dry run or scanned a few MB against your 1 TiB free monthly allowance — so the lab is effectively free. The only way to incur cost here is to drop the --dry_run/guardrails and SELECT * the full public table repeatedly.

Common mistakes & troubleshooting

Symptom Likely cause Fix
Query scans the whole table despite a date filter Partition column wrapped in a function, or no partitioning at all Filter directly on the partition column; verify with --dry_run; add partitioning
Surprise large bill on on-demand SELECT *, no partition/cluster pruning, no maximum_bytes_billed Name columns; partition + cluster; set max-bytes-billed and require_partition_filter
“User does not have permission to query” but they can see the table Has dataViewer but no jobUser/user in the project Grant roles/bigquery.jobUser (running a query needs both data and job permissions)
Cannot join two tables — “not found in location” Datasets in different locations (e.g. US vs EU) Copy one dataset to the other location; you cannot join across locations
Streaming rows not visible / can’t be updated immediately Rows in the streaming buffer (__UNPARTITIONED__) Wait for colocation; design for eventual visibility; prefer Storage Write API
Clustering “isn’t helping” (dry run shows full size) Cluster pruning is best-effort and invisible to dry run; or filtering a non-leading cluster column Check actual bytes billed in JOBS; filter the leading clustered column(s)
CREATE TABLE with CMEK fails KMS key in wrong region, or BigQuery service agent lacks key access Co-locate key with dataset; grant roles/cloudkms.cryptoKeyEncrypterDecrypter to the BQ service account
Materialised view not used by the optimiser Query shape outside MV’s supported rewrite, or MV stale beyond max staleness Align the query/MV; check refresh; consider a scheduled query instead

Best practices

Security notes

Cost & sizing

The levers that move the BigQuery bill, in order of impact:

  1. Bytes scanned (on-demand) — partition + cluster pruning, naming columns, and the query cache cut this directly.
  2. Slot-hours (capacity) — your reservation baseline + autoscale max set the ceiling; commitments discount it; isolate workloads to avoid over-provisioning one big pool.
  3. Storage — active vs long-term (automatic 50% off after 90 days), logical vs physical (compressed) billing, and partition expiry.
  4. Streaming — legacy insertAll is pricier per-GB than the Storage Write API; batch loads are free.
  5. Materialised views / BI Engine — spend a little on precompute/cache to save a lot on repeated scans.

Sizing capacity: start with a small baseline (e.g. 100 slots) plus autoscaling, watch INFORMATION_SCHEMA.JOBS for slot contention and queue time, and only then buy a commitment for the steady baseline you actually use. For the applied tuning workflow end to end, see Taming BigQuery Cost and Performance.

Interview & exam questions

  1. Why is SELECT * discouraged in BigQuery, beyond style? Because storage is columnar (Capacitor) and on-demand billing is by bytes read — * reads every column, multiplying the bytes scanned (and the cost) versus naming only the columns you need.
  2. Partitioning vs clustering — when each, and how do their cost estimates differ? Partition (one column, usually time) for coarse, exact pruning known before you run (dry run shows it); cluster (up to four high-cardinality columns) for fine-grained, best-effort pruning whose savings appear only in actual bytes billed. Use both together.
  3. On-demand vs capacity (slots/editions) — how do you choose? On-demand (bytes scanned, first 1 TiB/month free) for spiky/low/unpredictable usage; capacity (buy slots per edition, optionally committed) for steady high volume, predictable bills, and workload isolation. Hybrid is common.
  4. What is a slot? A unit of BigQuery compute (vCPU + memory) that executes a stage of a query in parallel; queries are billed by bytes (on-demand) or constrained by your purchased slots (capacity).
  5. Native vs external vs BigLake tables? Native = data in BigQuery storage, full features. External = data read in place from GCS/Sheets/etc. each query, limited governance. BigLake = external open-format files accessed via a connection, giving near-native performance and row/column security (the governed lakehouse).
  6. A user can see tables but every query fails with a permission error — why? They have data access (dataViewer) but not bigquery.jobUser (or user); running a query needs both data and job permissions.
  7. How do you stop one query from costing a fortune? maximum_bytes_billed (fails over the cap), require_partition_filter, custom per-user/project daily quotas, dry-run gating in CI, and on capacity, the reservation ceiling.
  8. What is the Storage Write API and why prefer it over insertAll? A high-throughput gRPC streaming ingest API with exactly-once (offsets), at-least-once, and pending/buffered streams; it is cheaper and the modern default for real-time ingest.
  9. How does BigQuery do machine learning without moving data? BigQuery ML — CREATE MODEL / ML.PREDICT in SQL — trains and scores in the warehouse, supports regression/clustering/forecasting/boosted-trees/DNN and remote Vertex AI / generative models.
  10. Explain the three layers of access control. Dataset/table IAM (coarse), column-level security via policy tags + masking, and row-level access policies (FILTER USING); authorised views expose curated data without base-table access.
  11. Why can’t you join a US dataset with an EU dataset? Datasets are physically pinned to a location that is immutable; a query runs in one location, so you must copy data to a common location first.
  12. What is a materialised view and how does it save money “for free”? A precomputed, incrementally refreshed query result; the optimiser transparently rewrites base-table queries to read the MV (smart tuning), so existing queries scan far fewer bytes without being changed.

Quick check

  1. On the on-demand model, what determines the cost of a query?
  2. How many columns can you cluster by, and what kind of column suits clustering?
  3. Which permission, in addition to data access, is required to run a query?
  4. Which table type gives row/column security over open-format files in a bucket?
  5. What does require_partition_filter = TRUE do?

Answers

  1. The number of bytes scanned/billed (driven by the columns referenced and partition/cluster pruning); first 1 TiB/month is free.
  2. Up to four columns, in priority order; high-cardinality columns you filter/aggregate by equality.
  3. roles/bigquery.jobUser (or bigquery.user) — you need both job and data permissions.
  4. BigLake tables (external files accessed via a connection, with fine-grained security).
  5. Forces every query against the table to include a filter on the partition column, preventing accidental full-table scans.

Exercise

Take a wide event table you control (or bigquery-public-data.google_analytics_sample.ga_sessions_*). (1) Create a native copy partitioned by date and clustered by two high-cardinality columns, with require_partition_filter on and a 90-day partition expiry. (2) Write a query that prunes to one day and uses the cluster keys; confirm bytes-to-scan with --dry_run and compare against the unpartitioned original. (3) Add --maximum_bytes_billed and deliberately trip it. (4) Build a materialised view of a daily aggregate and confirm via INFORMATION_SCHEMA.JOBS that querying the base table now bills fewer bytes. (5) Apply a row access policy restricting one group to a subset of rows and verify the filter applies. Write up which lever saved the most bytes and why.

Certification mapping

Glossary

Next steps

GCPBigQueryData EngineeringPartitioningSlotsACE
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments