Quick take — A reusable hashicorp/google Terraform module for BigQuery: provision a dataset with location, default expirations, CMEK and access grants, plus partitioned, clustered tables with JSON schemas and deletion protection from typed variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "bigquery" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"
project_id = "..." # GCP project ID that owns the dataset and tables.
dataset_id = "..." # Dataset ID; letters, numbers and underscores only.
location = "..." # Dataset location (`US`, `EU`, or a region). Immutable a…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
BigQuery is Google Cloud’s serverless, columnar data warehouse: you create a dataset (a regional or multi-regional container that scopes location, IAM, default expirations and the CMEK key) and then tables inside it that hold the data. There is no cluster to size — you pay for bytes stored and for bytes scanned per query — which makes BigQuery cheap to start and very easy to turn into a surprise bill when an unpartitioned table gets SELECT *-ed over its entire history.
The two design decisions that control that cost live on the table: partitioning (usually by a DATE/TIMESTAMP column or by ingestion time, so a query with a date filter scans one day instead of three years) and clustering (sorting within a partition by up to four columns so equality and range filters prune blocks). Get those right, plus require_partition_filter to force callers to supply a partition predicate, and the same table can be an order of magnitude cheaper to query. Get them wrong — or forget them — and every query is a full-table scan.
A raw google_bigquery_dataset + google_bigquery_table pair also forces every team to re-decide the same governance details each time: the immutable location, default table/partition expirations, delete_contents_on_destroy, deletion_protection, the access list, and the encryption key. This module wraps google_bigquery_dataset and a for_each set of google_bigquery_table resources behind typed, validated variables so a consuming team passes intent — “a US dataset, CMEK-encrypted, two tables partitioned by day and clustered, protected” — and gets a correct, governed warehouse every time.
When to use it
- You are standing up per-domain or per-team datasets (
sales_curated,events_raw) and want consistent location, expiration, labels and IAM rather than hand-built one-offs. - You need tables that are partitioned and clustered by default so query cost stays bounded as data grows, and
require_partition_filterto stop accidental full scans. - You want lifecycle hygiene — default table/partition expirations on a raw/staging dataset so transient data ages out automatically instead of accumulating storage cost forever.
- You must satisfy encryption or residency requirements: a customer-managed (CMEK) key on the dataset and a pinned
locationthat auditors can verify in code. - You want deletion protection and a deliberate
delete_contents_on_destroychoice so a strayterraform destroycannot silently drop a populated dataset. - Reach for plain SQL
CREATE TABLEorbqinstead when you are doing throwaway exploration where none of this governance matters.
Module structure
terraform-module-gcp-bigquery/
├── versions.tf # provider + Terraform version pins
├── main.tf # dataset + access + CMEK + tables
├── variables.tf # typed, validated inputs
└── outputs.tf # dataset id/self_link + table ids/refs
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
resource "google_bigquery_dataset" "this" {
project = var.project_id
dataset_id = var.dataset_id
location = var.location
friendly_name = coalesce(var.friendly_name, var.dataset_id)
description = var.description
# Defaults inherited by tables that do not set their own expiration.
# null = never expire. Values are in milliseconds.
default_table_expiration_ms = var.default_table_expiration_ms
default_partition_expiration_ms = var.default_partition_expiration_ms
# When true, `terraform destroy` can drop a dataset that still has tables.
# Keep false for any dataset that holds real data.
delete_contents_on_destroy = var.delete_contents_on_destroy
labels = var.labels
# Customer-managed encryption for data at rest (optional).
dynamic "default_encryption_configuration" {
for_each = var.kms_key_name == null ? [] : [var.kms_key_name]
content {
kms_key_name = default_encryption_configuration.value
}
}
# Access grants. Each entry sets exactly one principal type
# (user_by_email / group_by_email / special_group / iam_member)
# against a role. Setting `access` replaces BigQuery's default ACL,
# so include owners explicitly when you use it.
dynamic "access" {
for_each = var.access
content {
role = access.value.role
user_by_email = access.value.user_by_email
group_by_email = access.value.group_by_email
special_group = access.value.special_group
iam_member = access.value.iam_member
}
}
}
resource "google_bigquery_table" "this" {
for_each = var.tables
project = var.project_id
dataset_id = google_bigquery_dataset.this.dataset_id
table_id = each.key
description = each.value.description
schema = each.value.schema
labels = each.value.labels
# Guards the table against `terraform destroy` / replacement.
deletion_protection = each.value.deletion_protection
clustering = each.value.clustering
# Column-based or ingestion-time DAY/HOUR/MONTH/YEAR partitioning.
dynamic "time_partitioning" {
for_each = each.value.time_partitioning == null ? [] : [each.value.time_partitioning]
content {
type = time_partitioning.value.type
field = time_partitioning.value.field
expiration_ms = time_partitioning.value.expiration_ms
require_partition_filter = time_partitioning.value.require_partition_filter
}
}
# Per-table CMEK override; falls back to the dataset key when null.
dynamic "encryption_configuration" {
for_each = each.value.kms_key_name == null ? [] : [each.value.kms_key_name]
content {
kms_key_name = encryption_configuration.value
}
}
}
variables.tf
variable "project_id" {
type = string
description = "GCP project ID that owns the dataset and tables."
}
variable "dataset_id" {
type = string
description = "Dataset ID. Letters, numbers and underscores only; up to 1024 chars."
validation {
condition = can(regex("^[A-Za-z0-9_]{1,1024}$", var.dataset_id))
error_message = "dataset_id may contain only letters, numbers and underscores (no hyphens or dots)."
}
}
variable "location" {
type = string
description = "Dataset location (e.g. 'US', 'EU', or a region like 'asia-south1'). IMMUTABLE after create."
validation {
condition = length(var.location) > 0
error_message = "location must be set; it cannot be changed after the dataset is created."
}
}
variable "friendly_name" {
type = string
default = null
description = "Human-readable dataset name shown in the console. Defaults to dataset_id."
}
variable "description" {
type = string
default = null
description = "Free-text description of the dataset's purpose."
}
variable "default_table_expiration_ms" {
type = number
default = null
description = "Default lifetime for new tables, in ms (null = never). Min 3600000 (1h) when set."
validation {
condition = var.default_table_expiration_ms == null || var.default_table_expiration_ms >= 3600000
error_message = "default_table_expiration_ms must be null or at least 3600000 (1 hour)."
}
}
variable "default_partition_expiration_ms" {
type = number
default = null
description = "Default lifetime for partitions in new partitioned tables, in ms (null = never)."
validation {
condition = var.default_partition_expiration_ms == null || var.default_partition_expiration_ms >= 3600000
error_message = "default_partition_expiration_ms must be null or at least 3600000 (1 hour)."
}
}
variable "delete_contents_on_destroy" {
type = bool
default = false
description = "Allow `terraform destroy` to delete a dataset that still contains tables. Keep false in production."
}
variable "kms_key_name" {
type = string
default = null
description = "Dataset-level CMEK key (projects/../locations/../keyRings/../cryptoKeys/..). Region must match dataset location."
}
variable "labels" {
type = map(string)
default = {}
description = "Labels applied to the dataset for cost allocation and ownership."
}
variable "access" {
description = "Dataset access entries. Setting this REPLACES the default ACL, so include owners explicitly. Set exactly one principal field per entry."
default = []
type = list(object({
role = string
user_by_email = optional(string)
group_by_email = optional(string)
special_group = optional(string)
iam_member = optional(string)
}))
validation {
condition = alltrue([
for a in var.access :
length(compact([a.user_by_email, a.group_by_email, a.special_group, a.iam_member])) == 1
])
error_message = "Each access entry must set exactly one of user_by_email, group_by_email, special_group, or iam_member."
}
}
variable "tables" {
description = "Map of table_id => settings. Each table is created inside the dataset."
default = {}
type = map(object({
schema = string # JSON-encoded schema (use file()/jsonencode())
description = optional(string)
labels = optional(map(string), {})
deletion_protection = optional(bool, true)
clustering = optional(list(string), []) # up to 4 columns, must exist in schema
kms_key_name = optional(string)
time_partitioning = optional(object({
type = optional(string, "DAY") # DAY, HOUR, MONTH, YEAR
field = optional(string) # column to partition on; null = ingestion time
expiration_ms = optional(number) # per-partition TTL in ms
require_partition_filter = optional(bool, true) # force a partition predicate on every query
}))
}))
validation {
condition = alltrue([
for t in values(var.tables) : length(t.clustering) <= 4
])
error_message = "BigQuery allows at most 4 clustering columns per table."
}
validation {
condition = alltrue([
for t in values(var.tables) :
t.time_partitioning == null ? true :
contains(["DAY", "HOUR", "MONTH", "YEAR"], t.time_partitioning.type)
])
error_message = "time_partitioning.type must be one of DAY, HOUR, MONTH, or YEAR."
}
}
outputs.tf
output "dataset_id" {
description = "Dataset ID (the short id, e.g. sales_curated)."
value = google_bigquery_dataset.this.dataset_id
}
output "dataset_self_link" {
description = "Fully-qualified self link of the dataset."
value = google_bigquery_dataset.this.self_link
}
output "dataset_location" {
description = "Resolved dataset location (e.g. US, EU, asia-south1)."
value = google_bigquery_dataset.this.location
}
output "table_ids" {
description = "Map of table name => fully-qualified BigQuery table ID (project:dataset.table-style id field)."
value = { for k, t in google_bigquery_table.this : k => t.id }
}
output "table_references" {
description = "Map of table name => `project.dataset.table` reference, ready for SQL or downstream resources."
value = {
for k, t in google_bigquery_table.this :
k => "${var.project_id}.${google_bigquery_dataset.this.dataset_id}.${t.table_id}"
}
}
output "table_self_links" {
description = "Map of table name => self link."
value = { for k, t in google_bigquery_table.this : k => t.self_link }
}
How to use it
module "bigquery_dataset_table" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"
project_id = "kloudvin-analytics-prod"
dataset_id = "sales_curated"
location = "US"
friendly_name = "Sales (Curated)"
description = "Curated, query-ready sales facts for the analytics domain."
# Customer-managed encryption to satisfy the data-governance baseline.
kms_key_name = "projects/kloudvin-analytics-prod/locations/us/keyRings/bq/cryptoKeys/sales"
# Replace the default ACL: BI group can read, the ELT SA owns it.
access = [
{
role = "OWNER"
user_by_email = "elt-loader@kloudvin-analytics-prod.iam.gserviceaccount.com"
},
{
role = "READER"
group_by_email = "bi-analysts@kloudvin.com"
},
]
tables = {
orders = {
description = "One row per order, partitioned by order date."
deletion_protection = true
schema = file("${path.module}/schemas/orders.json")
time_partitioning = {
type = "DAY"
field = "order_date"
expiration_ms = 63072000000 # 730 days
require_partition_filter = true
}
# Prune by customer then region within each daily partition.
clustering = ["customer_id", "region"]
labels = { dataset = "sales", grain = "order" }
}
}
labels = {
team = "analytics"
environment = "production"
cost-center = "data-platform"
}
}
# Downstream: a scheduled query / view that reads the curated table by reference.
resource "google_bigquery_table" "orders_daily_revenue" {
project = "kloudvin-analytics-prod"
dataset_id = module.bigquery_dataset_table.dataset_id
table_id = "orders_daily_revenue"
view {
use_legacy_sql = false
query = <<-SQL
SELECT order_date, region, SUM(amount) AS revenue
FROM `${module.bigquery_dataset_table.table_references["orders"]}`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY order_date, region
SQL
}
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...gcs state bucket/container + key per path...
}
}
2. Module config — live/prod/bigquery/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"
}
inputs = {
project_id = "..."
dataset_id = "..."
location = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/bigquery && terragrunt apply # this module
terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
project_id |
string |
— | yes | GCP project ID that owns the dataset and tables. |
dataset_id |
string |
— | yes | Dataset ID; letters, numbers and underscores only. |
location |
string |
— | yes | Dataset location (US, EU, or a region). Immutable after create. |
friendly_name |
string |
null |
no | Console display name; defaults to dataset_id. |
description |
string |
null |
no | Free-text dataset description. |
default_table_expiration_ms |
number |
null |
no | Default new-table TTL in ms (null = never; min 1h). |
default_partition_expiration_ms |
number |
null |
no | Default partition TTL in ms (null = never; min 1h). |
delete_contents_on_destroy |
bool |
false |
no | Permit destroying a non-empty dataset. |
kms_key_name |
string |
null |
no | Dataset-level CMEK key; region must match location. |
labels |
map(string) |
{} |
no | Dataset labels for cost allocation. |
access |
list(object) |
[] |
no | Access entries; replaces the default ACL when set. |
tables |
map(object) |
{} |
no | Tables to create: schema, partitioning, clustering, CMEK, deletion protection. |
Outputs
| Name | Description |
|---|---|
dataset_id |
Short dataset ID (e.g. sales_curated). |
dataset_self_link |
Fully-qualified dataset self link. |
dataset_location |
Resolved dataset location. |
table_ids |
Map of table name => BigQuery table id. |
table_references |
Map of table name => project.dataset.table reference for SQL. |
table_self_links |
Map of table name => self link. |
Enterprise scenario
A retail analytics platform exposes a curated sales_curated dataset in the US multi-region to its BI team. They deploy this module once per environment: the ELT service account is granted OWNER and the bi-analysts Google Group READER via the access list, a CMEK key satisfies the security team’s encryption-at-rest control, and the orders table is partitioned by order_date with require_partition_filter = true and clustered on customer_id, region. The partition filter alone caps the cost of every dashboard query to the days it asks for, a 730-day partition expiry ages out old data automatically, and deletion_protection = true plus delete_contents_on_destroy = false means no terraform destroy can wipe the warehouse — all reviewed in a single pull request and reproduced identically in staging.
Best practices
- Always partition fact tables and force a partition filter. Set
time_partitioningon anything time-series and keeprequire_partition_filter = trueso a query without a date predicate fails fast instead of scanning the whole table — this is the single biggest BigQuery cost control. - Cluster on your real filter/join columns, in selectivity order. Up to four clustering columns prune storage blocks within a partition; order them most-selective-first (e.g.
customer_idbeforeregion) and make sure each name exists in the JSON schema or the apply fails. - Keep
deletion_protectionon anddelete_contents_on_destroyoff in production. A curated dataset is a shared asset; the per-table guard plus the dataset-level destroy guard stop a removedtableskey or a stray destroy from dropping populated tables. - Use default expirations on raw/staging, never on curated. Set
default_table_expiration_ms/default_partition_expiration_mson landing zones so transient data ages out and storage cost stays flat, but leave themnullon datasets that hold the system of record. - Manage access with
accessdeliberately — and prefer groups. Because settingaccessreplaces BigQuery’s default ACL, always include anOWNER; grant a Google GroupREADER/WRITERrather than individuals so leavers are handled in your IdP, and reserveOWNERfor the ELT/CI identity. - Pin
locationand CMEK for residency, and label everything.locationis immutable, so chooseUS/EU/region for data-residency up front; co-locate the CMEK key in the same region, and applyteam/environment/cost-centerlabels so finance can attribute storage and query spend back to the owning domain.