IaC GCP

Terraform Module: GCP BigQuery Dataset & Table — governed, partitioned warehouse tables in one module

Quick take — A reusable hashicorp/google Terraform module for BigQuery: provision a dataset with location, default expirations, CMEK and access grants, plus partitioned, clustered tables with JSON schemas and deletion protection from typed variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "bigquery" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"

  project_id = "..."  # GCP project ID that owns the dataset and tables.
  dataset_id = "..."  # Dataset ID; letters, numbers and underscores only.
  location   = "..."  # Dataset location (`US`, `EU`, or a region). Immutable a…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

BigQuery is Google Cloud’s serverless, columnar data warehouse: you create a dataset (a regional or multi-regional container that scopes location, IAM, default expirations and the CMEK key) and then tables inside it that hold the data. There is no cluster to size — you pay for bytes stored and for bytes scanned per query — which makes BigQuery cheap to start and very easy to turn into a surprise bill when an unpartitioned table gets SELECT *-ed over its entire history.

The two design decisions that control that cost live on the table: partitioning (usually by a DATE/TIMESTAMP column or by ingestion time, so a query with a date filter scans one day instead of three years) and clustering (sorting within a partition by up to four columns so equality and range filters prune blocks). Get those right, plus require_partition_filter to force callers to supply a partition predicate, and the same table can be an order of magnitude cheaper to query. Get them wrong — or forget them — and every query is a full-table scan.

A raw google_bigquery_dataset + google_bigquery_table pair also forces every team to re-decide the same governance details each time: the immutable location, default table/partition expirations, delete_contents_on_destroy, deletion_protection, the access list, and the encryption key. This module wraps google_bigquery_dataset and a for_each set of google_bigquery_table resources behind typed, validated variables so a consuming team passes intent — “a US dataset, CMEK-encrypted, two tables partitioned by day and clustered, protected” — and gets a correct, governed warehouse every time.

When to use it

Module structure

terraform-module-gcp-bigquery/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # dataset + access + CMEK + tables
├── variables.tf     # typed, validated inputs
└── outputs.tf       # dataset id/self_link + table ids/refs

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

resource "google_bigquery_dataset" "this" {
  project    = var.project_id
  dataset_id = var.dataset_id
  location   = var.location

  friendly_name = coalesce(var.friendly_name, var.dataset_id)
  description   = var.description

  # Defaults inherited by tables that do not set their own expiration.
  # null = never expire. Values are in milliseconds.
  default_table_expiration_ms     = var.default_table_expiration_ms
  default_partition_expiration_ms = var.default_partition_expiration_ms

  # When true, `terraform destroy` can drop a dataset that still has tables.
  # Keep false for any dataset that holds real data.
  delete_contents_on_destroy = var.delete_contents_on_destroy

  labels = var.labels

  # Customer-managed encryption for data at rest (optional).
  dynamic "default_encryption_configuration" {
    for_each = var.kms_key_name == null ? [] : [var.kms_key_name]
    content {
      kms_key_name = default_encryption_configuration.value
    }
  }

  # Access grants. Each entry sets exactly one principal type
  # (user_by_email / group_by_email / special_group / iam_member)
  # against a role. Setting `access` replaces BigQuery's default ACL,
  # so include owners explicitly when you use it.
  dynamic "access" {
    for_each = var.access
    content {
      role           = access.value.role
      user_by_email  = access.value.user_by_email
      group_by_email = access.value.group_by_email
      special_group  = access.value.special_group
      iam_member     = access.value.iam_member
    }
  }
}

resource "google_bigquery_table" "this" {
  for_each = var.tables

  project    = var.project_id
  dataset_id = google_bigquery_dataset.this.dataset_id
  table_id   = each.key

  description = each.value.description
  schema      = each.value.schema
  labels      = each.value.labels

  # Guards the table against `terraform destroy` / replacement.
  deletion_protection = each.value.deletion_protection

  clustering = each.value.clustering

  # Column-based or ingestion-time DAY/HOUR/MONTH/YEAR partitioning.
  dynamic "time_partitioning" {
    for_each = each.value.time_partitioning == null ? [] : [each.value.time_partitioning]
    content {
      type                     = time_partitioning.value.type
      field                    = time_partitioning.value.field
      expiration_ms            = time_partitioning.value.expiration_ms
      require_partition_filter = time_partitioning.value.require_partition_filter
    }
  }

  # Per-table CMEK override; falls back to the dataset key when null.
  dynamic "encryption_configuration" {
    for_each = each.value.kms_key_name == null ? [] : [each.value.kms_key_name]
    content {
      kms_key_name = encryption_configuration.value
    }
  }
}

variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID that owns the dataset and tables."
}

variable "dataset_id" {
  type        = string
  description = "Dataset ID. Letters, numbers and underscores only; up to 1024 chars."

  validation {
    condition     = can(regex("^[A-Za-z0-9_]{1,1024}$", var.dataset_id))
    error_message = "dataset_id may contain only letters, numbers and underscores (no hyphens or dots)."
  }
}

variable "location" {
  type        = string
  description = "Dataset location (e.g. 'US', 'EU', or a region like 'asia-south1'). IMMUTABLE after create."

  validation {
    condition     = length(var.location) > 0
    error_message = "location must be set; it cannot be changed after the dataset is created."
  }
}

variable "friendly_name" {
  type        = string
  default     = null
  description = "Human-readable dataset name shown in the console. Defaults to dataset_id."
}

variable "description" {
  type        = string
  default     = null
  description = "Free-text description of the dataset's purpose."
}

variable "default_table_expiration_ms" {
  type        = number
  default     = null
  description = "Default lifetime for new tables, in ms (null = never). Min 3600000 (1h) when set."

  validation {
    condition     = var.default_table_expiration_ms == null || var.default_table_expiration_ms >= 3600000
    error_message = "default_table_expiration_ms must be null or at least 3600000 (1 hour)."
  }
}

variable "default_partition_expiration_ms" {
  type        = number
  default     = null
  description = "Default lifetime for partitions in new partitioned tables, in ms (null = never)."

  validation {
    condition     = var.default_partition_expiration_ms == null || var.default_partition_expiration_ms >= 3600000
    error_message = "default_partition_expiration_ms must be null or at least 3600000 (1 hour)."
  }
}

variable "delete_contents_on_destroy" {
  type        = bool
  default     = false
  description = "Allow `terraform destroy` to delete a dataset that still contains tables. Keep false in production."
}

variable "kms_key_name" {
  type        = string
  default     = null
  description = "Dataset-level CMEK key (projects/../locations/../keyRings/../cryptoKeys/..). Region must match dataset location."
}

variable "labels" {
  type        = map(string)
  default     = {}
  description = "Labels applied to the dataset for cost allocation and ownership."
}

variable "access" {
  description = "Dataset access entries. Setting this REPLACES the default ACL, so include owners explicitly. Set exactly one principal field per entry."
  default     = []
  type = list(object({
    role           = string
    user_by_email  = optional(string)
    group_by_email = optional(string)
    special_group  = optional(string)
    iam_member     = optional(string)
  }))

  validation {
    condition = alltrue([
      for a in var.access :
      length(compact([a.user_by_email, a.group_by_email, a.special_group, a.iam_member])) == 1
    ])
    error_message = "Each access entry must set exactly one of user_by_email, group_by_email, special_group, or iam_member."
  }
}

variable "tables" {
  description = "Map of table_id => settings. Each table is created inside the dataset."
  default     = {}
  type = map(object({
    schema              = string                     # JSON-encoded schema (use file()/jsonencode())
    description         = optional(string)
    labels              = optional(map(string), {})
    deletion_protection = optional(bool, true)
    clustering          = optional(list(string), []) # up to 4 columns, must exist in schema
    kms_key_name        = optional(string)
    time_partitioning = optional(object({
      type                     = optional(string, "DAY") # DAY, HOUR, MONTH, YEAR
      field                    = optional(string)        # column to partition on; null = ingestion time
      expiration_ms            = optional(number)        # per-partition TTL in ms
      require_partition_filter = optional(bool, true)    # force a partition predicate on every query
    }))
  }))

  validation {
    condition = alltrue([
      for t in values(var.tables) : length(t.clustering) <= 4
    ])
    error_message = "BigQuery allows at most 4 clustering columns per table."
  }

  validation {
    condition = alltrue([
      for t in values(var.tables) :
      t.time_partitioning == null ? true :
      contains(["DAY", "HOUR", "MONTH", "YEAR"], t.time_partitioning.type)
    ])
    error_message = "time_partitioning.type must be one of DAY, HOUR, MONTH, or YEAR."
  }
}

outputs.tf

output "dataset_id" {
  description = "Dataset ID (the short id, e.g. sales_curated)."
  value       = google_bigquery_dataset.this.dataset_id
}

output "dataset_self_link" {
  description = "Fully-qualified self link of the dataset."
  value       = google_bigquery_dataset.this.self_link
}

output "dataset_location" {
  description = "Resolved dataset location (e.g. US, EU, asia-south1)."
  value       = google_bigquery_dataset.this.location
}

output "table_ids" {
  description = "Map of table name => fully-qualified BigQuery table ID (project:dataset.table-style id field)."
  value       = { for k, t in google_bigquery_table.this : k => t.id }
}

output "table_references" {
  description = "Map of table name => `project.dataset.table` reference, ready for SQL or downstream resources."
  value = {
    for k, t in google_bigquery_table.this :
    k => "${var.project_id}.${google_bigquery_dataset.this.dataset_id}.${t.table_id}"
  }
}

output "table_self_links" {
  description = "Map of table name => self link."
  value       = { for k, t in google_bigquery_table.this : k => t.self_link }
}

How to use it

module "bigquery_dataset_table" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"

  project_id    = "kloudvin-analytics-prod"
  dataset_id    = "sales_curated"
  location      = "US"
  friendly_name = "Sales (Curated)"
  description   = "Curated, query-ready sales facts for the analytics domain."

  # Customer-managed encryption to satisfy the data-governance baseline.
  kms_key_name = "projects/kloudvin-analytics-prod/locations/us/keyRings/bq/cryptoKeys/sales"

  # Replace the default ACL: BI group can read, the ELT SA owns it.
  access = [
    {
      role          = "OWNER"
      user_by_email = "elt-loader@kloudvin-analytics-prod.iam.gserviceaccount.com"
    },
    {
      role           = "READER"
      group_by_email = "bi-analysts@kloudvin.com"
    },
  ]

  tables = {
    orders = {
      description         = "One row per order, partitioned by order date."
      deletion_protection = true
      schema              = file("${path.module}/schemas/orders.json")

      time_partitioning = {
        type                     = "DAY"
        field                    = "order_date"
        expiration_ms            = 63072000000 # 730 days
        require_partition_filter = true
      }

      # Prune by customer then region within each daily partition.
      clustering = ["customer_id", "region"]

      labels = { dataset = "sales", grain = "order" }
    }
  }

  labels = {
    team        = "analytics"
    environment = "production"
    cost-center = "data-platform"
  }
}

# Downstream: a scheduled query / view that reads the curated table by reference.
resource "google_bigquery_table" "orders_daily_revenue" {
  project    = "kloudvin-analytics-prod"
  dataset_id = module.bigquery_dataset_table.dataset_id
  table_id   = "orders_daily_revenue"

  view {
    use_legacy_sql = false
    query          = <<-SQL
      SELECT order_date, region, SUM(amount) AS revenue
      FROM `${module.bigquery_dataset_table.table_references["orders"]}`
      WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
      GROUP BY order_date, region
    SQL
  }
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root configlive/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...gcs state bucket/container + key per path...
  }
}

2. Module configlive/prod/bigquery/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"
}

inputs = {
  project_id = "..."
  dataset_id = "..."
  location = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/bigquery && terragrunt apply        # this module
terragrunt run-all apply                      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.

Inputs

Name Type Default Required Description
project_id string yes GCP project ID that owns the dataset and tables.
dataset_id string yes Dataset ID; letters, numbers and underscores only.
location string yes Dataset location (US, EU, or a region). Immutable after create.
friendly_name string null no Console display name; defaults to dataset_id.
description string null no Free-text dataset description.
default_table_expiration_ms number null no Default new-table TTL in ms (null = never; min 1h).
default_partition_expiration_ms number null no Default partition TTL in ms (null = never; min 1h).
delete_contents_on_destroy bool false no Permit destroying a non-empty dataset.
kms_key_name string null no Dataset-level CMEK key; region must match location.
labels map(string) {} no Dataset labels for cost allocation.
access list(object) [] no Access entries; replaces the default ACL when set.
tables map(object) {} no Tables to create: schema, partitioning, clustering, CMEK, deletion protection.

Outputs

Name Description
dataset_id Short dataset ID (e.g. sales_curated).
dataset_self_link Fully-qualified dataset self link.
dataset_location Resolved dataset location.
table_ids Map of table name => BigQuery table id.
table_references Map of table name => project.dataset.table reference for SQL.
table_self_links Map of table name => self link.

Enterprise scenario

A retail analytics platform exposes a curated sales_curated dataset in the US multi-region to its BI team. They deploy this module once per environment: the ELT service account is granted OWNER and the bi-analysts Google Group READER via the access list, a CMEK key satisfies the security team’s encryption-at-rest control, and the orders table is partitioned by order_date with require_partition_filter = true and clustered on customer_id, region. The partition filter alone caps the cost of every dashboard query to the days it asks for, a 730-day partition expiry ages out old data automatically, and deletion_protection = true plus delete_contents_on_destroy = false means no terraform destroy can wipe the warehouse — all reviewed in a single pull request and reproduced identically in staging.

Best practices

TerraformGCPBigQuery Dataset & TableModuleIaC
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading