Terraform Module: GCP BigQuery Dataset & Table — governed, partitioned warehouse tables in one module

Quick take — A reusable hashicorp/google Terraform module for BigQuery: provision a dataset with location, default expirations, CMEK and access grants, plus partitioned, clustered tables with JSON schemas and deletion protection from typed variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "bigquery" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"

  project_id = "..."  # GCP project ID that owns the dataset and tables.
  dataset_id = "..."  # Dataset ID; letters, numbers and underscores only.
  location   = "..."  # Dataset location (`US`, `EU`, or a region). Immutable a…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

BigQuery is Google Cloud’s serverless, columnar data warehouse: you create a dataset (a regional or multi-regional container that scopes location, IAM, default expirations and the CMEK key) and then tables inside it that hold the data. There is no cluster to size — you pay for bytes stored and for bytes scanned per query — which makes BigQuery cheap to start and very easy to turn into a surprise bill when an unpartitioned table gets SELECT *-ed over its entire history.

The two design decisions that control that cost live on the table: partitioning (usually by a DATE/TIMESTAMP column or by ingestion time, so a query with a date filter scans one day instead of three years) and clustering (sorting within a partition by up to four columns so equality and range filters prune blocks). Get those right, plus require_partition_filter to force callers to supply a partition predicate, and the same table can be an order of magnitude cheaper to query. Get them wrong — or forget them — and every query is a full-table scan.

A raw google_bigquery_dataset + google_bigquery_table pair also forces every team to re-decide the same governance details each time: the immutable location, default table/partition expirations, delete_contents_on_destroy, deletion_protection, the access list, and the encryption key. This module wraps google_bigquery_dataset and a for_each set of google_bigquery_table resources behind typed, validated variables so a consuming team passes intent — “a US dataset, CMEK-encrypted, two tables partitioned by day and clustered, protected” — and gets a correct, governed warehouse every time.

When to use it

You are standing up per-domain or per-team datasets (sales_curated, events_raw) and want consistent location, expiration, labels and IAM rather than hand-built one-offs.
You need tables that are partitioned and clustered by default so query cost stays bounded as data grows, and require_partition_filter to stop accidental full scans.
You want lifecycle hygiene — default table/partition expirations on a raw/staging dataset so transient data ages out automatically instead of accumulating storage cost forever.
You must satisfy encryption or residency requirements: a customer-managed (CMEK) key on the dataset and a pinned location that auditors can verify in code.
You want deletion protection and a deliberate delete_contents_on_destroy choice so a stray terraform destroy cannot silently drop a populated dataset.
Reach for plain SQL CREATE TABLE or bq instead when you are doing throwaway exploration where none of this governance matters.

Module structure

terraform-module-gcp-bigquery/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # dataset + access + CMEK + tables
├── variables.tf     # typed, validated inputs
└── outputs.tf       # dataset id/self_link + table ids/refs

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

resource "google_bigquery_dataset" "this" {
  project    = var.project_id
  dataset_id = var.dataset_id
  location   = var.location

  friendly_name = coalesce(var.friendly_name, var.dataset_id)
  description   = var.description

  # Defaults inherited by tables that do not set their own expiration.
  # null = never expire. Values are in milliseconds.
  default_table_expiration_ms     = var.default_table_expiration_ms
  default_partition_expiration_ms = var.default_partition_expiration_ms

  # When true, `terraform destroy` can drop a dataset that still has tables.
  # Keep false for any dataset that holds real data.
  delete_contents_on_destroy = var.delete_contents_on_destroy

  labels = var.labels

  # Customer-managed encryption for data at rest (optional).
  dynamic "default_encryption_configuration" {
    for_each = var.kms_key_name == null ? [] : [var.kms_key_name]
    content {
      kms_key_name = default_encryption_configuration.value
    }
  }

  # Access grants. Each entry sets exactly one principal type
  # (user_by_email / group_by_email / special_group / iam_member)
  # against a role. Setting `access` replaces BigQuery's default ACL,
  # so include owners explicitly when you use it.
  dynamic "access" {
    for_each = var.access
    content {
      role           = access.value.role
      user_by_email  = access.value.user_by_email
      group_by_email = access.value.group_by_email
      special_group  = access.value.special_group
      iam_member     = access.value.iam_member
    }
  }
}

resource "google_bigquery_table" "this" {
  for_each = var.tables

  project    = var.project_id
  dataset_id = google_bigquery_dataset.this.dataset_id
  table_id   = each.key

  description = each.value.description
  schema      = each.value.schema
  labels      = each.value.labels

  # Guards the table against `terraform destroy` / replacement.
  deletion_protection = each.value.deletion_protection

  clustering = each.value.clustering

  # Column-based or ingestion-time DAY/HOUR/MONTH/YEAR partitioning.
  dynamic "time_partitioning" {
    for_each = each.value.time_partitioning == null ? [] : [each.value.time_partitioning]
    content {
      type                     = time_partitioning.value.type
      field                    = time_partitioning.value.field
      expiration_ms            = time_partitioning.value.expiration_ms
      require_partition_filter = time_partitioning.value.require_partition_filter
    }
  }

  # Per-table CMEK override; falls back to the dataset key when null.
  dynamic "encryption_configuration" {
    for_each = each.value.kms_key_name == null ? [] : [each.value.kms_key_name]
    content {
      kms_key_name = encryption_configuration.value
    }
  }
}

variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID that owns the dataset and tables."
}

variable "dataset_id" {
  type        = string
  description = "Dataset ID. Letters, numbers and underscores only; up to 1024 chars."

  validation {
    condition     = can(regex("^[A-Za-z0-9_]{1,1024}$", var.dataset_id))
    error_message = "dataset_id may contain only letters, numbers and underscores (no hyphens or dots)."
  }
}

variable "location" {
  type        = string
  description = "Dataset location (e.g. 'US', 'EU', or a region like 'asia-south1'). IMMUTABLE after create."

  validation {
    condition     = length(var.location) > 0
    error_message = "location must be set; it cannot be changed after the dataset is created."
  }
}

variable "friendly_name" {
  type        = string
  default     = null
  description = "Human-readable dataset name shown in the console. Defaults to dataset_id."
}

variable "description" {
  type        = string
  default     = null
  description = "Free-text description of the dataset's purpose."
}

variable "default_table_expiration_ms" {
  type        = number
  default     = null
  description = "Default lifetime for new tables, in ms (null = never). Min 3600000 (1h) when set."

  validation {
    condition     = var.default_table_expiration_ms == null || var.default_table_expiration_ms >= 3600000
    error_message = "default_table_expiration_ms must be null or at least 3600000 (1 hour)."
  }
}

variable "default_partition_expiration_ms" {
  type        = number
  default     = null
  description = "Default lifetime for partitions in new partitioned tables, in ms (null = never)."

  validation {
    condition     = var.default_partition_expiration_ms == null || var.default_partition_expiration_ms >= 3600000
    error_message = "default_partition_expiration_ms must be null or at least 3600000 (1 hour)."
  }
}

variable "delete_contents_on_destroy" {
  type        = bool
  default     = false
  description = "Allow `terraform destroy` to delete a dataset that still contains tables. Keep false in production."
}

variable "kms_key_name" {
  type        = string
  default     = null
  description = "Dataset-level CMEK key (projects/../locations/../keyRings/../cryptoKeys/..). Region must match dataset location."
}

variable "labels" {
  type        = map(string)
  default     = {}
  description = "Labels applied to the dataset for cost allocation and ownership."
}

variable "access" {
  description = "Dataset access entries. Setting this REPLACES the default ACL, so include owners explicitly. Set exactly one principal field per entry."
  default     = []
  type = list(object({
    role           = string
    user_by_email  = optional(string)
    group_by_email = optional(string)
    special_group  = optional(string)
    iam_member     = optional(string)
  }))

  validation {
    condition = alltrue([
      for a in var.access :
      length(compact([a.user_by_email, a.group_by_email, a.special_group, a.iam_member])) == 1
    ])
    error_message = "Each access entry must set exactly one of user_by_email, group_by_email, special_group, or iam_member."
  }
}

variable "tables" {
  description = "Map of table_id => settings. Each table is created inside the dataset."
  default     = {}
  type = map(object({
    schema              = string                     # JSON-encoded schema (use file()/jsonencode())
    description         = optional(string)
    labels              = optional(map(string), {})
    deletion_protection = optional(bool, true)
    clustering          = optional(list(string), []) # up to 4 columns, must exist in schema
    kms_key_name        = optional(string)
    time_partitioning = optional(object({
      type                     = optional(string, "DAY") # DAY, HOUR, MONTH, YEAR
      field                    = optional(string)        # column to partition on; null = ingestion time
      expiration_ms            = optional(number)        # per-partition TTL in ms
      require_partition_filter = optional(bool, true)    # force a partition predicate on every query
    }))
  }))

  validation {
    condition = alltrue([
      for t in values(var.tables) : length(t.clustering) <= 4
    ])
    error_message = "BigQuery allows at most 4 clustering columns per table."
  }

  validation {
    condition = alltrue([
      for t in values(var.tables) :
      t.time_partitioning == null ? true :
      contains(["DAY", "HOUR", "MONTH", "YEAR"], t.time_partitioning.type)
    ])
    error_message = "time_partitioning.type must be one of DAY, HOUR, MONTH, or YEAR."
  }
}

outputs.tf

output "dataset_id" {
  description = "Dataset ID (the short id, e.g. sales_curated)."
  value       = google_bigquery_dataset.this.dataset_id
}

output "dataset_self_link" {
  description = "Fully-qualified self link of the dataset."
  value       = google_bigquery_dataset.this.self_link
}

output "dataset_location" {
  description = "Resolved dataset location (e.g. US, EU, asia-south1)."
  value       = google_bigquery_dataset.this.location
}

output "table_ids" {
  description = "Map of table name => fully-qualified BigQuery table ID (project:dataset.table-style id field)."
  value       = { for k, t in google_bigquery_table.this : k => t.id }
}

output "table_references" {
  description = "Map of table name => `project.dataset.table` reference, ready for SQL or downstream resources."
  value = {
    for k, t in google_bigquery_table.this :
    k => "${var.project_id}.${google_bigquery_dataset.this.dataset_id}.${t.table_id}"
  }
}

output "table_self_links" {
  description = "Map of table name => self link."
  value       = { for k, t in google_bigquery_table.this : k => t.self_link }
}

How to use it

module "bigquery_dataset_table" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"

  project_id    = "kloudvin-analytics-prod"
  dataset_id    = "sales_curated"
  location      = "US"
  friendly_name = "Sales (Curated)"
  description   = "Curated, query-ready sales facts for the analytics domain."

  # Customer-managed encryption to satisfy the data-governance baseline.
  kms_key_name = "projects/kloudvin-analytics-prod/locations/us/keyRings/bq/cryptoKeys/sales"

  # Replace the default ACL: BI group can read, the ELT SA owns it.
  access = [
    {
      role          = "OWNER"
      user_by_email = "elt-loader@kloudvin-analytics-prod.iam.gserviceaccount.com"
    },
    {
      role           = "READER"
      group_by_email = "bi-analysts@kloudvin.com"
    },
  ]

  tables = {
    orders = {
      description         = "One row per order, partitioned by order date."
      deletion_protection = true
      schema              = file("${path.module}/schemas/orders.json")

      time_partitioning = {
        type                     = "DAY"
        field                    = "order_date"
        expiration_ms            = 63072000000 # 730 days
        require_partition_filter = true
      }

      # Prune by customer then region within each daily partition.
      clustering = ["customer_id", "region"]

      labels = { dataset = "sales", grain = "order" }
    }
  }

  labels = {
    team        = "analytics"
    environment = "production"
    cost-center = "data-platform"
  }
}

# Downstream: a scheduled query / view that reads the curated table by reference.
resource "google_bigquery_table" "orders_daily_revenue" {
  project    = "kloudvin-analytics-prod"
  dataset_id = module.bigquery_dataset_table.dataset_id
  table_id   = "orders_daily_revenue"

  view {
    use_legacy_sql = false
    query          = <<-SQL
      SELECT order_date, region, SUM(amount) AS revenue
      FROM `${module.bigquery_dataset_table.table_references["orders"]}`
      WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
      GROUP BY order_date, region
    SQL
  }
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...gcs state bucket/container + key per path...
  }
}

2. Module config — live/prod/bigquery/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-bigquery?ref=v1.0.0"
}

inputs = {
  project_id = "..."
  dataset_id = "..."
  location = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/bigquery && terragrunt apply        # this module
terragrunt run-all apply                      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.

Inputs

Name	Type	Default	Required	Description
`project_id`	`string`	—	yes	GCP project ID that owns the dataset and tables.
`dataset_id`	`string`	—	yes	Dataset ID; letters, numbers and underscores only.
`location`	`string`	—	yes	Dataset location (`US`, `EU`, or a region). Immutable after create.
`friendly_name`	`string`	`null`	no	Console display name; defaults to `dataset_id`.
`description`	`string`	`null`	no	Free-text dataset description.
`default_table_expiration_ms`	`number`	`null`	no	Default new-table TTL in ms (`null` = never; min 1h).
`default_partition_expiration_ms`	`number`	`null`	no	Default partition TTL in ms (`null` = never; min 1h).
`delete_contents_on_destroy`	`bool`	`false`	no	Permit destroying a non-empty dataset.
`kms_key_name`	`string`	`null`	no	Dataset-level CMEK key; region must match `location`.
`labels`	`map(string)`	`{}`	no	Dataset labels for cost allocation.
`access`	`list(object)`	`[]`	no	Access entries; replaces the default ACL when set.
`tables`	`map(object)`	`{}`	no	Tables to create: schema, partitioning, clustering, CMEK, deletion protection.

Outputs

Name	Description
`dataset_id`	Short dataset ID (e.g. `sales_curated`).
`dataset_self_link`	Fully-qualified dataset self link.
`dataset_location`	Resolved dataset location.
`table_ids`	Map of table name => BigQuery table `id`.
`table_references`	Map of table name => `project.dataset.table` reference for SQL.
`table_self_links`	Map of table name => self link.

Enterprise scenario

A retail analytics platform exposes a curated sales_curated dataset in the US multi-region to its BI team. They deploy this module once per environment: the ELT service account is granted OWNER and the bi-analysts Google Group READER via the access list, a CMEK key satisfies the security team’s encryption-at-rest control, and the orders table is partitioned by order_date with require_partition_filter = true and clustered on customer_id, region. The partition filter alone caps the cost of every dashboard query to the days it asks for, a 730-day partition expiry ages out old data automatically, and deletion_protection = true plus delete_contents_on_destroy = false means no terraform destroy can wipe the warehouse — all reviewed in a single pull request and reproduced identically in staging.

Best practices

Always partition fact tables and force a partition filter. Set time_partitioning on anything time-series and keep require_partition_filter = true so a query without a date predicate fails fast instead of scanning the whole table — this is the single biggest BigQuery cost control.
Cluster on your real filter/join columns, in selectivity order. Up to four clustering columns prune storage blocks within a partition; order them most-selective-first (e.g. customer_id before region) and make sure each name exists in the JSON schema or the apply fails.
Keep deletion_protection on and delete_contents_on_destroy off in production. A curated dataset is a shared asset; the per-table guard plus the dataset-level destroy guard stop a removed tables key or a stray destroy from dropping populated tables.
Use default expirations on raw/staging, never on curated. Set default_table_expiration_ms / default_partition_expiration_ms on landing zones so transient data ages out and storage cost stays flat, but leave them null on datasets that hold the system of record.
Manage access with access deliberately — and prefer groups. Because setting access replaces BigQuery’s default ACL, always include an OWNER; grant a Google Group READER/WRITER rather than individuals so leavers are handled in your IdP, and reserve OWNER for the ELT/CI identity.
Pin location and CMEK for residency, and label everything. location is immutable, so choose US/EU/region for data-residency up front; co-locate the CMEK key in the same region, and apply team/environment/cost-center labels so finance can attribute storage and query spend back to the owning domain.