Quick take — A reusable hashicorp/google Terraform module for google_dataproc_cluster: private clusters, autoscaling policies, component gateway, preemptible secondary workers, CMEK encryption and staging buckets. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "dataproc" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"
project_id = "..." # GCP project ID where the cluster is created.
region = "..." # Dataproc region; must match the subnetwork region.
cluster_name = "..." # Cluster name; lowercased, `^[a-z]([-a-z0-9]*[a-z0-9])?$…
environment = "..." # One of `dev`, `stg`, `prod`; drives staging-bucket `for…
subnetwork = "..." # Subnetwork self-link/name; required for `internal_ip_on…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Cloud Dataproc is GCP’s managed Apache Spark, Hadoop, Hive, Presto and Flink service. You hand it a cluster shape — a master, some workers, a region — and it provisions the VMs, installs the open-source stack, wires up YARN/HDFS, and lets you submit jobs over the Dataproc Jobs API or the Component Gateway UIs. The catch is that a production Dataproc cluster is never just a master and two workers. It needs a deterministic staging and temp bucket (otherwise Dataproc auto-creates one per region and you lose track of cost), an autoscaling policy so the secondary worker pool grows under load and shrinks when idle, a private network with internal_ip_only = true so nodes never get external IPs, CMEK so the persistent disks and bucket are encrypted with your own KMS key, and a service account scoped to least privilege instead of the default Compute Engine SA.
Wiring all of that by hand in every project is where drift and copy-paste bugs creep in. This module wraps google_dataproc_cluster (plus an optional google_dataproc_autoscaling_policy and a staging bucket) so a team declares the intent, “a private, autoscaling, encrypted Spark cluster in europe-west1”, in a dozen lines. The module supplies the safe defaults (a pinned image version, internal IPs only, shielded VMs, and a single staging bucket) and exposes an opt-in idle-delete TTL for ephemeral clusters, since idle_delete_ttl defaults to null.
When to use it
- You run batch Spark/PySpark or Hive jobs on GCP and want repeatable, ephemeral-or-long-lived clusters provisioned through IaC rather than the gcloud dataproc clusters create flag soup.
- You need autoscaling so transient ETL workloads spin secondary (preemptible) workers up and down without a human resizing the cluster.
- You operate in a regulated or security-conscious environment that mandates private IPs, customer-managed encryption keys (CMEK), and non-default service accounts.
- You want a standard staging/temp bucket lifecycle and consistent labelling for cost allocation across many teams’ clusters.
- Reach for Dataproc Serverless (google_dataproc_batch) instead if you only run short batch jobs and never need a persistent cluster or HDFS; this module is for the managed cluster model.
Module structure
terraform-module-gcp-dataproc/
├── versions.tf # provider + Terraform version pins
├── main.tf # autoscaling policy + dataproc cluster + staging bucket
├── variables.tf # var-driven inputs with validation
└── outputs.tf # cluster id/name + HTTP ports, bucket, policy id
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
locals {
# Dataproc requires lowercase names matching ^[a-z]([-a-z0-9]*[a-z0-9])?$
cluster_name = lower(var.cluster_name)
# Build the optional autoscaling block only when a policy is requested.
autoscaling_enabled = var.autoscaling.max_secondary_workers > 0
labels = merge(
{
managed-by = "terraform"
module = "gcp-dataproc"
env = var.environment
},
var.labels,
)
}
# Deterministic staging/temp bucket so Dataproc does not auto-create one per region.
resource "google_storage_bucket" "staging" {
count = var.create_staging_bucket ? 1 : 0
name = coalesce(var.staging_bucket_name, "${local.cluster_name}-${var.project_id}-dpstage")
project = var.project_id
location = var.region
storage_class = "STANDARD"
uniform_bucket_level_access = true
force_destroy = var.environment != "prod"
lifecycle_rule {
condition {
age = var.staging_bucket_age_days
}
action {
type = "Delete"
}
}
labels = local.labels
}
# Autoscaling policy governing the secondary (preemptible) worker pool.
resource "google_dataproc_autoscaling_policy" "this" {
count = local.autoscaling_enabled ? 1 : 0
policy_id = "${local.cluster_name}-asp"
project = var.project_id
location = var.region
worker_config {
min_instances = var.num_workers
max_instances = var.num_workers
}
secondary_worker_config {
min_instances = var.autoscaling.min_secondary_workers
max_instances = var.autoscaling.max_secondary_workers
weight = 1
}
basic_algorithm {
cooldown_period = var.autoscaling.cooldown_period
yarn_config {
graceful_decommission_timeout = var.autoscaling.graceful_decommission_timeout
scale_up_factor = var.autoscaling.scale_up_factor
scale_down_factor = var.autoscaling.scale_down_factor
scale_up_min_worker_fraction = 0.0
scale_down_min_worker_fraction = 0.0
}
}
}
resource "google_dataproc_cluster" "this" {
name = local.cluster_name
project = var.project_id
region = var.region
labels = local.labels
# cluster_config is always present; the single-element for_each makes this dynamic block behave like a static one.
dynamic "cluster_config" {
for_each = [1]
content {
staging_bucket = var.create_staging_bucket ? google_storage_bucket.staging[0].name : var.staging_bucket_name
gce_cluster_config {
zone = var.zone
subnetwork = var.subnetwork
internal_ip_only = var.internal_ip_only
service_account = var.service_account
service_account_scopes = ["cloud-platform"]
tags = var.network_tags
shielded_instance_config {
enable_secure_boot = true
enable_vtpm = true
enable_integrity_monitoring = true
}
}
master_config {
num_instances = var.num_masters
machine_type = var.master_machine_type
disk_config {
boot_disk_type = var.master_disk_type
boot_disk_size_gb = var.master_disk_size_gb
}
}
worker_config {
num_instances = var.num_workers
machine_type = var.worker_machine_type
disk_config {
boot_disk_type = var.worker_disk_type
boot_disk_size_gb = var.worker_disk_size_gb
num_local_ssds = var.worker_num_local_ssds
}
}
# Preemptible/spot secondary workers, sized by the autoscaling policy.
dynamic "preemptible_worker_config" {
for_each = local.autoscaling_enabled ? [1] : []
content {
num_instances = var.autoscaling.min_secondary_workers
preemptibility = var.secondary_worker_preemptibility
disk_config {
boot_disk_type = var.worker_disk_type
boot_disk_size_gb = var.worker_disk_size_gb
}
}
}
software_config {
image_version = var.image_version
optional_components = var.optional_components
override_properties = var.cluster_properties
}
# Attach the autoscaling policy when one was created.
dynamic "autoscaling_config" {
for_each = local.autoscaling_enabled ? [1] : []
content {
policy_uri = google_dataproc_autoscaling_policy.this[0].id
}
}
# Component Gateway exposes the Spark/YARN/Jupyter web UIs through an authenticated HTTPS proxy, so no SSH tunnels are needed.
endpoint_config {
enable_http_port_access = var.enable_component_gateway
}
# CMEK: encrypt PD and the cluster's metadata with a customer key.
dynamic "encryption_config" {
for_each = var.kms_key_name == null ? [] : [1]
content {
kms_key_name = var.kms_key_name
}
}
# Auto-delete idle ephemeral clusters to cap cost.
dynamic "lifecycle_config" {
for_each = var.idle_delete_ttl == null ? [] : [1]
content {
idle_delete_ttl = var.idle_delete_ttl
}
}
}
}
timeouts {
create = "45m"
update = "45m"
delete = "45m"
}
}
variables.tf
variable "project_id" {
description = "GCP project ID where the Dataproc cluster is created."
type = string
}
variable "region" {
description = "Dataproc region, e.g. europe-west1. Must match the subnetwork region."
type = string
}
variable "zone" {
description = "Specific zone for cluster VMs. Leave empty for Dataproc Auto Zone placement."
type = string
default = ""
}
variable "cluster_name" {
description = "Cluster name. Lowercased; must match ^[a-z]([-a-z0-9]*[a-z0-9])?$ and be <= 51 chars."
type = string
validation {
condition = can(regex("^[a-z]([-a-z0-9]*[a-z0-9])?$", lower(var.cluster_name))) && length(var.cluster_name) <= 51
error_message = "cluster_name must be <= 51 chars, lowercase letters/digits/hyphens, starting with a letter."
}
}
variable "environment" {
description = "Environment label (dev/stg/prod). Drives force_destroy on the staging bucket."
type = string
validation {
condition = contains(["dev", "stg", "prod"], var.environment)
error_message = "environment must be one of: dev, stg, prod."
}
}
variable "subnetwork" {
description = "Self-link or short name of the subnetwork. Required when internal_ip_only = true."
type = string
}
variable "internal_ip_only" {
description = "When true, cluster VMs get no external IPs (requires Private Google Access / Cloud NAT)."
type = bool
default = true
}
variable "service_account" {
description = "Service account email for cluster VMs. Null uses the default Compute Engine SA (not recommended)."
type = string
default = null
}
variable "network_tags" {
description = "Network tags applied to cluster VMs for firewall targeting."
type = list(string)
default = []
}
variable "num_masters" {
description = "Number of masters. Use 1 for standard, 3 for High Availability mode."
type = number
default = 1
validation {
condition = contains([1, 3], var.num_masters)
error_message = "num_masters must be 1 (standard) or 3 (HA)."
}
}
variable "master_machine_type" {
description = "Machine type for the master node(s)."
type = string
default = "n2-standard-4"
}
variable "master_disk_type" {
description = "Boot disk type for masters (pd-standard, pd-balanced, pd-ssd)."
type = string
default = "pd-balanced"
}
variable "master_disk_size_gb" {
description = "Boot disk size (GB) for masters. Minimum 30."
type = number
default = 100
validation {
condition = var.master_disk_size_gb >= 30
error_message = "master_disk_size_gb must be at least 30."
}
}
variable "num_workers" {
description = "Number of primary (non-preemptible) workers. Minimum 2 for HDFS replication."
type = number
default = 2
validation {
condition = var.num_workers >= 2
error_message = "num_workers must be at least 2 so HDFS can replicate blocks."
}
}
variable "worker_machine_type" {
description = "Machine type for primary workers."
type = string
default = "n2-standard-4"
}
variable "worker_disk_type" {
description = "Boot disk type for workers (pd-standard, pd-balanced, pd-ssd)."
type = string
default = "pd-balanced"
}
variable "worker_disk_size_gb" {
description = "Boot disk size (GB) for workers. Minimum 30."
type = number
default = 200
validation {
condition = var.worker_disk_size_gb >= 30
error_message = "worker_disk_size_gb must be at least 30."
}
}
variable "worker_num_local_ssds" {
description = "Number of local SSDs per primary worker for shuffle/scratch space."
type = number
default = 0
}
variable "secondary_worker_preemptibility" {
description = "Preemptibility of secondary workers: PREEMPTIBLE, SPOT, or NON_PREEMPTIBLE."
type = string
default = "SPOT"
validation {
condition = contains(["PREEMPTIBLE", "SPOT", "NON_PREEMPTIBLE"], var.secondary_worker_preemptibility)
error_message = "secondary_worker_preemptibility must be PREEMPTIBLE, SPOT, or NON_PREEMPTIBLE."
}
}
variable "autoscaling" {
description = "Autoscaling policy for the secondary worker pool. Set max_secondary_workers = 0 to disable autoscaling entirely."
type = object({
min_secondary_workers = optional(number, 0)
max_secondary_workers = optional(number, 0)
cooldown_period = optional(string, "120s")
graceful_decommission_timeout = optional(string, "300s")
scale_up_factor = optional(number, 0.5)
scale_down_factor = optional(number, 1.0)
})
default = {}
validation {
condition = var.autoscaling.max_secondary_workers >= var.autoscaling.min_secondary_workers
error_message = "autoscaling.max_secondary_workers must be >= min_secondary_workers."
}
}
variable "image_version" {
description = "Dataproc image version, e.g. 2.2-debian12. Pin it; do not float on 'latest'."
type = string
default = "2.2-debian12"
}
variable "optional_components" {
description = "Optional components to install (e.g. JUPYTER, ZEPPELIN, HIVE_WEBHCAT, FLINK, TRINO)."
type = list(string)
default = []
}
variable "cluster_properties" {
description = "Map of Dataproc/Hadoop/Spark property overrides, e.g. {\"spark:spark.executor.memory\" = \"6g\"}."
type = map(string)
default = {}
}
variable "enable_component_gateway" {
description = "Enable the Component Gateway to reach Spark/YARN/Jupyter UIs without SSH tunnels."
type = bool
default = true
}
variable "kms_key_name" {
description = "Full resource ID of a Cloud KMS key for CMEK (projects/.../cryptoKeys/...). Null = Google-managed keys."
type = string
default = null
}
variable "idle_delete_ttl" {
description = "Auto-delete the cluster after this idle duration, e.g. \"1800s\". Null = never auto-delete."
type = string
default = null
}
variable "create_staging_bucket" {
description = "Create a dedicated staging/temp bucket. When false, supply staging_bucket_name."
type = bool
default = true
}
variable "staging_bucket_name" {
description = "Name of an existing staging bucket. Required when create_staging_bucket = false."
type = string
default = null
}
variable "staging_bucket_age_days" {
description = "Lifecycle age (days) after which staging objects are deleted."
type = number
default = 14
}
variable "labels" {
description = "Additional labels merged onto the cluster, policy, and bucket."
type = map(string)
default = {}
}
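The description of staging_bucket_name says it is required when create_staging_bucket = false, but the variables above do not enforce that pairing: with both left unset, the cluster would receive a null staging_bucket and Dataproc would fall back to its auto-created bucket. A minimal sketch of one way to catch this at plan time follows. The terraform_data resource and its name staging_bucket_guard are hypothetical additions, not part of the module as published; the sketch relies on precondition blocks and terraform_data, both available within the module's required_version of >= 1.5.0.

```hcl
# Hypothetical guard (assumption: added to main.tf; not in the published module).
# Fails the plan when no staging bucket would be configured at all.
resource "terraform_data" "staging_bucket_guard" {
  lifecycle {
    precondition {
      # Either the module creates the bucket, or the caller names an existing one.
      condition     = var.create_staging_bucket || var.staging_bucket_name != null
      error_message = "Set staging_bucket_name when create_staging_bucket is false."
    }
  }
}
```

A precondition is used here rather than a variable validation block because the rule spans two variables, and cross-variable references inside validation are only accepted by newer Terraform releases than the pinned minimum.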
outputs.tf
output "cluster_id" {
description = "Fully qualified Dataproc cluster ID (projects/<p>/regions/<r>/clusters/<n>)."
value = google_dataproc_cluster.this.id
}
output "cluster_name" {
description = "Name of the Dataproc cluster."
value = google_dataproc_cluster.this.name
}
output "region" {
description = "Region the cluster runs in."
value = google_dataproc_cluster.this.region
}
output "staging_bucket" {
description = "Staging/temp bucket name used by the cluster."
value = google_dataproc_cluster.this.cluster_config[0].staging_bucket
}
output "master_instance_names" {
description = "Compute Engine instance names of the master node(s)."
value = google_dataproc_cluster.this.cluster_config[0].master_config[0].instance_names
}
output "http_ports" {
description = "Map of Component Gateway HTTP endpoints (YARN, Spark History, Jupyter, etc.) when enabled."
value = try(google_dataproc_cluster.this.cluster_config[0].endpoint_config[0].http_ports, {})
}
output "autoscaling_policy_id" {
description = "ID of the autoscaling policy attached to the cluster, or null when autoscaling is disabled."
value = try(google_dataproc_autoscaling_policy.this[0].id, null)
}
How to use it
module "dataproc" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"
project_id = "kv-data-prod"
region = "europe-west1"
cluster_name = "etl-spark-prod"
environment = "prod"
# Private networking — no external IPs; relies on Cloud NAT + Private Google Access.
subnetwork = "projects/kv-net-prod/regions/europe-west1/subnetworks/dataproc-priv"
internal_ip_only = true
service_account = "dataproc-etl@kv-data-prod.iam.gserviceaccount.com"
network_tags = ["dataproc", "egress-nat"]
# Cluster shape.
num_masters = 1
num_workers = 3
worker_machine_type = "n2-standard-8"
# Grow up to 20 SPOT secondary workers under YARN pressure; on scale-down, give running containers up to 5 minutes to finish.
autoscaling = {
min_secondary_workers = 0
max_secondary_workers = 20
graceful_decommission_timeout = "300s"
}
image_version = "2.2-debian12"
optional_components = ["JUPYTER"]
cluster_properties = {
"spark:spark.dynamicAllocation.enabled" = "true"
"spark:spark.sql.shuffle.partitions" = "400"
}
# Customer-managed encryption for PD + metadata.
kms_key_name = "projects/kv-sec-prod/locations/europe-west1/keyRings/dataproc/cryptoKeys/cluster-cmek"
labels = {
team = "data-platform"
cost-center = "4412"
}
}
# Downstream: submit a PySpark batch job to the cluster created above,
# referencing the module's cluster_name output.
resource "google_dataproc_job" "nightly_etl" {
project = "kv-data-prod"
region = module.dataproc.region
placement {
cluster_name = module.dataproc.cluster_name
}
pyspark_config {
main_python_file_uri = "gs://kv-data-prod-jobs/etl/nightly_load.py"
args = ["--date", "2026-06-09"]
properties = {
"spark.executor.cores" = "4"
}
}
labels = { pipeline = "nightly-etl" }
}
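For contrast with the production call above, here is a sketch of a small ephemeral dev cluster that leans on the module defaults. The project, subnetwork, and cluster names are hypothetical placeholders. It assumes only the inputs documented in variables.tf: omitting autoscaling leaves max_secondary_workers at 0, so neither the policy nor the secondary worker pool is created, and idle_delete_ttl removes the cluster once it has been idle for the given duration.

```hcl
module "dataproc_dev" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"

  project_id   = "kv-data-dev" # hypothetical project ID
  region       = "europe-west1"
  cluster_name = "etl-spark-dev"
  environment  = "dev" # non-prod, so the staging bucket gets force_destroy = true

  # Hypothetical subnetwork; it must live in the same region as the cluster.
  subnetwork = "projects/kv-net-dev/regions/europe-west1/subnetworks/dataproc-priv"

  # No autoscaling block: the default max_secondary_workers = 0 disables the
  # policy and the secondary worker pool, leaving 1 master and 2 primary workers.

  # Delete the cluster after 30 minutes without running jobs.
  idle_delete_ttl = "1800s"
}
```

Because service_account is left unset in this sketch, the VMs would run as the default Compute Engine service account; pass a dedicated account even in dev if your policy requires it.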
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/dataproc/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"
}
inputs = {
project_id = "..."
region = "..."
cluster_name = "..."
environment = "..."
subnetwork = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/dataproc && terragrunt apply # this module
terragrunt run-all apply # run from live/prod: every module under it
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stg / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
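The root config in step 1 only generates the backend, while the provider is also meant to be defined once. A minimal sketch of how the root live/terragrunt.hcl could generate it, using Terragrunt's generate block, is shown here. The project and region values are hypothetical placeholders copied from the Quickstart, and the file name provider.tf is an assumption.

```hcl
# Hypothetical addition to live/terragrunt.hcl (not shown in the steps above).
# Writes a provider.tf into every module's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt" # only overwrite files Terragrunt generated itself

  contents = <<EOF
provider "google" {
  project = "my-project"  # placeholder, as in the Quickstart
  region  = "us-central1" # placeholder, as in the Quickstart
}
EOF
}
```

With this in the root, the per-environment terragrunt.hcl files stay limited to the source and inputs, as in step 2.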
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | — | yes | GCP project ID where the cluster is created. |
| region | string | — | yes | Dataproc region; must match the subnetwork region. |
| zone | string | "" | no | Specific zone, or empty for Dataproc Auto Zone placement. |
| cluster_name | string | — | yes | Cluster name; lowercased, ^[a-z]([-a-z0-9]*[a-z0-9])?$, ≤ 51 chars. |
| environment | string | — | yes | One of dev, stg, prod; drives staging-bucket force_destroy. |
| subnetwork | string | — | yes | Subnetwork self-link/name; required for internal_ip_only. |
| internal_ip_only | bool | true | no | Disable external IPs on cluster VMs. |
| service_account | string | null | no | VM service account; null uses the default Compute SA. |
| network_tags | list(string) | [] | no | Network tags for firewall targeting. |
| num_masters | number | 1 | no | 1 (standard) or 3 (HA). |
| master_machine_type | string | n2-standard-4 | no | Master machine type. |
| master_disk_type | string | pd-balanced | no | Master boot disk type. |
| master_disk_size_gb | number | 100 | no | Master boot disk size (≥ 30). |
| num_workers | number | 2 | no | Primary workers (≥ 2 for HDFS replication). |
| worker_machine_type | string | n2-standard-4 | no | Primary worker machine type. |
| worker_disk_type | string | pd-balanced | no | Worker boot disk type. |
| worker_disk_size_gb | number | 200 | no | Worker boot disk size (≥ 30). |
| worker_num_local_ssds | number | 0 | no | Local SSDs per worker for shuffle/scratch. |
| secondary_worker_preemptibility | string | SPOT | no | PREEMPTIBLE, SPOT, or NON_PREEMPTIBLE. |
| autoscaling | object | {} | no | Secondary-pool autoscaling; max_secondary_workers = 0 disables it. |
| image_version | string | 2.2-debian12 | no | Pinned Dataproc image version. |
| optional_components | list(string) | [] | no | Optional components (JUPYTER, ZEPPELIN, FLINK, TRINO, …). |
| cluster_properties | map(string) | {} | no | Dataproc/Spark/Hadoop property overrides. |
| enable_component_gateway | bool | true | no | Expose Spark/YARN/Jupyter UIs via Component Gateway. |
| kms_key_name | string | null | no | Cloud KMS key for CMEK; null = Google-managed keys. |
| idle_delete_ttl | string | null | no | Auto-delete after idle duration (e.g. 1800s). |
| create_staging_bucket | bool | true | no | Create a dedicated staging/temp bucket. |
| staging_bucket_name | string | null | no | Existing bucket name when not creating one. |
| staging_bucket_age_days | number | 14 | no | Lifecycle age for staging objects. |
| labels | map(string) | {} | no | Extra labels merged onto cluster, policy, bucket. |
Outputs
| Name | Description |
|---|---|
| cluster_id | Fully qualified cluster ID (projects/<p>/regions/<r>/clusters/<n>). |
| cluster_name | Name of the Dataproc cluster. |
| region | Region the cluster runs in. |
| staging_bucket | Staging/temp bucket name used by the cluster. |
| master_instance_names | Compute Engine instance names of the master node(s). |
| http_ports | Map of Component Gateway HTTP endpoints (YARN, Spark History, Jupyter) when enabled. |
| autoscaling_policy_id | ID of the attached autoscaling policy, or null when disabled. |
Enterprise scenario
A retail analytics team runs nightly Spark ETL that joins clickstream and point-of-sale data sitting in BigQuery and GCS. They instantiate this module once per environment with num_workers = 3 baseline and an autoscaling policy that bursts to 20 SPOT secondary workers during the 02:00 load window, then drains them with a 300-second graceful decommission so running YARN containers get up to five minutes to finish before a node is removed on scale-down. Because idle_delete_ttl is set on the dev cluster and SPOT workers cost roughly 60–80% less than on-demand, the platform team cut their Dataproc spend by more than half while keeping prod on a CMEK-encrypted, private-IP cluster that satisfies the security team’s audit requirements.
Best practices
- Lock the cluster down by network and identity. Keep internal_ip_only = true, pass a dedicated least-privilege service_account, and never run on the default Compute Engine SA, which has broad project access that a compromised Spark job could abuse. Pair private IPs with Cloud NAT and Private Google Access so jobs can still reach GCS/BigQuery.
- Use SPOT secondary workers for autoscaling, primary workers for stability. HDFS data nodes live on primary workers (keep num_workers >= 2); let the preemptible/SPOT pool absorb burst compute, and set a graceful_decommission_timeout so YARN drains containers before the autoscaler removes nodes.
- Pin the image version and prefer ephemeral clusters. Floating on latest silently changes Spark/Hadoop versions between runs; pin something like 2.2-debian12. For batch pipelines, set idle_delete_ttl (or recreate per-job) so you only pay while jobs run.
- Own your encryption and staging. Supply a kms_key_name for CMEK on persistent disks and metadata, and let the module create a single staging bucket with a lifecycle rule instead of letting Dataproc scatter region-default buckets you forget to clean up.
- Right-size disks and use local SSDs for shuffle-heavy jobs. Shuffle spills to the boot disk by default; bump worker_disk_size_gb or add worker_num_local_ssds for wide joins and large sorts to avoid pd-balanced throughput becoming the bottleneck.
- Label everything for cost allocation. The module stamps managed-by, module, and env; add team and cost-center via labels so Dataproc VM and bucket spend is attributable in billing exports.