Terraform Module: GCP Dataproc — production-ready Spark/Hadoop clusters with autoscaling and CMEK

Quick take — A reusable hashicorp/google Terraform module for google_dataproc_cluster: private clusters, autoscaling policies, component gateway, preemptible secondary workers, CMEK encryption and staging buckets. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "dataproc" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"

  project_id   = "..."  # GCP project ID where the cluster is created.
  region       = "..."  # Dataproc region; must match the subnetwork region.
  cluster_name = "..."  # Cluster name; lowercased, `^[a-z]([-a-z0-9]*[a-z0-9])?$…
  environment  = "..."  # One of `dev`, `stg`, `prod`; drives staging-bucket `for…
  subnetwork   = "..."  # Subnetwork self-link/name; required for `internal_ip_on…
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.

What this module is

Cloud Dataproc is GCP’s managed Apache Spark, Hadoop, Hive, Presto and Flink service. You hand it a cluster shape — a master, some workers, a region — and it provisions the VMs, installs the open-source stack, wires up YARN/HDFS, and lets you submit jobs over the Dataproc Jobs API or the Component Gateway UIs. The catch is that a production Dataproc cluster is never just a master and two workers. It needs a deterministic staging and temp bucket (otherwise Dataproc auto-creates one per region and you lose track of cost), an autoscaling policy so the secondary worker pool grows under load and shrinks when idle, a private network with internal_ip_only = true so nodes never get external IPs, CMEK so the persistent disks and bucket are encrypted with your own KMS key, and a service account scoped to least privilege instead of the default Compute Engine SA.

Wiring all of that by hand in every project is where drift and copy-paste bugs creep in. This module wraps google_dataproc_cluster (plus an optional google_dataproc_autoscaling_policy) so a team declares the intent (“a private, autoscaling, encrypted Spark cluster in europe-west1”) in a dozen lines, and the module applies the safe defaults: a pinned image version, private IPs, shielded VMs, and a single staging bucket. An idle-delete TTL for ephemeral clusters is available but off by default (idle_delete_ttl = null).

When to use it

You run batch Spark/PySpark or Hive jobs on GCP and want repeatable, ephemeral-or-long-lived clusters provisioned through IaC rather than the gcloud dataproc clusters create flag soup.
You need autoscaling so transient ETL workloads spin secondary (preemptible) workers up and down without a human resizing the cluster.
You operate in a regulated or security-conscious environment that mandates private IPs, customer-managed encryption keys (CMEK), and non-default service accounts.
You want a standard staging/temp bucket lifecycle and consistent labelling for cost allocation across many teams’ clusters.
Reach for Dataproc Serverless (google_dataproc_batch) instead if you only run short batch jobs and never need a persistent cluster or HDFS — this module is for the managed cluster model.

Module structure

terraform-module-gcp-dataproc/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # autoscaling policy + dataproc cluster + staging bucket
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # cluster id/name + HTTP ports, bucket, policy id

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Dataproc requires lowercase names matching ^[a-z]([-a-z0-9]*[a-z0-9])?$
  cluster_name = lower(var.cluster_name)

  # Build the optional autoscaling block only when a policy is requested.
  autoscaling_enabled = var.autoscaling.max_secondary_workers > 0

  labels = merge(
    {
      managed-by = "terraform"
      module     = "gcp-dataproc"
      env        = var.environment
    },
    var.labels,
  )
}

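# Deterministic staging bucket so Dataproc does not fall back to its auto-created regional default.
# Note: temp_bucket is not set on the cluster below, so the temp bucket still uses Dataproc's default.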
resource "google_storage_bucket" "staging" {
  count = var.create_staging_bucket ? 1 : 0

  name                        = coalesce(var.staging_bucket_name, "${local.cluster_name}-${var.project_id}-dpstage")
  project                     = var.project_id
  location                    = var.region
  storage_class               = "STANDARD"
  uniform_bucket_level_access = true
  force_destroy               = var.environment != "prod"

  lifecycle_rule {
    condition {
      age = var.staging_bucket_age_days
    }
    action {
      type = "Delete"
    }
  }

  labels = local.labels
}

# Autoscaling policy governing the secondary (preemptible) worker pool.
resource "google_dataproc_autoscaling_policy" "this" {
  count = local.autoscaling_enabled ? 1 : 0

  policy_id = "${local.cluster_name}-asp"
  project   = var.project_id
  location  = var.region

  worker_config {
    min_instances = var.num_workers
    max_instances = var.num_workers
  }

  secondary_worker_config {
    min_instances = var.autoscaling.min_secondary_workers
    max_instances = var.autoscaling.max_secondary_workers
    weight        = 1
  }

  basic_algorithm {
    cooldown_period = var.autoscaling.cooldown_period

    yarn_config {
      graceful_decommission_timeout  = var.autoscaling.graceful_decommission_timeout
      scale_up_factor                = var.autoscaling.scale_up_factor
      scale_down_factor              = var.autoscaling.scale_down_factor
      scale_up_min_worker_fraction   = 0.0
      scale_down_min_worker_fraction = 0.0
    }
  }
}

resource "google_dataproc_cluster" "this" {
  name    = local.cluster_name
  project = var.project_id
  region  = var.region
  labels  = local.labels

  # Single cluster_config block; for_each = [1] means the dynamic wrapper always renders exactly once.
  dynamic "cluster_config" {
    for_each = [1]
    content {
      staging_bucket = var.create_staging_bucket ? google_storage_bucket.staging[0].name : var.staging_bucket_name

      gce_cluster_config {
        zone                   = var.zone
        subnetwork             = var.subnetwork
        internal_ip_only       = var.internal_ip_only
        service_account        = var.service_account
        service_account_scopes = ["cloud-platform"]
        tags                   = var.network_tags

        shielded_instance_config {
          enable_secure_boot          = true
          enable_vtpm                 = true
          enable_integrity_monitoring = true
        }
      }

      master_config {
        num_instances = var.num_masters
        machine_type  = var.master_machine_type

        disk_config {
          boot_disk_type    = var.master_disk_type
          boot_disk_size_gb = var.master_disk_size_gb
        }
      }

      worker_config {
        num_instances = var.num_workers
        machine_type  = var.worker_machine_type

        disk_config {
          boot_disk_type    = var.worker_disk_type
          boot_disk_size_gb = var.worker_disk_size_gb
          num_local_ssds    = var.worker_num_local_ssds
        }
      }

      # Preemptible/spot secondary workers, sized by the autoscaling policy.
      dynamic "preemptible_worker_config" {
        for_each = local.autoscaling_enabled ? [1] : []
        content {
          num_instances  = var.autoscaling.min_secondary_workers
          preemptibility = var.secondary_worker_preemptibility

          disk_config {
            boot_disk_type    = var.worker_disk_type
            boot_disk_size_gb = var.worker_disk_size_gb
          }
        }
      }

      software_config {
        image_version       = var.image_version
        optional_components = var.optional_components

        override_properties = var.cluster_properties
      }

      # Attach the autoscaling policy when one was created.
      dynamic "autoscaling_config" {
        for_each = local.autoscaling_enabled ? [1] : []
        content {
          policy_uri = google_dataproc_autoscaling_policy.this[0].id
        }
      }

      # Component Gateway exposes the Spark/YARN/Jupyter web UIs over an authenticated HTTPS proxy, without SSH tunnels.
      endpoint_config {
        enable_http_port_access = var.enable_component_gateway
      }

      # CMEK: encrypt the cluster VMs' persistent disks with a customer-managed key.
      dynamic "encryption_config" {
        for_each = var.kms_key_name == null ? [] : [1]
        content {
          kms_key_name = var.kms_key_name
        }
      }

      # Auto-delete idle ephemeral clusters to cap cost.
      dynamic "lifecycle_config" {
        for_each = var.idle_delete_ttl == null ? [] : [1]
        content {
          idle_delete_ttl = var.idle_delete_ttl
        }
      }
    }
  }

  timeouts {
    create = "45m"
    update = "45m"
    delete = "45m"
  }
}

variables.tf

variable "project_id" {
  description = "GCP project ID where the Dataproc cluster is created."
  type        = string
}

variable "region" {
  description = "Dataproc region, e.g. europe-west1. Must match the subnetwork region."
  type        = string
}

variable "zone" {
  description = "Specific zone for cluster VMs. Leave empty for Dataproc Auto Zone placement."
  type        = string
  default     = ""
}

variable "cluster_name" {
  description = "Cluster name. Lowercased; must match ^[a-z]([-a-z0-9]*[a-z0-9])?$ and be <= 51 chars."
  type        = string

  validation {
    condition     = can(regex("^[a-z]([-a-z0-9]*[a-z0-9])?$", lower(var.cluster_name))) && length(var.cluster_name) <= 51
    error_message = "cluster_name must be <= 51 chars, lowercase letters/digits/hyphens, starting with a letter."
  }
}

variable "environment" {
  description = "Environment label (dev/stg/prod). Drives force_destroy on the staging bucket."
  type        = string

  validation {
    condition     = contains(["dev", "stg", "prod"], var.environment)
    error_message = "environment must be one of: dev, stg, prod."
  }
}

variable "subnetwork" {
  description = "Self-link or short name of the subnetwork. Required when internal_ip_only = true."
  type        = string
}

variable "internal_ip_only" {
  description = "When true, cluster VMs get no external IPs (requires Private Google Access / Cloud NAT)."
  type        = bool
  default     = true
}

variable "service_account" {
  description = "Service account email for cluster VMs. Null uses the default Compute Engine SA (not recommended)."
  type        = string
  default     = null
}

variable "network_tags" {
  description = "Network tags applied to cluster VMs for firewall targeting."
  type        = list(string)
  default     = []
}

variable "num_masters" {
  description = "Number of masters. Use 1 for standard, 3 for High Availability mode."
  type        = number
  default     = 1

  validation {
    condition     = contains([1, 3], var.num_masters)
    error_message = "num_masters must be 1 (standard) or 3 (HA)."
  }
}

variable "master_machine_type" {
  description = "Machine type for the master node(s)."
  type        = string
  default     = "n2-standard-4"
}

variable "master_disk_type" {
  description = "Boot disk type for masters (pd-standard, pd-balanced, pd-ssd)."
  type        = string
  default     = "pd-balanced"
}

variable "master_disk_size_gb" {
  description = "Boot disk size (GB) for masters. Minimum 30."
  type        = number
  default     = 100

  validation {
    condition     = var.master_disk_size_gb >= 30
    error_message = "master_disk_size_gb must be at least 30."
  }
}

variable "num_workers" {
  description = "Number of primary (non-preemptible) workers. Minimum 2 for HDFS replication."
  type        = number
  default     = 2

  validation {
    condition     = var.num_workers >= 2
    error_message = "num_workers must be at least 2 so HDFS can replicate blocks."
  }
}

variable "worker_machine_type" {
  description = "Machine type for primary workers."
  type        = string
  default     = "n2-standard-4"
}

variable "worker_disk_type" {
  description = "Boot disk type for workers (pd-standard, pd-balanced, pd-ssd)."
  type        = string
  default     = "pd-balanced"
}

variable "worker_disk_size_gb" {
  description = "Boot disk size (GB) for workers. Minimum 30."
  type        = number
  default     = 200

  validation {
    condition     = var.worker_disk_size_gb >= 30
    error_message = "worker_disk_size_gb must be at least 30."
  }
}

variable "worker_num_local_ssds" {
  description = "Number of local SSDs per primary worker for shuffle/scratch space."
  type        = number
  default     = 0
}

variable "secondary_worker_preemptibility" {
  description = "Preemptibility of secondary workers: PREEMPTIBLE, SPOT, or NON_PREEMPTIBLE."
  type        = string
  default     = "SPOT"

  validation {
    condition     = contains(["PREEMPTIBLE", "SPOT", "NON_PREEMPTIBLE"], var.secondary_worker_preemptibility)
    error_message = "secondary_worker_preemptibility must be PREEMPTIBLE, SPOT, or NON_PREEMPTIBLE."
  }
}

variable "autoscaling" {
  description = "Autoscaling policy for the secondary worker pool. Set max_secondary_workers = 0 to disable autoscaling entirely."
  type = object({
    min_secondary_workers         = optional(number, 0)
    max_secondary_workers         = optional(number, 0)
    cooldown_period               = optional(string, "120s")
    graceful_decommission_timeout = optional(string, "300s")
    scale_up_factor               = optional(number, 0.5)
    scale_down_factor             = optional(number, 1.0)
  })
  default = {}

  validation {
    condition     = var.autoscaling.max_secondary_workers >= var.autoscaling.min_secondary_workers
    error_message = "autoscaling.max_secondary_workers must be >= min_secondary_workers."
  }
}

variable "image_version" {
  description = "Dataproc image version, e.g. 2.2-debian12. Pin it; do not float on 'latest'."
  type        = string
  default     = "2.2-debian12"
}

variable "optional_components" {
  description = "Optional components to install (e.g. JUPYTER, ZEPPELIN, HIVE_WEBHCAT, FLINK, TRINO)."
  type        = list(string)
  default     = []
}

variable "cluster_properties" {
  description = "Map of Dataproc/Hadoop/Spark property overrides, e.g. {\"spark:spark.executor.memory\" = \"6g\"}."
  type        = map(string)
  default     = {}
}

variable "enable_component_gateway" {
  description = "Enable the Component Gateway to reach Spark/YARN/Jupyter UIs without SSH tunnels."
  type        = bool
  default     = true
}

variable "kms_key_name" {
  description = "Full resource ID of a Cloud KMS key for CMEK (projects/.../cryptoKeys/...). Null = Google-managed keys."
  type        = string
  default     = null
}

variable "idle_delete_ttl" {
  description = "Auto-delete the cluster after this idle duration, e.g. \"1800s\". Null = never auto-delete."
  type        = string
  default     = null
}

variable "create_staging_bucket" {
  description = "Create a dedicated staging/temp bucket. When false, supply staging_bucket_name."
  type        = bool
  default     = true
}

variable "staging_bucket_name" {
  description = "Name of an existing staging bucket. Required when create_staging_bucket = false."
  type        = string
  default     = null
}

variable "staging_bucket_age_days" {
  description = "Lifecycle age (days) after which staging objects are deleted."
  type        = number
  default     = 14
}

variable "labels" {
  description = "Additional labels merged onto the cluster and the staging bucket."
  type        = map(string)
  default     = {}
}
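
One gap in the inputs above: nothing stops a caller from setting create_staging_bucket = false while leaving staging_bucket_name null, in which case the cluster's staging_bucket argument evaluates to null. A minimal sketch of a guard follows, assuming it is added to main.tf; it is not part of the module as published, and the block name staging_bucket_supplied is hypothetical. It uses a check block, which Terraform supports from 1.5 onward, matching the required_version pin in versions.tf.

```hcl
# Hypothetical addition, not in the published module.
# Flags the combination create_staging_bucket = false + no bucket name.
check "staging_bucket_supplied" {
  assert {
    condition     = var.create_staging_bucket || var.staging_bucket_name != null
    error_message = "Set staging_bucket_name when create_staging_bucket = false."
  }
}
```

A failed check is reported as a warning and does not block the apply. If the combination should stop the run, a precondition inside the cluster resource's lifecycle block would be the stricter alternative.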

outputs.tf

output "cluster_id" {
  description = "Fully qualified Dataproc cluster ID (projects/<p>/regions/<r>/clusters/<n>)."
  value       = google_dataproc_cluster.this.id
}

output "cluster_name" {
  description = "Name of the Dataproc cluster."
  value       = google_dataproc_cluster.this.name
}

output "region" {
  description = "Region the cluster runs in."
  value       = google_dataproc_cluster.this.region
}

output "staging_bucket" {
  description = "Staging/temp bucket name used by the cluster."
  value       = google_dataproc_cluster.this.cluster_config[0].staging_bucket
}

output "master_instance_names" {
  description = "Compute Engine instance names of the master node(s)."
  value       = google_dataproc_cluster.this.cluster_config[0].master_config[0].instance_names
}

output "http_ports" {
  description = "Map of Component Gateway HTTP endpoints (YARN, Spark History, Jupyter, etc.) when enabled."
  value       = try(google_dataproc_cluster.this.cluster_config[0].endpoint_config[0].http_ports, {})
}

output "autoscaling_policy_id" {
  description = "ID of the autoscaling policy attached to the cluster, or null when autoscaling is disabled."
  value       = try(google_dataproc_autoscaling_policy.this[0].id, null)
}

How to use it

module "dataproc" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"

  project_id   = "kv-data-prod"
  region       = "europe-west1"
  cluster_name = "etl-spark-prod"
  environment  = "prod"

  # Private networking — no external IPs; relies on Cloud NAT + Private Google Access.
  subnetwork       = "projects/kv-net-prod/regions/europe-west1/subnetworks/dataproc-priv"
  internal_ip_only = true
  service_account  = "dataproc-etl@kv-data-prod.iam.gserviceaccount.com"
  network_tags     = ["dataproc", "egress-nat"]

  # Cluster shape.
  num_masters         = 1
  num_workers         = 3
  worker_machine_type = "n2-standard-8"

  # Grow up to 20 SPOT secondary workers under YARN pressure; on scale-down, give YARN up to 5 minutes to decommission a node gracefully.
  autoscaling = {
    min_secondary_workers         = 0
    max_secondary_workers         = 20
    graceful_decommission_timeout = "300s"
  }

  image_version       = "2.2-debian12"
  optional_components = ["JUPYTER"]
  cluster_properties = {
    "spark:spark.dynamicAllocation.enabled" = "true"
    "spark:spark.sql.shuffle.partitions"    = "400"
  }

  # Customer-managed encryption key for the cluster's persistent disks.
  kms_key_name = "projects/kv-sec-prod/locations/europe-west1/keyRings/dataproc/cryptoKeys/cluster-cmek"

  labels = {
    team        = "data-platform"
    cost-center = "4412"
  }
}

# Downstream: submit a PySpark batch job to the cluster created above,
# referencing the module's cluster_name output.
resource "google_dataproc_job" "nightly_etl" {
  project = "kv-data-prod"
  region  = module.dataproc.region

  placement {
    cluster_name = module.dataproc.cluster_name
  }

  pyspark_config {
    main_python_file_uri = "gs://kv-data-prod-jobs/etl/nightly_load.py"
    args                 = ["--date", "2026-06-09"]
    properties = {
      "spark.executor.cores" = "4"
    }
  }

  labels = { pipeline = "nightly-etl" }
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config — live/prod/dataproc/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"
}

inputs = {
  project_id   = "..."
  region       = "..."
  cluster_name = "..."
  environment  = "..."
  subnetwork   = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/dataproc && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stg / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
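
The root config shown in step 1 only generates the backend. A provider block could be generated the same way; the sketch below assumes a generate block in live/terragrunt.hcl and reuses the placeholder project and region from the Quickstart, so treat the values as stand-ins rather than the author's setup.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block.
# Project and region are placeholders copied from the Quickstart.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}
EOF
}
```

With this in place, every child terragrunt.hcl that includes the root gets a provider.tf written into its working directory before Terraform runs.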

Inputs

Name	Type	Default	Required	Description
project_id	string	—	yes	GCP project ID where the cluster is created.
region	string	—	yes	Dataproc region; must match the subnetwork region.
zone	string	`""`	no	Specific zone, or empty for Dataproc Auto Zone placement.
cluster_name	string	—	yes	Cluster name; lowercased, `^[a-z]([-a-z0-9]*[a-z0-9])?$`, ≤ 51 chars.
environment	string	—	yes	One of `dev`, `stg`, `prod`; drives staging-bucket `force_destroy`.
subnetwork	string	—	yes	Subnetwork self-link/name; required for `internal_ip_only`.
internal_ip_only	bool	`true`	no	Disable external IPs on cluster VMs.
service_account	string	`null`	no	VM service account; null uses the default Compute SA.
network_tags	list(string)	`[]`	no	Network tags for firewall targeting.
num_masters	number	`1`	no	`1` (standard) or `3` (HA).
master_machine_type	string	`n2-standard-4`	no	Master machine type.
master_disk_type	string	`pd-balanced`	no	Master boot disk type.
master_disk_size_gb	number	`100`	no	Master boot disk size (≥ 30).
num_workers	number	`2`	no	Primary workers (≥ 2 for HDFS replication).
worker_machine_type	string	`n2-standard-4`	no	Primary worker machine type.
worker_disk_type	string	`pd-balanced`	no	Worker boot disk type.
worker_disk_size_gb	number	`200`	no	Worker boot disk size (≥ 30).
worker_num_local_ssds	number	`0`	no	Local SSDs per worker for shuffle/scratch.
secondary_worker_preemptibility	string	`SPOT`	no	`PREEMPTIBLE`, `SPOT`, or `NON_PREEMPTIBLE`.
autoscaling	object	`{}`	no	Secondary-pool autoscaling; `max_secondary_workers = 0` disables it.
image_version	string	`2.2-debian12`	no	Pinned Dataproc image version.
optional_components	list(string)	`[]`	no	Optional components (JUPYTER, ZEPPELIN, FLINK, TRINO, …).
cluster_properties	map(string)	`{}`	no	Dataproc/Spark/Hadoop property overrides.
enable_component_gateway	bool	`true`	no	Expose Spark/YARN/Jupyter UIs via Component Gateway.
kms_key_name	string	`null`	no	Cloud KMS key for CMEK; null = Google-managed keys.
idle_delete_ttl	string	`null`	no	Auto-delete after idle duration (e.g. `1800s`).
create_staging_bucket	bool	`true`	no	Create a dedicated staging/temp bucket.
staging_bucket_name	string	`null`	no	Existing bucket name when not creating one.
staging_bucket_age_days	number	`14`	no	Lifecycle age for staging objects.
labels	map(string)	`{}`	no	Extra labels merged onto the cluster and staging bucket.

Outputs

Name	Description
cluster_id	Fully qualified cluster ID (`projects/<p>/regions/<r>/clusters/<n>`).
cluster_name	Name of the Dataproc cluster.
region	Region the cluster runs in.
staging_bucket	Staging/temp bucket name used by the cluster.
master_instance_names	Compute Engine instance names of the master node(s).
http_ports	Map of Component Gateway HTTP endpoints (YARN, Spark History, Jupyter) when enabled.
autoscaling_policy_id	ID of the attached autoscaling policy, or null when disabled.
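
As an illustration of consuming these outputs, a calling root module might re-export them as below. This is a sketch: it assumes the module is instantiated as module "dataproc" (as in the examples above), and the output names dataproc_ui_endpoints and dataproc_staging_uri are hypothetical.

```hcl
# Root-module outputs; the names are hypothetical.
output "dataproc_ui_endpoints" {
  description = "Component Gateway URLs, empty when the gateway is disabled."
  value       = module.dataproc.http_ports
}

output "dataproc_staging_uri" {
  description = "gs:// URI of the cluster's staging bucket."
  value       = "gs://${module.dataproc.staging_bucket}"
}
```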

Enterprise scenario

A retail analytics team runs nightly Spark ETL that joins clickstream and point-of-sale data sitting in BigQuery and GCS. They instantiate this module once per environment with a baseline of num_workers = 3 and an autoscaling policy that bursts to 20 SPOT secondary workers during the 02:00 load window, then scales them back in with a 300-second graceful decommission so running YARN containers have time to finish before a node is removed. Because idle_delete_ttl is set on the dev cluster and SPOT workers cost roughly 60–80% less than on-demand, the platform team cut their Dataproc spend by more than half while keeping prod on a private-IP cluster with CMEK-encrypted disks that satisfies the security team’s audit requirements.
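
A sketch of what the dev instantiation in this scenario could look like, using only inputs documented above. The project, subnetwork and cluster names are hypothetical; the TTL is the example value from the idle_delete_ttl description.

```hcl
module "dataproc_dev" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-dataproc?ref=v1.0.0"

  project_id   = "kv-data-dev" # hypothetical project
  region       = "europe-west1"
  cluster_name = "etl-spark-dev"
  environment  = "dev" # staging bucket gets force_destroy = true
  subnetwork   = "projects/kv-net-dev/regions/europe-west1/subnetworks/dataproc-priv" # hypothetical

  num_workers = 3

  # Secondary workers default to SPOT via secondary_worker_preemptibility.
  autoscaling = {
    max_secondary_workers         = 20
    graceful_decommission_timeout = "300s"
  }

  # Delete the cluster after 30 minutes without jobs.
  idle_delete_ttl = "1800s"
}
```

Once Dataproc deletes the idle cluster, the next terraform apply will plan to recreate it, so this pattern fits clusters that a pipeline applies on demand.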

Best practices

Lock the cluster down by network and identity. Keep internal_ip_only = true, pass a dedicated least-privilege service_account, and never run on the default Compute Engine SA — it has broad project access that a compromised Spark job could abuse. Pair private IPs with Cloud NAT and Private Google Access so jobs can still reach GCS/BigQuery.
Use SPOT secondary workers for autoscaling, primary workers for stability. HDFS data nodes live on primary workers (keep num_workers >= 2); let the preemptible/SPOT pool absorb burst compute, and set a graceful_decommission_timeout so YARN can drain running containers before the autoscaler removes a node. The timeout does not protect against preemption itself, so jobs should still tolerate losing a SPOT worker.
Pin the image version and prefer ephemeral clusters. Floating on latest silently changes Spark/Hadoop versions between runs; pin something like 2.2-debian12. For batch pipelines, set idle_delete_ttl (or recreate per-job) so you only pay while jobs run.
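Own your encryption and staging. Supply a kms_key_name so the cluster's persistent disks are encrypted with your own key, and let the module create a single staging bucket with a lifecycle rule instead of letting Dataproc fall back to region-default buckets you forget to clean up. As written, main.tf does not set a KMS key on the staging bucket, so the bucket stays on Google-managed encryption unless you configure that separately.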
Right-size disks and use local SSDs for shuffle-heavy jobs. Shuffle spills to the boot disk by default; bump worker_disk_size_gb or add worker_num_local_ssds for wide joins and large sorts to avoid pd-balanced throughput becoming the bottleneck.
Label everything for cost allocation. The module stamps managed-by, module, and env; add team and cost-center via labels so Dataproc VM and bucket spend is attributable in billing exports.
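
The identity and encryption advice above implies some prerequisites outside the module. The following is a minimal sketch under stated assumptions: the resource names are hypothetical, the project and key IDs are taken from the usage example, and roles/dataproc.worker is only the baseline role for the cluster VM identity (access to job data in GCS or BigQuery needs additional grants).

```hcl
# Hypothetical prerequisites; names and IDs follow the usage example.
data "google_project" "data" {
  project_id = "kv-data-prod"
}

# Dedicated identity for cluster VMs instead of the default Compute Engine SA.
resource "google_service_account" "dataproc" {
  project      = "kv-data-prod"
  account_id   = "dataproc-etl"
  display_name = "Dataproc ETL cluster VMs"
}

# Baseline role the cluster VM service account needs.
resource "google_project_iam_member" "dataproc_worker" {
  project = "kv-data-prod"
  role    = "roles/dataproc.worker"
  member  = "serviceAccount:${google_service_account.dataproc.email}"
}

# Persistent-disk CMEK: the Compute Engine service agent of the cluster
# project must be allowed to use the key.
resource "google_kms_crypto_key_iam_member" "compute_agent" {
  crypto_key_id = "projects/kv-sec-prod/locations/europe-west1/keyRings/dataproc/cryptoKeys/cluster-cmek"
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${data.google_project.data.number}@compute-system.iam.gserviceaccount.com"
}
```

The service account email can then be passed to the module as service_account = google_service_account.dataproc.email, which also gives Terraform the dependency ordering between the identity and the cluster.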