Terraform Module: GCP Document AI — reusable, regional document processors with KMS and IAM baked in

Quick take — Provision Google Cloud Document AI processors with Terraform: a var-driven module wrapping google_document_ai_processor with default version pinning, CMEK encryption, and least-privilege IAM. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "google" {
  project = "my-project"
  region  = "us-central1"
}

module "document_ai" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"

  project_id     = "..."  # GCP project ID that owns the processor.
  display_name   = "..."  # Human-readable name (1–64 chars) shown in the console.
  processor_type = "..."  # Processor type, e.g. `INVOICE_PROCESSOR`, `FORM_PARSER_…
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
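
To confirm the deployment without opening the console, the root configuration can surface the module's outputs. This is a minimal sketch that assumes the module block is named document_ai as in the Quickstart; the output names docai_processor_id and docai_process_endpoint are hypothetical and not part of the module:

```hcl
# Root-level outputs (names are illustrative, not defined by the module).
output "docai_processor_id" {
  description = "Fully-qualified ID of the processor created by the module."
  value       = module.document_ai.id
}

output "docai_process_endpoint" {
  description = "REST endpoint for online :process calls."
  value       = module.document_ai.process_endpoint
}
```

After an apply, terraform output docai_process_endpoint should print the URL that callers need.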

What this module is

Google Cloud Document AI is a managed ML service that turns unstructured documents — invoices, contracts, IDs, W-2s, bank statements — into structured data. You create a processor (for example an Invoice Parser, a Form Parser, or a custom Document Extractor you trained), and then call its process / batchProcess endpoints to extract entities, key-value pairs, tables, and full OCR text. Processors are regional (us, eu, or a specific region like us-central1) and the region is immutable once created, so the location decision is effectively permanent and worth encoding as code.

The catch is that a single processor is rarely a complete unit of deployment. In production you also care about which processor version is the live default (Google ships new versions and auto-upgrades unless you pin one), whether documents at rest are encrypted with your own Cloud KMS key (CMEK) instead of Google-managed keys, and which service accounts are allowed to invoke the processor versus administer it. Click-ops in the console gets none of that reproducibly, and it does not survive a region migration or a clean-room rebuild in a second project.

This module wraps google_document_ai_processor together with the two things that almost always travel with it: a pinned default processor version (google_document_ai_processor_default_version) and least-privilege IAM bindings for the identities that invoke or administer the processor. Note that the bindings are granted at the project level with google_project_iam_member, so they apply to every processor in the project rather than to this one alone. All of it sits behind a small, var-driven interface. You hand it a type, a location, and an optional KMS key, and you get back a stable processor id and name that the rest of your stack (Cloud Functions, Workflows, Eventarc, Cloud Run jobs) can reference without hardcoding the auto-generated processor ID.

When to use it

If you only need a one-off processor for a manual experiment in the console, a module is overkill — create it by hand. Reach for this once the processor is a dependency other infrastructure relies on.

Module structure

terraform-module-gcp-document-ai/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # processor, pinned default version, IAM bindings
├── variables.tf     # typed, validated inputs
└── outputs.tf       # processor id/name + endpoint pieces

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # Set of principals that may *invoke* the processor (process / batchProcess).
  # roles/documentai.apiUser includes processors.processOnline and processors.processBatch.
  # toset() de-duplicates the list, so a repeated member cannot break for_each.
  user_bindings = toset(var.processor_user_members)
}

# The Document AI processor itself. Location is immutable post-creation.
resource "google_document_ai_processor" "this" {
  project      = var.project_id
  location     = var.location
  display_name = var.display_name
  type         = var.processor_type

  # Optional CMEK: encrypt processed documents at rest with your own key.
  kms_key_name = var.kms_key_name
}

# Pin the live default version so extraction output is stable and reproducible.
# Only created when an explicit version is supplied; otherwise Google manages it.
resource "google_document_ai_processor_default_version" "pinned" {
  count = var.default_processor_version != null ? 1 : 0

  processor = google_document_ai_processor.this.id
  version   = "${google_document_ai_processor.this.id}/processorVersions/${var.default_processor_version}"
}

# Least-privilege role for identities allowed to call the processor at runtime.
# Granted at project level, so it covers every processor in var.project_id.
resource "google_project_iam_member" "user" {
  for_each = local.user_bindings

  project = var.project_id
  role    = "roles/documentai.apiUser"
  member  = each.value
}

# Optional: identities allowed to administer processors in the project (create versions, etc.).
resource "google_project_iam_member" "admin" {
  for_each = toset(var.processor_admin_members)

  project = var.project_id
  role    = "roles/documentai.editor"
  member  = each.value
}
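
Because location is immutable, changing var.location makes Terraform destroy and recreate the processor, which also changes the generated processor ID that downstream systems depend on. One way to guard against an accidental replacement is a lifecycle block on the processor. The variation below is a sketch of that idea, not what the module ships:

```hcl
# Variation of the processor resource with a destroy guard.
# With prevent_destroy set, any plan that would delete or replace the
# processor fails instead of proceeding.
resource "google_document_ai_processor" "this" {
  project      = var.project_id
  location     = var.location
  display_name = var.display_name
  type         = var.processor_type
  kms_key_name = var.kms_key_name

  lifecycle {
    prevent_destroy = true
  }
}
```

The trade-off is that an intentional teardown then requires removing the guard first, so this suits long-lived production processors better than throwaway environments.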

variables.tf

variable "project_id" {
  description = "GCP project ID that owns the Document AI processor."
  type        = string
}

variable "location" {
  description = "Document AI multi-region or region. Immutable after creation."
  type        = string
  default     = "us"

  validation {
    condition     = contains(["us", "eu"], var.location) || can(regex("^[a-z]+-[a-z]+[0-9]+$", var.location))
    error_message = "location must be 'us', 'eu', or a region like 'us-central1' / 'europe-west4'."
  }
}

variable "display_name" {
  description = "Human-readable processor name shown in the console."
  type        = string

  validation {
    condition     = length(var.display_name) >= 1 && length(var.display_name) <= 64
    error_message = "display_name must be between 1 and 64 characters."
  }
}

variable "processor_type" {
  description = "Processor type, e.g. INVOICE_PROCESSOR, FORM_PARSER_PROCESSOR, OCR_PROCESSOR, CUSTOM_EXTRACTION_PROCESSOR. Call the Document AI fetchProcessorTypes API method to list the valid set for your location."
  type        = string

  validation {
    condition     = can(regex("^[A-Z][A-Z0-9_]+$", var.processor_type))
    error_message = "processor_type must be an upper-snake-case identifier such as INVOICE_PROCESSOR."
  }
}

variable "kms_key_name" {
  description = "Optional Cloud KMS CryptoKey resource ID for CMEK at rest. Must live in the same location as the processor. Null = Google-managed encryption."
  type        = string
  default     = null

  validation {
    condition     = var.kms_key_name == null || can(regex("^projects/.+/locations/.+/keyRings/.+/cryptoKeys/.+$", var.kms_key_name))
    error_message = "kms_key_name must be a full CryptoKey resource ID: projects/.../locations/.../keyRings/.../cryptoKeys/..."
  }
}

variable "default_processor_version" {
  description = "Optional processor version ID (the short ID, e.g. 'pretrained-invoice-v1.3-2022-07-15') to pin as the default. Null lets Google manage the default version."
  type        = string
  default     = null
}

variable "processor_user_members" {
  description = "IAM members granted roles/documentai.apiUser (runtime invoke), e.g. [\"serviceAccount:invoke-sa@proj.iam.gserviceaccount.com\"]."
  type        = list(string)
  default     = []
}

variable "processor_admin_members" {
  description = "IAM members granted roles/documentai.editor (manage processor & versions)."
  type        = list(string)
  default     = []
}
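
The two member lists above are passed straight to IAM, so a malformed entry only fails at apply time. A validation block can catch it during plan instead. The sketch below shows one way to do that for processor_user_members; the accepted prefixes are an assumption and deliberately incomplete (IAM also accepts forms such as principal:// identifiers), so adjust the pattern to the member types you actually use:

```hcl
variable "processor_user_members" {
  description = "IAM members granted roles/documentai.apiUser (runtime invoke)."
  type        = list(string)
  default     = []

  validation {
    # Assumed prefix list; extend it if you bind other member types.
    condition = alltrue([
      for m in var.processor_user_members :
      can(regex("^(user|group|serviceAccount|domain):.+$", m))
    ])
    error_message = "Each member must start with user:, group:, serviceAccount:, or domain:."
  }
}
```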

outputs.tf

output "id" {
  description = "Fully-qualified processor resource ID: projects/{project}/locations/{location}/processors/{processor}."
  value       = google_document_ai_processor.this.id
}

output "name" {
  description = "Server-assigned processor name (the short processor ID segment)."
  value       = google_document_ai_processor.this.name
}

output "display_name" {
  description = "Human-readable display name of the processor."
  value       = google_document_ai_processor.this.display_name
}

output "location" {
  description = "Region/multi-region the processor lives in (immutable)."
  value       = google_document_ai_processor.this.location
}

output "process_endpoint" {
  description = "Regional REST endpoint for online processing of this processor."
  value       = "https://${google_document_ai_processor.this.location}-documentai.googleapis.com/v1/${google_document_ai_processor.this.id}:process"
}

output "default_version_pinned" {
  description = "Whether an explicit default processor version is pinned by this module."
  value       = var.default_processor_version != null
}
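
Callers that need the pinned version itself, rather than a boolean, could read it from the counted resource. A hedged sketch follows; the output name default_version_name is hypothetical and not part of the module as published:

```hcl
# Hypothetical extra output. one() returns the single element of the
# splat list, or null when the counted resource was not created.
output "default_version_name" {
  description = "Full resource name of the pinned default version, or null when Google manages the default."
  value       = one(google_document_ai_processor_default_version.pinned[*].version)
}
```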

How to use it

module "document_ai" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"

  project_id     = "kloudvin-docpipeline-prod"
  location       = "eu" # keep invoice data in the EU multi-region
  display_name   = "ap-invoice-parser-prod"
  processor_type = "INVOICE_PROCESSOR"

  # Pin the model so extracted fields never change without a code review.
  default_processor_version = "pretrained-invoice-v1.3-2022-07-15"

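  # CMEK at rest. The key must sit in the KMS location that matches the processor;
  # for the eu multi-region that is assumed to be the KMS europe multi-region.
  kms_key_name = "projects/kloudvin-docpipeline-prod/locations/europe/keyRings/docai/cryptoKeys/docai-cmek"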

  # Runtime function invokes the processor; CI service account administers it.
  processor_user_members  = ["serviceAccount:invoice-fn@kloudvin-docpipeline-prod.iam.gserviceaccount.com"]
  processor_admin_members = ["serviceAccount:terraform-ci@kloudvin-docpipeline-prod.iam.gserviceaccount.com"]
}

# Downstream: a Cloud Function that calls the processor, wired via module outputs.
resource "google_cloudfunctions2_function" "invoice_extractor" {
  name     = "invoice-extractor"
  location = "europe-west4"
  project  = "kloudvin-docpipeline-prod"

  build_config {
    runtime     = "python312"
    entry_point = "extract"
    source {
      storage_source {
        bucket = "kloudvin-fn-source"
        object = "invoice-extractor.zip"
      }
    }
  }

  service_config {
    service_account_email = "invoice-fn@kloudvin-docpipeline-prod.iam.gserviceaccount.com"

    environment_variables = {
      # The function reads these instead of hardcoding the generated processor ID.
      DOCAI_PROCESSOR_ID = module.document_ai.id
      DOCAI_ENDPOINT     = module.document_ai.process_endpoint
    }
  }
}
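
CMEK only works if Document AI itself is allowed to use the key, and the module does not manage that grant. The sketch below shows one way to add it alongside the module call. The variable docai_kms_key_id is hypothetical (pass the same value you give to kms_key_name), and the service agent email pattern is an assumption that should be checked against the project's IAM page before relying on it:

```hcl
# Hypothetical input: the same CryptoKey ID passed to the module as kms_key_name.
variable "docai_kms_key_id" {
  description = "Full CryptoKey resource ID used for Document AI CMEK."
  type        = string
}

# Look up the project number; the service agent email is derived from it.
data "google_project" "docai" {
  project_id = "kloudvin-docpipeline-prod"
}

locals {
  # ASSUMPTION: naming pattern of the Document AI service agent. Verify it.
  docai_service_agent = "serviceAccount:service-${data.google_project.docai.number}@gcp-sa-prod-dai-core.iam.gserviceaccount.com"
}

# Allow the service agent to encrypt and decrypt with the key.
resource "google_kms_crypto_key_iam_member" "docai_cmek" {
  crypto_key_id = var.docai_kms_key_id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = local.docai_service_agent
}
```

If the processor is created before this grant exists, creation may fail, so adding depends_on = [google_kms_crypto_key_iam_member.docai_cmek] to the module block is a reasonable precaution.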

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config in live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "gcs"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...GCS state bucket + prefix per path...
  }
}

2. Module config in live/prod/document_ai/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"
}

inputs = {
  project_id     = "..."
  display_name   = "..."
  processor_type = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/document_ai && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply       # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
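
To make the per-environment override concrete, a second environment can reuse the same module source with different inputs. The sketch below assumes a sibling file at live/dev/document_ai/terragrunt.hcl; the path, project ID, and display name are hypothetical:

```hcl
# live/dev/document_ai/terragrunt.hcl (hypothetical dev environment)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"
}

inputs = {
  project_id     = "kloudvin-docpipeline-dev" # illustrative project ID
  display_name   = "ap-invoice-parser-dev"
  processor_type = "INVOICE_PROCESSOR"

  # default_processor_version is left unset here, so Google manages the
  # default in dev while prod stays pinned.
}
```

Leaving dev unpinned is one possible choice: it surfaces new model versions early, at the cost of dev and prod output diverging until prod's pin is bumped.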

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| project_id | string | (none) | Yes | GCP project ID that owns the processor. |
| location | string | "us" | No | Multi-region (us/eu) or region (e.g. us-central1). Immutable after creation. |
| display_name | string | (none) | Yes | Human-readable name (1–64 chars) shown in the console. |
| processor_type | string | (none) | Yes | Processor type, e.g. INVOICE_PROCESSOR, FORM_PARSER_PROCESSOR, OCR_PROCESSOR. |
| kms_key_name | string | null | No | Full Cloud KMS CryptoKey ID for CMEK at rest; must be co-located with the processor. |
| default_processor_version | string | null | No | Short version ID to pin as default; null lets Google manage it. |
| processor_user_members | list(string) | [] | No | Members granted roles/documentai.apiUser (runtime invoke). |
| processor_admin_members | list(string) | [] | No | Members granted roles/documentai.editor (manage processor/versions). |

Outputs

| Name | Description |
| --- | --- |
| id | Fully-qualified processor resource ID (projects/.../locations/.../processors/...). |
| name | Server-assigned short processor ID segment. |
| display_name | Human-readable display name of the processor. |
| location | Region/multi-region the processor lives in. |
| process_endpoint | Regional REST endpoint for online :process calls. |
| default_version_pinned | Boolean indicating whether an explicit default version is pinned. |

Enterprise scenario

A pan-European insurer ingests roughly 40,000 scanned claim invoices per day. An upload to a regional Cloud Storage bucket fires Eventarc, which triggers a Cloud Run job that calls this module’s INVOICE_PROCESSOR (deployed to the eu multi-region with CMEK so no document leaves EU jurisdiction or is encrypted with a key the insurer doesn’t control). Because the module pins default_processor_version, the finance team’s downstream reconciliation logic stays stable through Google’s quarterly model refreshes, and any version bump is a reviewed pull request rather than a silent change. The same Terraform definition is reused verbatim in a staging project against synthetic invoices, giving the team a faithful pre-prod rehearsal of extraction accuracy.

Best practices
