Quick take — Provision Google Cloud Document AI processors with Terraform: a var-driven module wrapping google_document_ai_processor with default version pinning, CMEK encryption, and least-privilege IAM. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "google" {
project = "my-project"
region = "us-central1"
}
module "document_ai" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"
project_id = "..." # GCP project ID that owns the processor.
display_name = "..." # Human-readable name (1–64 chars) shown in the console.
processor_type = "..." # Processor type, e.g. `INVOICE_PROCESSOR`, `FORM_PARSER_…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Google Cloud Document AI is a managed ML service that turns unstructured documents — invoices, contracts, IDs, W-2s, bank statements — into structured data. You create a processor (for example an Invoice Parser, a Form Parser, or a custom Document Extractor you trained), and then call its process / batchProcess endpoints to extract entities, key-value pairs, tables, and full OCR text. Processors are regional (us, eu, or a specific region like us-central1) and the region is immutable once created, so the location decision is effectively permanent and worth encoding as code.
The catch is that a single processor is rarely a complete unit of deployment. In production you also care about which processor version is the live default (Google ships new versions and auto-upgrades unless you pin one), whether documents at rest are encrypted with your own Cloud KMS key (CMEK) instead of Google-managed keys, and which service accounts are allowed to invoke the processor versus administer it. Click-ops in the console gets none of that reproducibly, and it does not survive a region migration or a clean-room rebuild in a second project.
This module wraps google_document_ai_processor together with the two things that almost always travel with it, a pinned default processor version (google_document_ai_processor_default_version) and least-privilege IAM grants for invoking and administering Document AI, behind a small, var-driven interface. The grants are made at project level with google_project_iam_member, so they apply to every processor in the project rather than to this one alone. You hand it a type, a location, and an optional KMS key, and you get back a stable processor id and name that the rest of your stack (Cloud Functions, Workflows, Eventarc, Cloud Run jobs) can reference without hardcoding the auto-generated processor ID.
When to use it
- You run an automated document pipeline (Cloud Storage upload to Eventarc to Cloud Run/Functions to Document AI to BigQuery) and need the processor to exist before the consumers deploy.
- You must pin a processor version for compliance or reproducibility, so extraction output does not silently change when Google releases a new model.
- You need CMEK / customer-managed encryption on documents processed at rest to satisfy data-residency or key-control requirements.
- You stamp out the same processors across dev / staging / prod (or across regions for data residency) and want a single definition rather than three console clicks.
- You want least-privilege access: a runtime service account that holds only roles/documentai.apiUser (online and batch processing), separated from an admin identity that manages the processor.
If you only need a one-off processor for a manual experiment in the console, a module is overkill — create it by hand. Reach for this once the processor is a dependency other infrastructure relies on.
Module structure
terraform-module-gcp-document-ai/
├── versions.tf # provider + Terraform version pins
├── main.tf # processor, pinned default version, IAM bindings
├── variables.tf # typed, validated inputs
└── outputs.tf # processor id/name + endpoint pieces
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
main.tf
locals {
# Set of principals that may *invoke* the processor (process / batchProcess).
# roles/documentai.apiUser grants processors.processOnline + processors.processBatch.
# toset() de-duplicates the list, so a repeated member cannot break for_each.
user_bindings = toset(var.processor_user_members)
}
# The Document AI processor itself. Location is immutable post-creation.
resource "google_document_ai_processor" "this" {
project = var.project_id
location = var.location
display_name = var.display_name
type = var.processor_type
# Optional CMEK: encrypt processed documents at rest with your own key.
kms_key_name = var.kms_key_name
}
# Pin the live default version so extraction output is stable and reproducible.
# Only created when an explicit version is supplied; otherwise Google manages it.
resource "google_document_ai_processor_default_version" "pinned" {
count = var.default_processor_version != null ? 1 : 0
processor = google_document_ai_processor.this.id
version = "${google_document_ai_processor.this.id}/processorVersions/${var.default_processor_version}"
}
# Least-privilege: identities allowed to call processors at runtime.
# This is a project-level grant, so it covers every processor in the project.
resource "google_project_iam_member" "user" {
for_each = local.user_bindings
project = var.project_id
role = "roles/documentai.apiUser"
member = each.value
}
# Optional: identities allowed to administer the processor (create versions, etc.).
resource "google_project_iam_member" "admin" {
for_each = toset(var.processor_admin_members)
project = var.project_id
role = "roles/documentai.editor"
member = each.value
}
variables.tf
variable "project_id" {
description = "GCP project ID that owns the Document AI processor."
type = string
}
variable "location" {
description = "Document AI multi-region or region. Immutable after creation."
type = string
default = "us"
validation {
condition = contains(["us", "eu"], var.location) || can(regex("^[a-z]+-[a-z]+[0-9]+$", var.location))
error_message = "location must be 'us', 'eu', or a region like 'us-central1' / 'europe-west4'."
}
}
variable "display_name" {
description = "Human-readable processor name shown in the console."
type = string
validation {
condition = length(var.display_name) >= 1 && length(var.display_name) <= 64
error_message = "display_name must be between 1 and 64 characters."
}
}
variable "processor_type" {
description = "Processor type, e.g. INVOICE_PROCESSOR, FORM_PARSER_PROCESSOR, OCR_PROCESSOR, CUSTOM_EXTRACTION_PROCESSOR. Call the Document AI fetchProcessorTypes REST method for the valid set in your location."
type = string
validation {
condition = can(regex("^[A-Z][A-Z0-9_]+$", var.processor_type))
error_message = "processor_type must be an upper-snake-case identifier such as INVOICE_PROCESSOR."
}
}
variable "kms_key_name" {
description = "Optional Cloud KMS CryptoKey resource ID for CMEK at rest. Must live in the same location as the processor. Null = Google-managed encryption."
type = string
default = null
validation {
condition = var.kms_key_name == null || can(regex("^projects/.+/locations/.+/keyRings/.+/cryptoKeys/.+$", var.kms_key_name))
error_message = "kms_key_name must be a full CryptoKey resource ID: projects/.../locations/.../keyRings/.../cryptoKeys/..."
}
}
variable "default_processor_version" {
description = "Optional processor version ID (the short ID, e.g. 'pretrained-invoice-v1.3-2022-07-15') to pin as the default. Null lets Google manage the default version."
type = string
default = null
}
variable "processor_user_members" {
description = "IAM members granted roles/documentai.apiUser (runtime invoke), e.g. [\"serviceAccount:invoke-sa@proj.iam.gserviceaccount.com\"]."
type = list(string)
default = []
}
variable "processor_admin_members" {
description = "IAM members granted roles/documentai.editor (manage processor & versions)."
type = list(string)
default = []
}
outputs.tf
output "id" {
description = "Fully-qualified processor resource ID: projects/{project}/locations/{location}/processors/{processor}."
value = google_document_ai_processor.this.id
}
output "name" {
description = "Server-assigned processor name (the short processor ID segment)."
value = google_document_ai_processor.this.name
}
output "display_name" {
description = "Human-readable display name of the processor."
value = google_document_ai_processor.this.display_name
}
output "location" {
description = "Region/multi-region the processor lives in (immutable)."
value = google_document_ai_processor.this.location
}
output "process_endpoint" {
description = "Regional REST endpoint for online processing of this processor."
value = "https://${google_document_ai_processor.this.location}-documentai.googleapis.com/v1/${google_document_ai_processor.this.id}:process"
}
output "default_version_pinned" {
description = "Whether an explicit default processor version is pinned by this module."
value = var.default_processor_version != null
}
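The validation rules and outputs above can be exercised without touching GCP. What follows is a minimal sketch using the native terraform test framework, not part of the published module: the file name tests/validation.tftest.hcl and the sample values are hypothetical, and mock_provider needs Terraform 1.7 or newer, which is above the module's own >= 1.5.0 pin.

```hcl
# tests/validation.tftest.hcl (hypothetical file name)
# mock_provider needs Terraform 1.7+; no real GCP calls are made.
mock_provider "google" {}

# Baseline inputs shared by every run block (placeholder values).
variables {
  project_id     = "example-project"
  display_name   = "validation-test"
  processor_type = "OCR_PROCESSOR"
}

# With no version supplied, the module should report that nothing is pinned.
run "defaults_leave_version_unpinned" {
  command = plan

  assert {
    condition     = output.default_version_pinned == false
    error_message = "No version was supplied, so nothing should be pinned."
  }
}

# Upper-case and underscore are outside the location pattern.
run "rejects_malformed_location" {
  command = plan

  variables {
    location = "US_Central"
  }

  expect_failures = [var.location]
}

# processor_type must be upper snake case.
run "rejects_lowercase_processor_type" {
  command = plan

  variables {
    processor_type = "invoice_processor"
  }

  expect_failures = [var.processor_type]
}
```

Run it with terraform test from the module root. Each expect_failures run passes only if the named variable's validation block rejects the input.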
How to use it
module "document_ai" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"
project_id = "kloudvin-docpipeline-prod"
location = "eu" # keep invoice data in the EU multi-region
display_name = "ap-invoice-parser-prod"
processor_type = "INVOICE_PROCESSOR"
# Pin the model so extracted fields never change without a code review.
default_processor_version = "pretrained-invoice-v1.3-2022-07-15"
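# CMEK at rest. The module requires the key to sit in the same location as the
# processor, so confirm this key location is accepted for an eu processor before applying.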
kms_key_name = "projects/kloudvin-docpipeline-prod/locations/europe-west4/keyRings/docai/cryptoKeys/docai-cmek"
# Runtime function invokes the processor; CI service account administers it.
processor_user_members = ["serviceAccount:invoice-fn@kloudvin-docpipeline-prod.iam.gserviceaccount.com"]
processor_admin_members = ["serviceAccount:terraform-ci@kloudvin-docpipeline-prod.iam.gserviceaccount.com"]
}
# Downstream: a Cloud Function that calls the processor, wired via module outputs.
resource "google_cloudfunctions2_function" "invoice_extractor" {
name = "invoice-extractor"
location = "europe-west4"
project = "kloudvin-docpipeline-prod"
build_config {
runtime = "python312"
entry_point = "extract"
source {
storage_source {
bucket = "kloudvin-fn-source"
object = "invoice-extractor.zip"
}
}
}
service_config {
service_account_email = "invoice-fn@kloudvin-docpipeline-prod.iam.gserviceaccount.com"
environment_variables = {
# The function reads these instead of hardcoding the generated processor ID.
DOCAI_PROCESSOR_ID = module.document_ai.id
DOCAI_ENDPOINT = module.document_ai.process_endpoint
}
}
}
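For the case where the same processor is stamped out across environments, one option is for_each on the module call. This is a sketch under assumptions: the environment map, project IDs, and the module label invoice_parser are hypothetical placeholders, and each environment could just as well live in its own root configuration.

```hcl
# Hypothetical environment map; project IDs and locations are placeholders.
locals {
  environments = {
    dev  = { project_id = "example-docpipeline-dev", location = "us" }
    prod = { project_id = "example-docpipeline-prod", location = "eu" }
  }
}

# One module instance per environment, keyed by environment name.
module "invoice_parser" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"
  for_each = local.environments

  project_id     = each.value.project_id
  location       = each.value.location
  display_name   = "ap-invoice-parser-${each.key}"
  processor_type = "INVOICE_PROCESSOR"
}

# Map of environment name to fully-qualified processor ID.
output "invoice_parser_ids" {
  value = { for env, m in module.invoice_parser : env => m.id }
}
```

Keying the instances by environment name keeps resource addresses stable (module.invoice_parser["prod"]), so adding or removing an environment does not disturb the others.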
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "gcs"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...GCS state bucket + prefix per path...
}
}
2. Module config — live/prod/document_ai/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-gcp-document-ai?ref=v1.0.0"
}
inputs = {
project_id = "..."
display_name = "..."
processor_type = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/document_ai && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
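The root config shown earlier covers only remote state. If the provider is also meant to live in one place, a generate block in the same root file is one way to do it. This is a sketch, and the project and region values are placeholders rather than anything the module prescribes.

```hcl
# live/terragrunt.hcl (addition): write provider.tf into every module's
# working directory so no module has to declare the provider itself.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"  # placeholder
  region  = "us-central1" # placeholder
}
EOF
}
```

The overwrite_terragrunt setting only replaces files that Terragrunt itself generated, which avoids clobbering a hand-written provider.tf.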
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| project_id | string | — | Yes | GCP project ID that owns the processor. |
| location | string | "us" | No | Multi-region (us/eu) or region (e.g. us-central1). Immutable after creation. |
| display_name | string | — | Yes | Human-readable name (1–64 chars) shown in the console. |
| processor_type | string | — | Yes | Processor type, e.g. INVOICE_PROCESSOR, FORM_PARSER_PROCESSOR, OCR_PROCESSOR. |
| kms_key_name | string | null | No | Full Cloud KMS CryptoKey ID for CMEK at rest; must be co-located with the processor. |
| default_processor_version | string | null | No | Short version ID to pin as default; null lets Google manage it. |
| processor_user_members | list(string) | [] | No | Members granted roles/documentai.apiUser (runtime invoke). |
| processor_admin_members | list(string) | [] | No | Members granted roles/documentai.editor (manage processor/versions). |
Outputs
| Name | Description |
|---|---|
| id | Fully-qualified processor resource ID (projects/.../locations/.../processors/...). |
| name | Server-assigned short processor ID segment. |
| display_name | Human-readable display name of the processor. |
| location | Region/multi-region the processor lives in. |
| process_endpoint | Regional REST endpoint for online :process calls. |
| default_version_pinned | Boolean indicating whether an explicit default version is pinned. |
Enterprise scenario
A pan-European insurer ingests roughly 40,000 scanned claim invoices per day. An upload to a regional Cloud Storage bucket fires Eventarc, which triggers a Cloud Run service that calls the INVOICE_PROCESSOR this module provisions. The processor is deployed to the eu multi-region with CMEK, so no document leaves EU jurisdiction or is encrypted with a key the insurer doesn’t control. Because the module pins default_processor_version, the finance team’s downstream reconciliation logic stays stable when Google releases new model versions, and any version bump is a reviewed pull request rather than a silent change. The same Terraform definition is reused verbatim in a staging project against synthetic invoices, giving the team a faithful pre-prod rehearsal of extraction accuracy.
Best practices
- Pin the processor version in production. Leaving the default unpinned means Google can auto-upgrade the model and shift your extracted field values; set default_processor_version and treat bumps as code-reviewed changes.
- Choose location for data residency, and remember it’s permanent. A processor cannot be moved between us, eu, or regions after creation. Migrating means create-new + reprocess, so get this right up front and co-locate the CMEK key in the same location.
- Use CMEK for regulated documents. Supply kms_key_name so invoices, IDs, and contracts are encrypted at rest with a key you rotate and can revoke; grant the Document AI service agent roles/cloudkms.cryptoKeyEncrypterDecrypter on that key.
- Split runtime from admin identities. Give your function/Cloud Run service account only roles/documentai.apiUser (invoke) and reserve roles/documentai.editor for CI; never let the runtime identity create or delete processors.
- Watch cost and quotas: prefer batch over online for volume. batchProcess accepts many documents and larger files per request and avoids the tighter limits on synchronous calls; reserve synchronous :process for low-latency, single-document paths and cap consumers with quota alerts.
- Name processors with environment and purpose. A display_name like ap-invoice-parser-prod makes the console and audit logs self-describing across multiple processors in one project.
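The module as listed does not create the KMS grant mentioned in the CMEK bullet, so it has to be set up alongside it. A minimal sketch follows, reusing the project and key from the usage example. The service agent address format is an assumption to verify in your project's IAM page, and the resource label docai_cmek is hypothetical.

```hcl
# Look up the project number, which the service agent address is built from.
data "google_project" "this" {
  project_id = "kloudvin-docpipeline-prod"
}

# Let the Document AI service agent encrypt and decrypt with the CMEK key.
resource "google_kms_crypto_key_iam_member" "docai_cmek" {
  crypto_key_id = "projects/kloudvin-docpipeline-prod/locations/europe-west4/keyRings/docai/cryptoKeys/docai-cmek"
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"

  # Assumed service agent address format; confirm it before applying.
  member = "serviceAccount:service-${data.google_project.this.number}@gcp-sa-prod-dai-core.iam.gserviceaccount.com"
}
```

Granting the role on the single key, rather than on the key ring or project, keeps the service agent's reach limited to the one key the processor uses.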