Terraform Module: AWS Macie — Automated S3 Data Discovery and PII Classification

Quick take — A reusable Terraform module for hashicorp/aws ~> 5.0 that enables Amazon Macie, schedules sensitive-data discovery jobs over S3, and ships findings to EventBridge for security automation. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "macie" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix         = "..."           # Prefix for job and identifier names (2-40 lowercase cha…
  target_bucket_names = ["...", "..."]  # S3 buckets in this account to inspect (≥1).
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Amazon Macie is AWS’s managed data-security service. Once enabled in an account and Region, it continuously inventories your S3 buckets, surfaces buckets that are public, unencrypted, or shared outside the org, and — when you run a classification job — uses managed and custom data identifiers to find sensitive data such as PII, credentials, PHI, and financial records inside objects. The catch is that Macie has a lot of moving parts: the account-level toggle (aws_macie2_account), per-job discovery configuration (aws_macie2_classification_job), custom regex identifiers (aws_macie2_custom_data_identifier), and the publishing configuration that routes findings to Security Hub.

Wrapping all of that in a Terraform module gives you one auditable, version-pinned unit that you can roll out to every account in an Organization identically. Instead of clicking through the console (and forgetting to set the finding-publishing frequency, or leaving auto_enable off for new member accounts), you declare the desired posture once: Macie on, a recurring weekly scan targeting your data-lake buckets, a custom identifier for your internal employee-ID format, and findings flowing to EventBridge. The module also solves the lifecycle ordering problem — aws_macie2_account must exist before any classification job or custom identifier can be created — by expressing those dependencies in HCL rather than in a runbook.

When to use it

Multi-account governance. You manage Macie through a delegated administrator account and need an identical baseline (enabled, correct publishing frequency, standard jobs) replicated across dozens of accounts via Terraform pipelines.
Compliance mandates. PCI-DSS, HIPAA, GDPR, or SOC 2 require you to demonstrate continuous discovery and classification of regulated data in S3, with an auditable IaC trail of when and how it was configured.
Data-lake and analytics platforms. You ingest third-party or customer data into S3 and want recurring scans plus custom identifiers for organisation-specific tokens (internal customer IDs, contract numbers) that the managed identifiers don’t catch.
Security automation. You want findings on EventBridge so a downstream Lambda or Step Functions workflow can auto-quarantine a bucket, open a Jira ticket, or page the on-call team.

Skip it (or use only the account toggle) if you have a single sandbox account with no S3 data worth scanning — Macie bills per bucket evaluated and per GB inspected, so a full module deployment there is wasted spend.
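
A minimal sketch of the toggle-only case, assuming the v1.0.0 inputs listed in this document. Because target_bucket_names has no default and its validation rejects an empty list, a placeholder name still has to be passed even though no job is created. The name_prefix and placeholder values are illustrative, not part of the module.

```hcl
module "macie_account_only" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix               = "sandbox" # hypothetical prefix
  create_classification_job = false     # account toggle only, no discovery job

  # Still required by the variable's validation; never scanned because no job
  # is created. custom_data_identifier_regex stays null, so no identifier either.
  target_bucket_names = ["placeholder-not-scanned"]
}
```

With these inputs the plan should contain only aws_macie2_account, which keeps the bucket inventory without any per-GB inspection charges.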

Module structure

terraform-module-aws-macie/
├── versions.tf      # provider + Terraform version constraints
├── main.tf          # aws_macie2_account + classification job + custom identifier
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # account id, job id/arn, identifier id

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # A one-time classification job has no schedule; recurring jobs require a
  # schedule_frequency block. We normalise that here so the resource stays clean.
  is_recurring = var.job_type == "SCHEDULED"

  name_prefix = "${var.name_prefix}-macie"
}

# 1. Account-level enablement. Everything else depends on this existing first.
resource "aws_macie2_account" "this" {
  status                       = var.account_status
  finding_publishing_frequency = var.finding_publishing_frequency
}

# 2. Optional custom data identifier — your org-specific regex (e.g. employee IDs).
resource "aws_macie2_custom_data_identifier" "this" {
  count = var.custom_data_identifier_regex == null ? 0 : 1

  name                   = "${local.name_prefix}-${var.custom_data_identifier_name}"
  description            = var.custom_data_identifier_description
  regex                  = var.custom_data_identifier_regex
  keywords               = var.custom_data_identifier_keywords
  ignore_words           = var.custom_data_identifier_ignore_words
  maximum_match_distance = var.custom_data_identifier_match_distance

  # The identifier is meaningless until Macie is on for the account.
  depends_on = [aws_macie2_account.this]
}

# 3. Sensitive-data discovery job over the target S3 buckets.
resource "aws_macie2_classification_job" "this" {
  count = var.create_classification_job ? 1 : 0

  name        = "${local.name_prefix}-${var.job_name}"
  description = var.job_description
  job_type    = var.job_type

  # Hook in the custom identifier we created above, plus any pre-existing ones.
  custom_data_identifier_ids = compact(concat(
    aws_macie2_custom_data_identifier.this[*].id,
    var.additional_custom_data_identifier_ids,
  ))

  sampling_percentage = var.sampling_percentage

  s3_job_definition {
    bucket_definitions {
      account_id = data.aws_caller_identity.current.account_id
      buckets    = var.target_bucket_names
    }
  }

  # Only emit a schedule when this is a recurring job.
  dynamic "schedule_frequency" {
    for_each = local.is_recurring ? [1] : []
    content {
      daily_schedule   = var.schedule_unit == "DAILY" ? true : null
      weekly_schedule  = var.schedule_unit == "WEEKLY" ? var.weekly_schedule_day : null
      monthly_schedule = var.schedule_unit == "MONTHLY" ? var.monthly_schedule_day : null
    }
  }

  tags = var.tags

  # Once a job is created its definition is immutable; let TF replace it cleanly.
  lifecycle {
    create_before_destroy = true
  }

  depends_on = [aws_macie2_account.this]
}

data "aws_caller_identity" "current" {}

variables.tf

variable "name_prefix" {
  description = "Prefix applied to Macie job and custom-identifier names (e.g. team or environment)."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9-]{2,40}$", var.name_prefix))
    error_message = "name_prefix must be 2-40 chars, lowercase letters, digits, or hyphens."
  }
}

variable "account_status" {
  description = "Whether Macie is ENABLED or PAUSED for the account."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "PAUSED"], var.account_status)
    error_message = "account_status must be either ENABLED or PAUSED."
  }
}

variable "finding_publishing_frequency" {
  description = "How often Macie publishes updated findings to EventBridge / Security Hub."
  type        = string
  default     = "FIFTEEN_MINUTES"

  validation {
    condition = contains(
      ["FIFTEEN_MINUTES", "ONE_HOUR", "SIX_HOURS"],
      var.finding_publishing_frequency
    )
    error_message = "Must be FIFTEEN_MINUTES, ONE_HOUR, or SIX_HOURS."
  }
}

# ---- Custom data identifier (all optional) ----

variable "custom_data_identifier_regex" {
  description = "Regex for an org-specific sensitive token. Set to null to skip creating an identifier."
  type        = string
  default     = null
}

variable "custom_data_identifier_name" {
  description = "Short name suffix for the custom data identifier."
  type        = string
  default     = "custom-id"
}

variable "custom_data_identifier_description" {
  description = "Human-readable description of what the custom identifier detects."
  type        = string
  default     = "Organisation-specific sensitive data identifier managed by Terraform."
}

variable "custom_data_identifier_keywords" {
  description = "Keywords that must appear near a regex match for it to count (proximity filter)."
  type        = list(string)
  default     = []
}

variable "custom_data_identifier_ignore_words" {
  description = "Words that, if found in a match, cause Macie to ignore it (reduces false positives)."
  type        = list(string)
  default     = []
}

variable "custom_data_identifier_match_distance" {
  description = "Max characters between a keyword and the regex match (1-300)."
  type        = number
  default     = 50

  validation {
    condition     = var.custom_data_identifier_match_distance >= 1 && var.custom_data_identifier_match_distance <= 300
    error_message = "maximum_match_distance must be between 1 and 300."
  }
}

# ---- Classification job ----

variable "create_classification_job" {
  description = "Whether to create a sensitive-data discovery job."
  type        = bool
  default     = true
}

variable "job_name" {
  description = "Short name suffix for the classification job."
  type        = string
  default     = "discovery"
}

variable "job_description" {
  description = "Description of the classification job."
  type        = string
  default     = "Recurring S3 sensitive-data discovery managed by Terraform."
}

variable "job_type" {
  description = "ONE_TIME for a single run, SCHEDULED for recurring."
  type        = string
  default     = "SCHEDULED"

  validation {
    condition     = contains(["ONE_TIME", "SCHEDULED"], var.job_type)
    error_message = "job_type must be ONE_TIME or SCHEDULED."
  }
}

variable "target_bucket_names" {
  description = "S3 bucket names (in this account) for the job to inspect."
  type        = list(string)

  validation {
    condition     = length(var.target_bucket_names) > 0
    error_message = "Provide at least one target bucket name."
  }
}

variable "sampling_percentage" {
  description = "Percentage of eligible objects to analyse (1-100). Lower values reduce cost."
  type        = number
  default     = 100

  validation {
    condition     = var.sampling_percentage >= 1 && var.sampling_percentage <= 100
    error_message = "sampling_percentage must be between 1 and 100."
  }
}

variable "schedule_unit" {
  description = "Cadence for a SCHEDULED job: DAILY, WEEKLY, or MONTHLY."
  type        = string
  default     = "WEEKLY"

  validation {
    condition     = contains(["DAILY", "WEEKLY", "MONTHLY"], var.schedule_unit)
    error_message = "schedule_unit must be DAILY, WEEKLY, or MONTHLY."
  }
}

variable "weekly_schedule_day" {
  description = "Day of week for a WEEKLY job (e.g. MONDAY). Ignored otherwise."
  type        = string
  default     = "MONDAY"

  validation {
    condition = contains(
      ["SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY"],
      var.weekly_schedule_day
    )
    error_message = "weekly_schedule_day must be a valid uppercase day name."
  }
}

variable "monthly_schedule_day" {
  description = "Day of month for a MONTHLY job (1-31). Ignored otherwise."
  type        = number
  default     = 1

  validation {
    condition     = var.monthly_schedule_day >= 1 && var.monthly_schedule_day <= 31
    error_message = "monthly_schedule_day must be between 1 and 31."
  }
}

variable "additional_custom_data_identifier_ids" {
  description = "IDs of pre-existing custom data identifiers to attach to the job."
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Tags applied to the classification job."
  type        = map(string)
  default     = {}
}

outputs.tf

output "macie_account_id" {
  description = "The unique identifier (Macie account ID) for the enabled account."
  value       = aws_macie2_account.this.id
}

output "macie_service_role" {
  description = "The service-linked role ARN Macie uses to access resources."
  value       = aws_macie2_account.this.service_role
}

output "finding_publishing_frequency" {
  description = "The effective finding-publishing frequency on the account."
  value       = aws_macie2_account.this.finding_publishing_frequency
}

output "classification_job_id" {
  description = "ID of the classification job (null if not created)."
  value       = try(aws_macie2_classification_job.this[0].id, null)
}

output "classification_job_arn" {
  description = "ARN of the classification job (null if not created)."
  value       = try(aws_macie2_classification_job.this[0].job_arn, null)
}

output "custom_data_identifier_id" {
  description = "ID of the custom data identifier (null if not created)."
  value       = try(aws_macie2_custom_data_identifier.this[0].id, null)
}

How to use it

module "macie" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix                  = "platform-prod"
  account_status               = "ENABLED"
  finding_publishing_frequency = "ONE_HOUR"

  # Recurring weekly scan of the data-lake landing buckets.
  create_classification_job = true
  job_type                  = "SCHEDULED"
  schedule_unit             = "WEEKLY"
  weekly_schedule_day       = "SUNDAY"
  sampling_percentage       = 100

  target_bucket_names = [
    "platform-prod-raw-landing",
    "platform-prod-customer-uploads",
  ]

  # Custom identifier for our internal employee-ID format: "EMP-" + 7 digits.
  custom_data_identifier_regex          = "EMP-[0-9]{7}"
  custom_data_identifier_name           = "employee-id"
  custom_data_identifier_description    = "Internal employee identifier (EMP-#######)."
  custom_data_identifier_keywords       = ["employee", "staff", "badge"]
  custom_data_identifier_match_distance = 30

  tags = {
    Team        = "security"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}

# Downstream: route Macie findings to a security-automation Lambda via EventBridge.
resource "aws_cloudwatch_event_rule" "macie_findings" {
  name        = "capture-macie-findings"
  description = "Forward Macie sensitive-data findings to automation."

  event_pattern = jsonencode({
    source        = ["aws.macie"]
    "detail-type" = ["Macie Finding"]
  })

  # Orders the rule after every resource in the module (including the Macie
  # account toggle); it does not by itself verify that findings are flowing.
  depends_on = [module.macie]
}

output "data_security_account" {
  description = "Macie account ID for the security inventory dashboard."
  value       = module.macie.macie_account_id
}
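
The rule above has no target yet, so nothing receives the events. One way to finish the wiring is sketched below, assuming an existing Lambda function whose ARN is supplied through a hypothetical automation_lambda_arn variable. The target_id and statement_id strings are likewise illustrative.

```hcl
# Hypothetical input: ARN of the security-automation Lambda (not part of the module).
variable "automation_lambda_arn" {
  description = "ARN of the Lambda function that handles Macie findings."
  type        = string
}

# Deliver every event matched by the rule to the Lambda function.
resource "aws_cloudwatch_event_target" "macie_to_lambda" {
  rule      = aws_cloudwatch_event_rule.macie_findings.name
  target_id = "macie-automation-lambda"
  arn       = var.automation_lambda_arn
}

# Allow EventBridge to invoke the function, scoped to this rule only.
resource "aws_lambda_permission" "allow_macie_rule" {
  statement_id  = "AllowMacieFindingsRule"
  action        = "lambda:InvokeFunction"
  function_name = var.automation_lambda_arn
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.macie_findings.arn
}
```

Scoping source_arn to the rule keeps other EventBridge rules in the account from invoking the same function.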

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}

2. Module config — live/prod/macie/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
}

inputs = {
  name_prefix         = "..."
  target_bucket_names = ["...", "..."]
}

3. Deploy this module alone, or roll out every module in the environment together (each command is run from the repository root):

cd live/prod/macie && terragrunt apply        # this module
cd live/prod && terragrunt run-all apply      # every module under live/prod
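
A second environment reuses the same module source and changes only the inputs. The sketch below assumes a live/dev/macie/terragrunt.hcl alongside the prod one; the prefix, bucket name, and cost-oriented overrides are hypothetical values chosen for illustration.

```hcl
# live/dev/macie/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
}

inputs = {
  name_prefix         = "platform-dev"
  target_bucket_names = ["platform-dev-raw-landing"]

  # Cheaper settings for a non-production environment.
  sampling_percentage          = 25
  finding_publishing_frequency = "SIX_HOURS"
}
```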

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
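
The root config shown in step 1 declares only remote_state. If the provider is meant to live there as well, Terragrunt's generate block is one way to do it. The sketch below would sit in live/terragrunt.hcl next to remote_state; the region is copied from the Quickstart and is an assumption for any other setup.

```hcl
# Writes provider.tf into each module's working directory before Terraform runs.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

The module itself declares no provider block, so every environment that includes the root config picks up this one.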

Inputs

Name	Type	Default	Required	Description
`name_prefix`	`string`	—	Yes	Prefix for job and identifier names (2-40 lowercase chars/digits/hyphens).
`account_status`	`string`	`"ENABLED"`	No	`ENABLED` or `PAUSED` for the Macie account.
`finding_publishing_frequency`	`string`	`"FIFTEEN_MINUTES"`	No	`FIFTEEN_MINUTES`, `ONE_HOUR`, or `SIX_HOURS`.
`custom_data_identifier_regex`	`string`	`null`	No	Regex for an org-specific token; `null` skips the identifier.
`custom_data_identifier_name`	`string`	`"custom-id"`	No	Name suffix for the custom identifier.
`custom_data_identifier_description`	`string`	(managed text)	No	Description of the custom identifier.
`custom_data_identifier_keywords`	`list(string)`	`[]`	No	Proximity keywords required near a match.
`custom_data_identifier_ignore_words`	`list(string)`	`[]`	No	Words that suppress a match (false-positive control).
`custom_data_identifier_match_distance`	`number`	`50`	No	Max chars between keyword and match (1-300).
`create_classification_job`	`bool`	`true`	No	Whether to create a discovery job.
`job_name`	`string`	`"discovery"`	No	Name suffix for the job.
`job_description`	`string`	(managed text)	No	Description of the job.
`job_type`	`string`	`"SCHEDULED"`	No	`ONE_TIME` or `SCHEDULED`.
`target_bucket_names`	`list(string)`	—	Yes	S3 buckets in this account to inspect (≥1).
`sampling_percentage`	`number`	`100`	No	Percentage of objects to analyse (1-100).
`schedule_unit`	`string`	`"WEEKLY"`	No	`DAILY`, `WEEKLY`, or `MONTHLY` for scheduled jobs.
`weekly_schedule_day`	`string`	`"MONDAY"`	No	Day of week for a weekly job.
`monthly_schedule_day`	`number`	`1`	No	Day of month (1-31) for a monthly job.
`additional_custom_data_identifier_ids`	`list(string)`	`[]`	No	Pre-existing identifier IDs to attach to the job.
`tags`	`map(string)`	`{}`	No	Tags applied to the classification job.
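
As an illustration of how the job inputs combine, here is a hedged sketch of a one-time, lightly sampled baseline scan. The prefix and bucket name are hypothetical. The job_description is overridden because the module default describes a recurring job.

```hcl
module "macie" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix         = "platform-prod"
  target_bucket_names = ["platform-prod-raw-landing"]

  # ONE_TIME jobs get no schedule_frequency block, so schedule_unit,
  # weekly_schedule_day and monthly_schedule_day are ignored here.
  job_type            = "ONE_TIME"
  job_name            = "baseline" # job is named platform-prod-macie-baseline
  job_description     = "One-time baseline scan managed by Terraform."
  sampling_percentage = 10
}
```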

Outputs

Name	Description
`macie_account_id`	Unique Macie account identifier for the enabled account.
`macie_service_role`	Service-linked role ARN Macie uses to access resources.
`finding_publishing_frequency`	Effective finding-publishing frequency on the account.
`classification_job_id`	ID of the classification job (`null` if not created).
`classification_job_arn`	ARN of the classification job (`null` if not created).
`custom_data_identifier_id`	ID of the custom data identifier (`null` if not created).

Enterprise scenario

A fintech operating under PCI-DSS runs a data lake where partner banks drop transaction files into per-tenant S3 buckets. The platform team consumes this module from their delegated Macie administrator account, deploying it through the org’s account-vending pipeline so every new tenant account comes up with Macie enabled, ONE_HOUR publishing, and a weekly Sunday-night scan of the landing buckets. They attach a custom identifier matching their internal contract-reference format (which the managed credit-card and PII identifiers miss), and the EventBridge rule shown above feeds findings into a Step Functions workflow that auto-tags any bucket with unencrypted cardholder data and pages the on-call engineer — turning a quarterly manual audit into continuous, evidenced enforcement.
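
For a scenario like this, the catch-all rule from the usage example could be narrowed so the workflow only sees findings likely to involve cardholder data. The sketch below assumes Macie's finding events carry the finding type at detail.type, and that such data surfaces under the SensitiveData:S3Object/Financial and SensitiveData:S3Object/Multiple types. Verify both assumptions against a sample event before relying on the pattern. The rule name is hypothetical.

```hcl
resource "aws_cloudwatch_event_rule" "macie_financial_findings" {
  name        = "capture-macie-financial-findings"
  description = "Forward only financial sensitive-data findings to automation."

  event_pattern = jsonencode({
    source        = ["aws.macie"]
    "detail-type" = ["Macie Finding"]
    detail = {
      # Assumed field and values; exact-match on the finding type.
      type = [
        "SensitiveData:S3Object/Financial",
        "SensitiveData:S3Object/Multiple",
      ]
    }
  })
}
```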

Best practices

Enable at the org level, then scope jobs narrowly. Turn Macie on everywhere via a delegated administrator with auto_enable, but target classification jobs only at buckets that actually hold data — Macie charges per bucket evaluated for inventory and per GB inspected, so blanket “scan everything” jobs get expensive fast.
Use sampling_percentage and managed data identifiers to control cost. For very large buckets, sampling at 30-50% on a recurring job still surfaces systemic exposure at a fraction of the GB-inspected bill; reserve 100% for high-sensitivity tenant buckets.
Lean on ignore_words and keywords for custom identifiers. A naive regex (e.g. a 9-digit SSN pattern) generates noise; the proximity keywords and ignore-words reduce false positives so your automation isn’t drowned.
Route findings off-account. Set the publishing frequency deliberately (ONE_HOUR is a sane choice; FIFTEEN_MINUTES, this module's default, multiplies EventBridge/Security Hub volume) and forward findings to a central security account rather than leaving them siloed where the workload owner can ignore them.
Name resources predictably. The name_prefix keeps job and identifier names consistent across accounts (platform-prod-macie-discovery), which matters when you’re querying findings or jobs across an Organization.
Pin the module and provider. Reference the module by an immutable ?ref=v1.0.0 tag and keep the aws ~> 5.0 constraint in versions.tf. That constraint still admits every 5.x minor, so commit .terraform.lock.hcl to hold an exact provider version; Macie’s resource schema (especially schedule_frequency shapes) has shifted across provider minors, and unpinned upgrades can force unexpected job replacement.
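
Several of these practices can be expressed in a single module call. The following is a sketch only: the prefix, bucket name, regex, keywords, and ignore-words are hypothetical examples of tuning a noisy 9-digit pattern, not values recommended by the module.

```hcl
module "macie" {
  # Immutable tag, per the pinning advice above.
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix         = "analytics-prod"
  target_bucket_names = ["analytics-prod-exports"]

  # Cost control: sample a large bucket and publish less often than the default.
  sampling_percentage          = 30
  finding_publishing_frequency = "ONE_HOUR"

  # Noise control: a bare 9-digit regex only counts when a keyword is nearby,
  # and obvious test values are ignored.
  custom_data_identifier_regex          = "[0-9]{9}"
  custom_data_identifier_name           = "ssn-like"
  custom_data_identifier_keywords       = ["ssn", "social security", "taxpayer"]
  custom_data_identifier_ignore_words   = ["000000000", "123456789"]
  custom_data_identifier_match_distance = 20
}
```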