Terraform Module: AWS Macie — Automated S3 Data Discovery and PII Classification

Quick take: A reusable Terraform module for hashicorp/aws ~> 5.0 that enables Amazon Macie, schedules sensitive-data discovery jobs over S3, and sets how often Macie publishes findings to EventBridge for security automation. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "macie" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix         = "..."           # Prefix for job and identifier names (2-40 lowercase cha…
  target_bucket_names = ["...", "..."]  # S3 buckets in this account to inspect (≥1).
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Amazon Macie is AWS’s managed data-security service. Once enabled in an account and Region, it continuously inventories your S3 buckets, surfaces buckets that are public, unencrypted, or shared outside the org, and — when you run a classification job — uses managed and custom data identifiers to find sensitive data such as PII, credentials, PHI, and financial records inside objects. The catch is that Macie has a lot of moving parts: the account-level toggle (aws_macie2_account), per-job discovery configuration (aws_macie2_classification_job), custom regex identifiers (aws_macie2_custom_data_identifier), and the publishing configuration that routes findings to Security Hub.

Wrapping all of that in a Terraform module gives you one auditable, version-pinned unit that you can roll out to every account in an Organization identically. Instead of clicking through the console (and forgetting to set the finding-publishing frequency, or leaving auto_enable off for new member accounts), you declare the desired posture once: Macie on, a recurring weekly scan targeting your data-lake buckets, a custom identifier for your internal employee-ID format, and findings flowing to EventBridge. The module also solves the lifecycle ordering problem — aws_macie2_account must exist before any classification job or custom identifier can be created — by expressing those dependencies in HCL rather than in a runbook.

When to use it

Skip it (or use only the account toggle) if you have a single sandbox account with no S3 data worth scanning — Macie bills per bucket evaluated and per GB inspected, so a full module deployment there is wasted spend.
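For the "account toggle only" case, one option is to declare the account resource directly instead of calling the module, because the module as written requires at least one entry in target_bucket_names even when create_classification_job is false. The sketch below is an illustration under that assumption; the resource label "sandbox" and the SIX_HOURS frequency are arbitrary choices, not part of the module.

```hcl
# Account-level toggle only: no classification job, no custom identifier.
# "sandbox" is a hypothetical label; pick whatever fits your naming scheme.
resource "aws_macie2_account" "sandbox" {
  status                       = "ENABLED"
  finding_publishing_frequency = "SIX_HOURS" # least frequent publishing option
}
```

This keeps Macie's bucket inventory available in the sandbox without scheduling any per-GB inspection work.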

Module structure

terraform-module-aws-macie/
├── versions.tf      # provider + Terraform version constraints
├── main.tf          # aws_macie2_account + classification job + custom identifier
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # account id, job id/arn, identifier id

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # A one-time classification job has no schedule; recurring jobs require a
  # schedule_frequency block. We normalise that here so the resource stays clean.
  is_recurring = var.job_type == "SCHEDULED"

  name_prefix = "${var.name_prefix}-macie"
}

# 1. Account-level enablement. Everything else depends on this existing first.
resource "aws_macie2_account" "this" {
  status                       = var.account_status
  finding_publishing_frequency = var.finding_publishing_frequency
}

# 2. Optional custom data identifier — your org-specific regex (e.g. employee IDs).
resource "aws_macie2_custom_data_identifier" "this" {
  count = var.custom_data_identifier_regex == null ? 0 : 1

  name                   = "${local.name_prefix}-${var.custom_data_identifier_name}"
  description            = var.custom_data_identifier_description
  regex                  = var.custom_data_identifier_regex
  keywords               = var.custom_data_identifier_keywords
  ignore_words           = var.custom_data_identifier_ignore_words
  maximum_match_distance = var.custom_data_identifier_match_distance

  # The identifier is meaningless until Macie is on for the account.
  depends_on = [aws_macie2_account.this]
}

# 3. Sensitive-data discovery job over the target S3 buckets.
resource "aws_macie2_classification_job" "this" {
  count = var.create_classification_job ? 1 : 0

  name        = "${local.name_prefix}-${var.job_name}"
  description = var.job_description
  job_type    = var.job_type

  # Hook in the custom identifier we created above, plus any pre-existing ones.
  custom_data_identifier_ids = compact(concat(
    aws_macie2_custom_data_identifier.this[*].id,
    var.additional_custom_data_identifier_ids,
  ))

  sampling_percentage = var.sampling_percentage

  s3_job_definition {
    bucket_definitions {
      account_id = data.aws_caller_identity.current.account_id
      buckets    = var.target_bucket_names
    }
  }

  # Only emit a schedule when this is a recurring job.
  dynamic "schedule_frequency" {
    for_each = local.is_recurring ? [1] : []
    content {
      daily_schedule   = var.schedule_unit == "DAILY" ? true : null
      weekly_schedule  = var.schedule_unit == "WEEKLY" ? var.weekly_schedule_day : null
      monthly_schedule = var.schedule_unit == "MONTHLY" ? var.monthly_schedule_day : null
    }
  }

  tags = var.tags

  # Once a job is created its definition is immutable; let TF replace it cleanly.
  lifecycle {
    create_before_destroy = true
  }

  depends_on = [aws_macie2_account.this]
}

data "aws_caller_identity" "current" {}

variables.tf

variable "name_prefix" {
  description = "Prefix applied to Macie job and custom-identifier names (e.g. team or environment)."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9-]{2,40}$", var.name_prefix))
    error_message = "name_prefix must be 2-40 chars, lowercase letters, digits, or hyphens."
  }
}

variable "account_status" {
  description = "Whether Macie is ENABLED or PAUSED for the account."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "PAUSED"], var.account_status)
    error_message = "account_status must be either ENABLED or PAUSED."
  }
}

variable "finding_publishing_frequency" {
  description = "How often Macie publishes updated findings to EventBridge / Security Hub."
  type        = string
  default     = "FIFTEEN_MINUTES"

  validation {
    condition = contains(
      ["FIFTEEN_MINUTES", "ONE_HOUR", "SIX_HOURS"],
      var.finding_publishing_frequency
    )
    error_message = "Must be FIFTEEN_MINUTES, ONE_HOUR, or SIX_HOURS."
  }
}

# ---- Custom data identifier (all optional) ----

variable "custom_data_identifier_regex" {
  description = "Regex for an org-specific sensitive token. Set to null to skip creating an identifier."
  type        = string
  default     = null
}

variable "custom_data_identifier_name" {
  description = "Short name suffix for the custom data identifier."
  type        = string
  default     = "custom-id"
}

variable "custom_data_identifier_description" {
  description = "Human-readable description of what the custom identifier detects."
  type        = string
  default     = "Organisation-specific sensitive data identifier managed by Terraform."
}

variable "custom_data_identifier_keywords" {
  description = "Keywords that must appear near a regex match for it to count (proximity filter)."
  type        = list(string)
  default     = []
}

variable "custom_data_identifier_ignore_words" {
  description = "Words that, if found in a match, cause Macie to ignore it (reduces false positives)."
  type        = list(string)
  default     = []
}

variable "custom_data_identifier_match_distance" {
  description = "Max characters between a keyword and the regex match (1-300)."
  type        = number
  default     = 50

  validation {
    condition     = var.custom_data_identifier_match_distance >= 1 && var.custom_data_identifier_match_distance <= 300
    error_message = "maximum_match_distance must be between 1 and 300."
  }
}

# ---- Classification job ----

variable "create_classification_job" {
  description = "Whether to create a sensitive-data discovery job."
  type        = bool
  default     = true
}

variable "job_name" {
  description = "Short name suffix for the classification job."
  type        = string
  default     = "discovery"
}

variable "job_description" {
  description = "Description of the classification job."
  type        = string
  default     = "Recurring S3 sensitive-data discovery managed by Terraform."
}

variable "job_type" {
  description = "ONE_TIME for a single run, SCHEDULED for recurring."
  type        = string
  default     = "SCHEDULED"

  validation {
    condition     = contains(["ONE_TIME", "SCHEDULED"], var.job_type)
    error_message = "job_type must be ONE_TIME or SCHEDULED."
  }
}

variable "target_bucket_names" {
  description = "S3 bucket names (in this account) for the job to inspect."
  type        = list(string)

  validation {
    condition     = length(var.target_bucket_names) > 0
    error_message = "Provide at least one target bucket name."
  }
}

variable "sampling_percentage" {
  description = "Percentage of eligible objects to analyse (1-100). Lower values reduce cost."
  type        = number
  default     = 100

  validation {
    condition     = var.sampling_percentage >= 1 && var.sampling_percentage <= 100
    error_message = "sampling_percentage must be between 1 and 100."
  }
}

variable "schedule_unit" {
  description = "Cadence for a SCHEDULED job: DAILY, WEEKLY, or MONTHLY."
  type        = string
  default     = "WEEKLY"

  validation {
    condition     = contains(["DAILY", "WEEKLY", "MONTHLY"], var.schedule_unit)
    error_message = "schedule_unit must be DAILY, WEEKLY, or MONTHLY."
  }
}

variable "weekly_schedule_day" {
  description = "Day of week for a WEEKLY job (e.g. MONDAY). Ignored otherwise."
  type        = string
  default     = "MONDAY"

  validation {
    condition = contains(
      ["SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY"],
      var.weekly_schedule_day
    )
    error_message = "weekly_schedule_day must be a valid uppercase day name."
  }
}

variable "monthly_schedule_day" {
  description = "Day of month for a MONTHLY job (1-31). Ignored otherwise."
  type        = number
  default     = 1

  validation {
    condition     = var.monthly_schedule_day >= 1 && var.monthly_schedule_day <= 31
    error_message = "monthly_schedule_day must be between 1 and 31."
  }
}

variable "additional_custom_data_identifier_ids" {
  description = "IDs of pre-existing custom data identifiers to attach to the job."
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Tags applied to the classification job."
  type        = map(string)
  default     = {}
}
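The validation blocks above can be exercised with Terraform's native test framework. This is a minimal sketch, not part of the module as published: it assumes Terraform 1.7 or later (for mock_provider; the module itself only requires 1.5.0), and the file name tests/validation.tftest.hcl, the run labels, and the bucket name are hypothetical.

```hcl
# tests/validation.tftest.hcl (hypothetical path inside the module directory)

# Mock the AWS provider so the plan needs no credentials or real account.
mock_provider "aws" {}

# Shared inputs for every run block; individual runs may override them.
variables {
  name_prefix         = "platform-prod"
  target_bucket_names = ["example-bucket"]
}

run "accepts_valid_prefix" {
  command = plan

  # main.tf builds the job name as "<name_prefix>-macie-<job_name>".
  assert {
    condition     = aws_macie2_classification_job.this[0].name == "platform-prod-macie-discovery"
    error_message = "Job name should combine the prefix, the -macie suffix, and the default job_name."
  }
}

run "rejects_uppercase_prefix" {
  command = plan

  variables {
    name_prefix = "Platform_Prod" # uppercase and underscore violate the regex
  }

  # The run passes only if the name_prefix validation fails.
  expect_failures = [var.name_prefix]
}
```

Run it with terraform test from the module directory; similar run blocks can cover sampling_percentage and the schedule variables.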

outputs.tf

output "macie_account_id" {
  description = "The unique identifier (Macie account ID) for the enabled account."
  value       = aws_macie2_account.this.id
}

output "macie_service_role" {
  description = "The service-linked role ARN Macie uses to access resources."
  value       = aws_macie2_account.this.service_role
}

output "finding_publishing_frequency" {
  description = "The effective finding-publishing frequency on the account."
  value       = aws_macie2_account.this.finding_publishing_frequency
}

output "classification_job_id" {
  description = "ID of the classification job (null if not created)."
  value       = try(aws_macie2_classification_job.this[0].id, null)
}

output "classification_job_arn" {
  description = "ARN of the classification job (null if not created)."
  value       = try(aws_macie2_classification_job.this[0].job_arn, null)
}

output "custom_data_identifier_id" {
  description = "ID of the custom data identifier (null if not created)."
  value       = try(aws_macie2_custom_data_identifier.this[0].id, null)
}

How to use it

module "macie" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix                  = "platform-prod"
  account_status               = "ENABLED"
  finding_publishing_frequency = "ONE_HOUR"

  # Recurring weekly scan of the data-lake landing buckets.
  create_classification_job = true
  job_type                  = "SCHEDULED"
  schedule_unit             = "WEEKLY"
  weekly_schedule_day       = "SUNDAY"
  sampling_percentage       = 100

  target_bucket_names = [
    "platform-prod-raw-landing",
    "platform-prod-customer-uploads",
  ]

  # Custom identifier for our internal employee-ID format: "EMP-" + 7 digits.
  custom_data_identifier_regex          = "EMP-[0-9]{7}"
  custom_data_identifier_name           = "employee-id"
  custom_data_identifier_description    = "Internal employee identifier (EMP-#######)."
  custom_data_identifier_keywords       = ["employee", "staff", "badge"]
  custom_data_identifier_match_distance = 30

  tags = {
    Team        = "security"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}

# Downstream: route Macie findings to a security-automation Lambda via EventBridge.
resource "aws_cloudwatch_event_rule" "macie_findings" {
  name        = "capture-macie-findings"
  description = "Forward Macie sensitive-data findings to automation."

  event_pattern = jsonencode({
    source        = ["aws.macie"]
    "detail-type" = ["Macie Finding"]
  })

  # depends_on only orders creation: the rule is created after the module's
  # resources (including the Macie account) exist.
  depends_on = [module.macie]
}

output "data_security_account" {
  description = "Macie account ID for the security inventory dashboard."
  value       = module.macie.macie_account_id
}
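The rule above matches Macie findings but has no target yet. A minimal sketch of wiring it to a Lambda function could look like the following; the variable findings_lambda_arn, the target_id, and the statement_id are hypothetical names introduced for illustration, and the function itself is assumed to exist already.

```hcl
# Hypothetical input: ARN of an existing security-automation Lambda function.
variable "findings_lambda_arn" {
  description = "ARN of the Lambda function that processes Macie findings."
  type        = string
}

# Attach the Lambda function as the target of the rule defined above.
resource "aws_cloudwatch_event_target" "macie_to_lambda" {
  rule      = aws_cloudwatch_event_rule.macie_findings.name
  target_id = "macie-findings-lambda"
  arn       = var.findings_lambda_arn
}

# Allow EventBridge to invoke the function, scoped to this one rule.
resource "aws_lambda_permission" "allow_macie_rule" {
  statement_id  = "AllowMacieFindingsRule"
  action        = "lambda:InvokeFunction"
  function_name = var.findings_lambda_arn
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.macie_findings.arn
}
```

Without the permission resource the rule would match events but the invocation would be denied, so the two resources are best kept together.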

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config, live/prod/macie/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
}

inputs = {
  name_prefix         = "..."
  target_bucket_names = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

# Both commands are run from the repository root.
cd live/prod/macie && terragrunt apply        # this module only
cd live/prod && terragrunt run-all apply      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
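The root config shown in step 1 covers only the backend. One way to also define the provider once, as described above, is a Terragrunt generate block in the same root file. This is a sketch under assumptions: the region is taken from the Quickstart, and the generated file name provider.tf is an arbitrary choice.

```hcl
# live/terragrunt.hcl (addition to the root config)

# Writes provider.tf into every module's working directory before Terraform
# runs, so individual modules do not carry their own provider block.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt" # only replace files Terragrunt generated
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```

If a consuming stack already declares its own provider "aws" block, drop one of the two to avoid a duplicate provider configuration error.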

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| name_prefix | string | (none) | Yes | Prefix for job and identifier names (2-40 lowercase chars/digits/hyphens). |
| account_status | string | "ENABLED" | No | ENABLED or PAUSED for the Macie account. |
| finding_publishing_frequency | string | "FIFTEEN_MINUTES" | No | FIFTEEN_MINUTES, ONE_HOUR, or SIX_HOURS. |
| custom_data_identifier_regex | string | null | No | Regex for an org-specific token; null skips the identifier. |
| custom_data_identifier_name | string | "custom-id" | No | Name suffix for the custom identifier. |
| custom_data_identifier_description | string | (managed text) | No | Description of the custom identifier. |
| custom_data_identifier_keywords | list(string) | [] | No | Proximity keywords required near a match. |
| custom_data_identifier_ignore_words | list(string) | [] | No | Words that suppress a match (false-positive control). |
| custom_data_identifier_match_distance | number | 50 | No | Max chars between keyword and match (1-300). |
| create_classification_job | bool | true | No | Whether to create a discovery job. |
| job_name | string | "discovery" | No | Name suffix for the job. |
| job_description | string | (managed text) | No | Description of the job. |
| job_type | string | "SCHEDULED" | No | ONE_TIME or SCHEDULED. |
| target_bucket_names | list(string) | (none) | Yes | S3 buckets in this account to inspect (≥1). |
| sampling_percentage | number | 100 | No | Percentage of objects to analyse (1-100). |
| schedule_unit | string | "WEEKLY" | No | DAILY, WEEKLY, or MONTHLY for scheduled jobs. |
| weekly_schedule_day | string | "MONDAY" | No | Day of week for a weekly job. |
| monthly_schedule_day | number | 1 | No | Day of month (1-31) for a monthly job. |
| additional_custom_data_identifier_ids | list(string) | [] | No | Pre-existing identifier IDs to attach to the job. |
| tags | map(string) | {} | No | Tags applied to the classification job. |

Outputs

| Name | Description |
| --- | --- |
| macie_account_id | Unique Macie account identifier for the enabled account. |
| macie_service_role | Service-linked role ARN Macie uses to access resources. |
| finding_publishing_frequency | Effective finding-publishing frequency on the account. |
| classification_job_id | ID of the classification job (null if not created). |
| classification_job_arn | ARN of the classification job (null if not created). |
| custom_data_identifier_id | ID of the custom data identifier (null if not created). |

Enterprise scenario

A fintech operating under PCI-DSS runs a data lake where partner banks drop transaction files into per-tenant S3 buckets. The platform team consumes this module from their delegated Macie administrator account, deploying it through the org’s account-vending pipeline so every new tenant account comes up with Macie enabled, ONE_HOUR publishing, and a weekly Sunday-night scan of the landing buckets. They attach a custom identifier matching their internal contract-reference format (which the managed credit-card and PII identifiers miss), and the EventBridge rule shown above feeds findings into a Step Functions workflow that auto-tags any bucket with unencrypted cardholder data and pages the on-call engineer — turning a quarterly manual audit into continuous, evidenced enforcement.
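The hand-off from the EventBridge rule to a Step Functions workflow in this scenario could be sketched roughly as follows. It reuses aws_cloudwatch_event_rule.macie_findings from the usage example; the variable remediation_state_machine_arn, the role name, and the policy name are hypothetical, and the state machine itself (tagging and paging logic) is assumed to be defined elsewhere.

```hcl
# Hypothetical input: ARN of an existing remediation state machine.
variable "remediation_state_machine_arn" {
  description = "ARN of the Step Functions state machine that handles findings."
  type        = string
}

# Trust policy: let EventBridge assume the role used to start executions.
data "aws_iam_policy_document" "events_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "events_to_sfn" {
  name               = "macie-findings-to-sfn"
  assume_role_policy = data.aws_iam_policy_document.events_assume.json
}

# Least privilege: the role may start only this one state machine.
data "aws_iam_policy_document" "start_execution" {
  statement {
    actions   = ["states:StartExecution"]
    resources = [var.remediation_state_machine_arn]
  }
}

resource "aws_iam_role_policy" "events_to_sfn" {
  name   = "start-remediation"
  role   = aws_iam_role.events_to_sfn.id
  policy = data.aws_iam_policy_document.start_execution.json
}

# Each matched Macie finding starts one execution with the event as input.
resource "aws_cloudwatch_event_target" "macie_to_sfn" {
  rule     = aws_cloudwatch_event_rule.macie_findings.name
  arn      = var.remediation_state_machine_arn
  role_arn = aws_iam_role.events_to_sfn.arn
}
```

Unlike a Lambda target, a Step Functions target needs role_arn because EventBridge starts the execution by assuming an IAM role rather than relying on a resource-based permission.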

Best practices
