Quick take — A reusable Terraform module for hashicorp/aws ~> 5.0 that enables Amazon Macie, schedules sensitive-data discovery jobs over S3, and ships findings to EventBridge for security automation. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "macie" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
name_prefix = "..." # Prefix for job and identifier names (2-40 lowercase cha…
target_bucket_names = ["...", "..."] # S3 buckets in this account to inspect (≥1).
}
Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
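As a rough illustration of overriding defaults, the sketch below reuses the Quickstart call and changes a few inputs defined in variables.tf. The specific values are assumptions chosen for illustration, not recommendations baked into the module.

```hcl
module "macie" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  name_prefix         = "..."          # required, as in the Quickstart
  target_bucket_names = ["...", "..."] # required, as in the Quickstart

  # Overrides (illustrative values only):
  finding_publishing_frequency = "ONE_HOUR" # module default is FIFTEEN_MINUTES
  schedule_unit                = "MONTHLY"  # module default is WEEKLY
  monthly_schedule_day         = 15         # only read when schedule_unit is MONTHLY
  sampling_percentage          = 30         # module default is 100
}
```

Any input left out keeps the default declared in variables.tf, so the override list can stay as short as the environment needs.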
What this module is
Amazon Macie is AWS’s managed data-security service. Once enabled in an account and Region, it continuously inventories your S3 buckets, surfaces buckets that are public, unencrypted, or shared outside the org, and — when you run a classification job — uses managed and custom data identifiers to find sensitive data such as PII, credentials, PHI, and financial records inside objects. The catch is that Macie has a lot of moving parts: the account-level toggle (aws_macie2_account), per-job discovery configuration (aws_macie2_classification_job), custom regex identifiers (aws_macie2_custom_data_identifier), and the publishing configuration that routes findings to Security Hub.
Wrapping all of that in a Terraform module gives you one auditable, version-pinned unit that you can roll out to every account in an Organization identically. Instead of clicking through the console (and forgetting to set the finding-publishing frequency, or leaving auto_enable off for new member accounts), you declare the desired posture once: Macie on, a recurring weekly scan targeting your data-lake buckets, a custom identifier for your internal employee-ID format, and findings flowing to EventBridge. The module also solves the lifecycle ordering problem — aws_macie2_account must exist before any classification job or custom identifier can be created — by expressing those dependencies in HCL rather than in a runbook.
When to use it
- Multi-account governance. You manage Macie through a delegated administrator account and need an identical baseline (enabled, correct publishing frequency, standard jobs) replicated across dozens of accounts via Terraform pipelines.
- Compliance mandates. PCI-DSS, HIPAA, GDPR, or SOC 2 require you to demonstrate continuous discovery and classification of regulated data in S3, with an auditable IaC trail of when and how it was configured.
- Data-lake and analytics platforms. You ingest third-party or customer data into S3 and want recurring scans plus custom identifiers for organisation-specific tokens (internal customer IDs, contract numbers) that the managed identifiers don’t catch.
- Security automation. You want findings on EventBridge so a downstream Lambda or Step Functions workflow can auto-quarantine a bucket, open a Jira ticket, or page the on-call team.
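For the multi-account case, one possible way to stamp the same baseline into another account is to pass an aliased provider into the module. This is a sketch under assumptions: the account ID, the role name terraform-deploy, and the bucket name are hypothetical placeholders, and your pipeline may obtain credentials differently.

```hcl
# Hypothetical tenant account and deployment role; substitute your own.
provider "aws" {
  alias  = "tenant_a"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-deploy"
  }
}

module "macie_tenant_a" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"

  # Run every resource in the module against the tenant account.
  providers = {
    aws = aws.tenant_a
  }

  name_prefix         = "tenant-a"
  target_bucket_names = ["tenant-a-landing"] # hypothetical bucket name
}
```

Because the module reads the account ID through its own aws_caller_identity data source, the job's bucket_definitions should follow whichever account the aliased provider resolves to.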
Skip it (or use only the account toggle) if you have a single sandbox account with no S3 data worth scanning — Macie bills per bucket evaluated and per GB inspected, so a full module deployment there is wasted spend.
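If only the account toggle is wanted, a minimal sketch is to declare the account resource directly rather than going through the module. The resource label sandbox and the SIX_HOURS value are illustrative assumptions.

```hcl
# Standalone account-level enablement, without the module.
resource "aws_macie2_account" "sandbox" {
  status                       = "ENABLED"
  finding_publishing_frequency = "SIX_HOURS" # illustrative; the least frequent option
}
```

Calling the module with create_classification_job = false is another option, but note that target_bucket_names has no default and its validation still requires at least one bucket name.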
Module structure
terraform-module-aws-macie/
├── versions.tf # provider + Terraform version constraints
├── main.tf # aws_macie2_account + classification job + custom identifier
├── variables.tf # var-driven inputs with validation
└── outputs.tf # account id, job id/arn, identifier id
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
# A one-time classification job has no schedule; recurring jobs require a
# schedule_frequency block. We normalise that here so the resource stays clean.
is_recurring = var.job_type == "SCHEDULED"
name_prefix = "${var.name_prefix}-macie"
}
# 1. Account-level enablement. Everything else depends on this existing first.
resource "aws_macie2_account" "this" {
status = var.account_status
finding_publishing_frequency = var.finding_publishing_frequency
}
# 2. Optional custom data identifier — your org-specific regex (e.g. employee IDs).
resource "aws_macie2_custom_data_identifier" "this" {
count = var.custom_data_identifier_regex == null ? 0 : 1
name = "${local.name_prefix}-${var.custom_data_identifier_name}"
description = var.custom_data_identifier_description
regex = var.custom_data_identifier_regex
keywords = var.custom_data_identifier_keywords
ignore_words = var.custom_data_identifier_ignore_words
maximum_match_distance = var.custom_data_identifier_match_distance
# The identifier is meaningless until Macie is on for the account.
depends_on = [aws_macie2_account.this]
}
# 3. Sensitive-data discovery job over the target S3 buckets.
resource "aws_macie2_classification_job" "this" {
count = var.create_classification_job ? 1 : 0
name = "${local.name_prefix}-${var.job_name}"
description = var.job_description
job_type = var.job_type
# Hook in the custom identifier we created above, plus any pre-existing ones.
custom_data_identifier_ids = compact(concat(
aws_macie2_custom_data_identifier.this[*].id,
var.additional_custom_data_identifier_ids,
))
sampling_percentage = var.sampling_percentage
s3_job_definition {
bucket_definitions {
account_id = data.aws_caller_identity.current.account_id
buckets = var.target_bucket_names
}
}
# Only emit a schedule when this is a recurring job.
dynamic "schedule_frequency" {
for_each = local.is_recurring ? [1] : []
content {
daily_schedule = var.schedule_unit == "DAILY" ? true : null
weekly_schedule = var.schedule_unit == "WEEKLY" ? var.weekly_schedule_day : null
monthly_schedule = var.schedule_unit == "MONTHLY" ? var.monthly_schedule_day : null
}
}
tags = var.tags
# Once a job is created its definition is immutable; let TF replace it cleanly.
lifecycle {
create_before_destroy = true
}
depends_on = [aws_macie2_account.this]
}
data "aws_caller_identity" "current" {}
variables.tf
variable "name_prefix" {
description = "Prefix applied to Macie job and custom-identifier names (e.g. team or environment)."
type = string
validation {
condition = can(regex("^[a-z0-9-]{2,40}$", var.name_prefix))
error_message = "name_prefix must be 2-40 chars, lowercase letters, digits, or hyphens."
}
}
variable "account_status" {
description = "Whether Macie is ENABLED or PAUSED for the account."
type = string
default = "ENABLED"
validation {
condition = contains(["ENABLED", "PAUSED"], var.account_status)
error_message = "account_status must be either ENABLED or PAUSED."
}
}
variable "finding_publishing_frequency" {
description = "How often Macie publishes updated findings to EventBridge / Security Hub."
type = string
default = "FIFTEEN_MINUTES"
validation {
condition = contains(
["FIFTEEN_MINUTES", "ONE_HOUR", "SIX_HOURS"],
var.finding_publishing_frequency
)
error_message = "Must be FIFTEEN_MINUTES, ONE_HOUR, or SIX_HOURS."
}
}
# ---- Custom data identifier (all optional) ----
variable "custom_data_identifier_regex" {
description = "Regex for an org-specific sensitive token. Set to null to skip creating an identifier."
type = string
default = null
}
variable "custom_data_identifier_name" {
description = "Short name suffix for the custom data identifier."
type = string
default = "custom-id"
}
variable "custom_data_identifier_description" {
description = "Human-readable description of what the custom identifier detects."
type = string
default = "Organisation-specific sensitive data identifier managed by Terraform."
}
variable "custom_data_identifier_keywords" {
description = "Keywords that must appear near a regex match for it to count (proximity filter)."
type = list(string)
default = []
}
variable "custom_data_identifier_ignore_words" {
description = "Words that, if found in a match, cause Macie to ignore it (reduces false positives)."
type = list(string)
default = []
}
variable "custom_data_identifier_match_distance" {
description = "Max characters between a keyword and the regex match (1-300)."
type = number
default = 50
validation {
condition = var.custom_data_identifier_match_distance >= 1 && var.custom_data_identifier_match_distance <= 300
error_message = "maximum_match_distance must be between 1 and 300."
}
}
# ---- Classification job ----
variable "create_classification_job" {
description = "Whether to create a sensitive-data discovery job."
type = bool
default = true
}
variable "job_name" {
description = "Short name suffix for the classification job."
type = string
default = "discovery"
}
variable "job_description" {
description = "Description of the classification job."
type = string
default = "Recurring S3 sensitive-data discovery managed by Terraform."
}
variable "job_type" {
description = "ONE_TIME for a single run, SCHEDULED for recurring."
type = string
default = "SCHEDULED"
validation {
condition = contains(["ONE_TIME", "SCHEDULED"], var.job_type)
error_message = "job_type must be ONE_TIME or SCHEDULED."
}
}
variable "target_bucket_names" {
description = "S3 bucket names (in this account) for the job to inspect."
type = list(string)
validation {
condition = length(var.target_bucket_names) > 0
error_message = "Provide at least one target bucket name."
}
}
variable "sampling_percentage" {
description = "Percentage of eligible objects to analyse (1-100). Lower values reduce cost."
type = number
default = 100
validation {
condition = var.sampling_percentage >= 1 && var.sampling_percentage <= 100
error_message = "sampling_percentage must be between 1 and 100."
}
}
variable "schedule_unit" {
description = "Cadence for a SCHEDULED job: DAILY, WEEKLY, or MONTHLY."
type = string
default = "WEEKLY"
validation {
condition = contains(["DAILY", "WEEKLY", "MONTHLY"], var.schedule_unit)
error_message = "schedule_unit must be DAILY, WEEKLY, or MONTHLY."
}
}
variable "weekly_schedule_day" {
description = "Day of week for a WEEKLY job (e.g. MONDAY). Ignored otherwise."
type = string
default = "MONDAY"
validation {
condition = contains(
["SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY"],
var.weekly_schedule_day
)
error_message = "weekly_schedule_day must be a valid uppercase day name."
}
}
variable "monthly_schedule_day" {
description = "Day of month for a MONTHLY job (1-31). Ignored otherwise."
type = number
default = 1
validation {
condition = var.monthly_schedule_day >= 1 && var.monthly_schedule_day <= 31
error_message = "monthly_schedule_day must be between 1 and 31."
}
}
variable "additional_custom_data_identifier_ids" {
description = "IDs of pre-existing custom data identifiers to attach to the job."
type = list(string)
default = []
}
variable "tags" {
description = "Tags applied to the classification job."
type = map(string)
default = {}
}
outputs.tf
output "macie_account_id" {
description = "The unique identifier (Macie account ID) for the enabled account."
value = aws_macie2_account.this.id
}
output "macie_service_role" {
description = "The service-linked role ARN Macie uses to access resources."
value = aws_macie2_account.this.service_role
}
output "finding_publishing_frequency" {
description = "The effective finding-publishing frequency on the account."
value = aws_macie2_account.this.finding_publishing_frequency
}
output "classification_job_id" {
description = "ID of the classification job (null if not created)."
value = try(aws_macie2_classification_job.this[0].id, null)
}
output "classification_job_arn" {
description = "ARN of the classification job (null if not created)."
value = try(aws_macie2_classification_job.this[0].job_arn, null)
}
output "custom_data_identifier_id" {
description = "ID of the custom data identifier (null if not created)."
value = try(aws_macie2_custom_data_identifier.this[0].id, null)
}
How to use it
module "macie" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
name_prefix = "platform-prod"
account_status = "ENABLED"
finding_publishing_frequency = "ONE_HOUR"
# Recurring weekly scan of the data-lake landing buckets.
create_classification_job = true
job_type = "SCHEDULED"
schedule_unit = "WEEKLY"
weekly_schedule_day = "SUNDAY"
sampling_percentage = 100
target_bucket_names = [
"platform-prod-raw-landing",
"platform-prod-customer-uploads",
]
# Custom identifier for our internal employee-ID format: "EMP-" + 7 digits.
custom_data_identifier_regex = "EMP-[0-9]{7}"
custom_data_identifier_name = "employee-id"
custom_data_identifier_description = "Internal employee identifier (EMP-#######)."
custom_data_identifier_keywords = ["employee", "staff", "badge"]
custom_data_identifier_match_distance = 30
tags = {
Team = "security"
Environment = "prod"
ManagedBy = "terraform"
}
}
# Downstream: route Macie findings to a security-automation Lambda via EventBridge.
resource "aws_cloudwatch_event_rule" "macie_findings" {
name = "capture-macie-findings"
description = "Forward Macie sensitive-data findings to automation."
event_pattern = jsonencode({
source = ["aws.macie"]
"detail-type" = ["Macie Finding"]
})
# Orders the rule after the module, so it is created only once Macie is enabled.
depends_on = [module.macie]
}
output "data_security_account" {
description = "Macie account ID for the security inventory dashboard."
value = module.macie.macie_account_id
}
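The rule above matches findings but does not deliver them anywhere until it has a target. A minimal sketch of wiring it to a Lambda function could look like the following; the variable automation_lambda_arn and the target and statement IDs are hypothetical names introduced for illustration, and the function itself is assumed to exist already.

```hcl
# Hypothetical input: ARN of an existing security-automation Lambda.
variable "automation_lambda_arn" {
  description = "ARN of the Lambda function that processes Macie findings."
  type        = string
}

# Deliver every event matched by the rule to the Lambda function.
resource "aws_cloudwatch_event_target" "macie_to_lambda" {
  rule      = aws_cloudwatch_event_rule.macie_findings.name
  target_id = "macie-automation"
  arn       = var.automation_lambda_arn
}

# Allow EventBridge to invoke the function, scoped to this one rule.
resource "aws_lambda_permission" "allow_macie_rule" {
  statement_id  = "AllowMacieFindingsRule"
  action        = "lambda:InvokeFunction"
  function_name = var.automation_lambda_arn
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.macie_findings.arn
}
```

Scoping the permission with source_arn keeps other rules in the account from invoking the same function by accident.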
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...S3 state bucket + key per path...
}
}
2. Module config — live/prod/macie/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
}
inputs = {
name_prefix = "..."
target_bucket_names = ["...", "..."]
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/macie && terragrunt apply # this module
cd live/prod && terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
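To make the cross-module dependency point concrete, here is a sketch of a variant of live/prod/macie/terragrunt.hcl that takes its bucket list from another module. The sibling path ../data-lake and its landing_bucket_names output are assumptions invented for this example; nothing in the module requires them.

```hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-macie?ref=v1.0.0"
}

# Hypothetical sibling module at live/prod/data-lake that exposes a
# "landing_bucket_names" output (list of strings).
dependency "data_lake" {
  config_path = "../data-lake"
}

inputs = {
  name_prefix         = "..."
  target_bucket_names = dependency.data_lake.outputs.landing_bucket_names
}
```

With a dependency block in place, run-all can apply the data-lake module before this one, so the bucket names exist by the time the classification job is created.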
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | n/a | Yes | Prefix for job and identifier names (2-40 lowercase chars/digits/hyphens). |
| account_status | string | "ENABLED" | No | ENABLED or PAUSED for the Macie account. |
| finding_publishing_frequency | string | "FIFTEEN_MINUTES" | No | FIFTEEN_MINUTES, ONE_HOUR, or SIX_HOURS. |
| custom_data_identifier_regex | string | null | No | Regex for an org-specific token; null skips the identifier. |
| custom_data_identifier_name | string | "custom-id" | No | Name suffix for the custom identifier. |
| custom_data_identifier_description | string | (managed text) | No | Description of the custom identifier. |
| custom_data_identifier_keywords | list(string) | [] | No | Proximity keywords required near a match. |
| custom_data_identifier_ignore_words | list(string) | [] | No | Words that suppress a match (false-positive control). |
| custom_data_identifier_match_distance | number | 50 | No | Max chars between keyword and match (1-300). |
| create_classification_job | bool | true | No | Whether to create a discovery job. |
| job_name | string | "discovery" | No | Name suffix for the job. |
| job_description | string | (managed text) | No | Description of the job. |
| job_type | string | "SCHEDULED" | No | ONE_TIME or SCHEDULED. |
| target_bucket_names | list(string) | n/a | Yes | S3 buckets in this account to inspect (≥1). |
| sampling_percentage | number | 100 | No | Percentage of objects to analyse (1-100). |
| schedule_unit | string | "WEEKLY" | No | DAILY, WEEKLY, or MONTHLY for scheduled jobs. |
| weekly_schedule_day | string | "MONDAY" | No | Day of week for a weekly job. |
| monthly_schedule_day | number | 1 | No | Day of month (1-31) for a monthly job. |
| additional_custom_data_identifier_ids | list(string) | [] | No | Pre-existing identifier IDs to attach to the job. |
| tags | map(string) | {} | No | Tags applied to the classification job. |
Outputs
| Name | Description |
|---|---|
| macie_account_id | Unique Macie account identifier for the enabled account. |
| macie_service_role | Service-linked role ARN Macie uses to access resources. |
| finding_publishing_frequency | Effective finding-publishing frequency on the account. |
| classification_job_id | ID of the classification job (null if not created). |
| classification_job_arn | ARN of the classification job (null if not created). |
| custom_data_identifier_id | ID of the custom data identifier (null if not created). |
Enterprise scenario
A fintech operating under PCI-DSS runs a data lake where partner banks drop transaction files into per-tenant S3 buckets. The platform team consumes this module from their delegated Macie administrator account, deploying it through the org’s account-vending pipeline so every new tenant account comes up with Macie enabled, ONE_HOUR publishing, and a weekly Sunday-night scan of the landing buckets. They attach a custom identifier matching their internal contract-reference format (which the managed credit-card and PII identifiers miss), and the EventBridge rule shown above feeds findings into a Step Functions workflow that auto-tags any bucket with unencrypted cardholder data and pages the on-call engineer — turning a quarterly manual audit into continuous, evidenced enforcement.
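A workflow like this would usually want a narrower rule than "every Macie finding". The sketch below filters on finding type and severity inside the event detail. The field names and values (type, severity.description, the SensitiveData:S3Object/Financial type, the High label) reflect my understanding of the Macie finding event shape and should be checked against real events in your account; the rule name is hypothetical.

```hcl
resource "aws_cloudwatch_event_rule" "macie_financial_high" {
  name        = "macie-financial-high-severity"
  description = "High-severity Macie findings for financial data in S3 objects."

  event_pattern = jsonencode({
    source        = ["aws.macie"]
    "detail-type" = ["Macie Finding"]
    detail = {
      # Assumed finding type and severity label; verify against real events.
      type     = ["SensitiveData:S3Object/Financial"]
      severity = { description = ["High"] }
    }
  })
}
```

Keeping the broad rule for archival and adding a narrow rule for paging is one way to avoid waking the on-call engineer for low-severity findings.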
Best practices
- Enable at the org level, then scope jobs narrowly. Turn Macie on everywhere via a delegated administrator with auto_enable, but target classification jobs only at buckets that actually hold data: Macie charges per bucket evaluated for inventory and per GB inspected, so blanket “scan everything” jobs get expensive fast.
- Use sampling_percentage and managed data identifiers to control cost. For very large buckets, sampling at 30-50% on a recurring job still surfaces systemic exposure at a fraction of the GB-inspected bill; reserve 100% for high-sensitivity tenant buckets.
- Lean on ignore_words and keywords for custom identifiers. A naive regex (e.g. a 9-digit SSN pattern) generates noise; the proximity keywords and ignore-words reduce false positives so your automation isn’t drowned.
- Route findings off-account. Set the publishing frequency deliberately (ONE_HOUR is a sane default; FIFTEEN_MINUTES multiplies EventBridge/Security Hub volume) and forward findings to a central security account, never leaving them siloed where the workload owner can ignore them.
- Name resources predictably. The name_prefix keeps job and identifier names consistent across accounts (platform-prod-macie-discovery), which matters when you’re querying findings or jobs across an Organization.
- Pin the module and provider. Reference the module by an immutable ?ref=v1.0.0 tag and keep aws ~> 5.0 pinned in versions.tf; Macie’s resource schema (especially schedule_frequency shapes) has shifted across provider minors, and unpinned upgrades can force unexpected job replacement.
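For the off-account routing practice, one possible approach is an EventBridge rule in the workload account whose target is an event bus in the central security account. This is a sketch under assumptions: both variables are hypothetical, the central bus is assumed to have a resource policy that accepts events from this account, and the IAM role is assumed to allow events:PutEvents on that bus.

```hcl
# Hypothetical inputs describing the central security account's bus
# and the role EventBridge assumes to put events on it.
variable "central_event_bus_arn" {
  description = "ARN of the event bus in the central security account."
  type        = string
}

variable "forwarding_role_arn" {
  description = "IAM role EventBridge assumes to deliver events to the central bus."
  type        = string
}

resource "aws_cloudwatch_event_rule" "macie_forward" {
  name        = "forward-macie-findings"
  description = "Forward Macie findings to the central security account."

  event_pattern = jsonencode({
    source        = ["aws.macie"]
    "detail-type" = ["Macie Finding"]
  })
}

# Cross-account delivery: the target is the central bus, not a local resource.
resource "aws_cloudwatch_event_target" "central_bus" {
  rule     = aws_cloudwatch_event_rule.macie_forward.name
  arn      = var.central_event_bus_arn
  role_arn = var.forwarding_role_arn
}
```

Forwarding at the bus level keeps the workload account free of automation logic; the central account decides what to do with each finding.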