IaC AWS

Terraform Module: AWS Athena — governed, cost-capped serverless SQL workgroups

Quick take — Provision AWS Athena workgroups with Terraform: enforced S3 result locations, KMS-encrypted output, per-query data-scan limits, CloudWatch metrics, and Athena engine v3 — a reusable hashicorp/aws ~> 5.0 module. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "athena" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-athena?ref=v1.0.0"

  name = "..."  # Workgroup name; also derives the managed results bucket…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Amazon Athena is a serverless, interactive query service that runs ANSI SQL directly over data sitting in Amazon S3 (and, via connectors, other sources) without you provisioning a single cluster. You pay per terabyte of data scanned, which makes Athena cheap to start with and surprisingly easy to turn into a runaway bill — one analyst running SELECT * against an un-partitioned, uncompressed table can scan terabytes in a single click.

The control plane for all of this is the workgroup. A workgroup is where you pin the Athena engine version, force every query to write results to a specific encrypted S3 location, set a hard ceiling on bytes scanned per query, publish CloudWatch metrics, and decide whether client-side settings can override your server-side configuration. Get the workgroup right and Athena is governed and predictable; leave it as primary with defaults and you have no guardrails at all.

This module wraps aws_athena_workgroup (with an optional managed results bucket, a named query, and a data catalog) so that every team gets a workgroup that is encrypted, cost-capped, metric-emitting, and impossible to silently misconfigure — instead of each project hand-rolling its own and forgetting the bytes_scanned_cutoff_per_query limit.

When to use it

Module structure

terraform-module-aws-athena/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  # If the caller does not supply an output location, and we are creating a
  # bucket, default results to s3://<bucket>/<prefix>/. Otherwise use the
  # explicit value (which may be null to inherit account-level settings).
  managed_bucket_name = var.create_results_bucket ? "${var.name}-athena-results-${data.aws_caller_identity.current.account_id}" : null

  results_output_location = coalesce(
    var.output_location,
    var.create_results_bucket ? "s3://${local.managed_bucket_name}/${trim(var.results_prefix, "/")}/" : null
  )

  common_tags = merge(
    {
      "Module"    = "terraform-module-aws-athena"
      "Workgroup" = var.name
    },
    var.tags
  )
}

data "aws_caller_identity" "current" {}

# ---------------------------------------------------------------------------
# Optional: managed, encrypted S3 bucket for Athena query results
# ---------------------------------------------------------------------------
resource "aws_s3_bucket" "results" {
  count = var.create_results_bucket ? 1 : 0

  bucket        = local.managed_bucket_name
  force_destroy = var.results_bucket_force_destroy

  tags = local.common_tags
}

resource "aws_s3_bucket_public_access_block" "results" {
  count = var.create_results_bucket ? 1 : 0

  bucket = aws_s3_bucket.results[0].id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "results" {
  count = var.create_results_bucket ? 1 : 0

  bucket = aws_s3_bucket.results[0].id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = var.kms_key_arn == null ? "AES256" : "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = var.kms_key_arn != null
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "results" {
  count = var.create_results_bucket && var.results_expiration_days > 0 ? 1 : 0

  bucket = aws_s3_bucket.results[0].id

  rule {
    id     = "expire-query-results"
    status = "Enabled"

    filter {
      prefix = "${trim(var.results_prefix, "/")}/"
    }

    expiration {
      days = var.results_expiration_days
    }
  }
}

# ---------------------------------------------------------------------------
# Athena workgroup
# ---------------------------------------------------------------------------
resource "aws_athena_workgroup" "this" {
  name        = var.name
  description = var.description
  state       = var.state

  # Prevent Terraform from deleting a workgroup that still has named queries.
  force_destroy = var.force_destroy

  configuration {
    enforce_workgroup_configuration    = var.enforce_workgroup_configuration
    publish_cloudwatch_metrics_enabled = var.publish_cloudwatch_metrics
    bytes_scanned_cutoff_per_query     = var.bytes_scanned_cutoff_per_query
    requester_pays_enabled             = var.requester_pays_enabled

    engine_version {
      selected_engine_version = var.selected_engine_version
    }

    result_configuration {
      output_location       = local.results_output_location
      expected_bucket_owner = var.expected_bucket_owner

      encryption_configuration {
        encryption_option = var.kms_key_arn == null ? "SSE_S3" : "SSE_KMS"
        kms_key           = var.kms_key_arn
      }

      dynamic "acl_configuration" {
        for_each = var.results_bucket_owner_full_control ? [1] : []
        content {
          s3_acl_option = "BUCKET_OWNER_FULL_CONTROL"
        }
      }
    }
  }

  tags = local.common_tags
}

# ---------------------------------------------------------------------------
# Optional: a saved (named) query bound to this workgroup
# ---------------------------------------------------------------------------
resource "aws_athena_named_query" "this" {
  for_each = var.named_queries

  name        = each.key
  description = lookup(each.value, "description", null)
  database    = each.value.database
  query       = each.value.query
  workgroup   = aws_athena_workgroup.this.id
}

# ---------------------------------------------------------------------------
# Optional: a Glue-backed data catalog usable from this workgroup
# ---------------------------------------------------------------------------
resource "aws_athena_data_catalog" "this" {
  count = var.create_glue_data_catalog ? 1 : 0

  name        = "${var.name}-catalog"
  description = "Glue data catalog for the ${var.name} Athena workgroup"
  type        = "GLUE"

  parameters = {
    "catalog-id" = data.aws_caller_identity.current.account_id
  }

  tags = local.common_tags
}

variables.tf

variable "name" {
  description = "Name of the Athena workgroup. Used to derive the managed results bucket and catalog names."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9._-]{1,128}$", var.name))
    error_message = "Workgroup name must be 1-128 chars and only contain letters, numbers, period, underscore, or hyphen."
  }
}

variable "description" {
  description = "Human-readable description of the workgroup."
  type        = string
  default     = "Managed Athena workgroup"
}

variable "state" {
  description = "Whether the workgroup is ENABLED or DISABLED. A DISABLED workgroup rejects new queries."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "DISABLED"], var.state)
    error_message = "state must be either ENABLED or DISABLED."
  }
}

variable "selected_engine_version" {
  description = "Athena engine version to pin. Use \"Athena engine version 3\" for the Trino-based engine, or \"AUTO\" to let AWS choose."
  type        = string
  default     = "Athena engine version 3"
}

variable "enforce_workgroup_configuration" {
  description = "If true, workgroup settings (result location, encryption, scan limit) override any client-side settings. Strongly recommended."
  type        = bool
  default     = true
}

variable "publish_cloudwatch_metrics" {
  description = "Publish per-query CloudWatch metrics (data scanned, query state, run time) under the AmazonAthena/<workgroup> namespace."
  type        = bool
  default     = true
}

variable "bytes_scanned_cutoff_per_query" {
  description = "Hard upper bound on bytes a single query may scan before it is cancelled. Must be >= 10485760 (10 MB) or null to disable. Defaults to 1 TB."
  type        = number
  default     = 1099511627776 # 1 TiB

  validation {
    condition     = var.bytes_scanned_cutoff_per_query == null || var.bytes_scanned_cutoff_per_query >= 10485760
    error_message = "bytes_scanned_cutoff_per_query must be null or at least 10485760 (10 MB)."
  }
}

variable "requester_pays_enabled" {
  description = "If true, queries against requester-pays S3 buckets are allowed (the workgroup pays for requests)."
  type        = bool
  default     = false
}

variable "force_destroy" {
  description = "Allow Terraform to destroy the workgroup even if it contains named queries. Use with care in production."
  type        = bool
  default     = false
}

# --- Query results location / encryption ----------------------------------

variable "output_location" {
  description = "Explicit S3 URI for query results, e.g. s3://my-bucket/athena/. If null and create_results_bucket is true, a managed location is used."
  type        = string
  default     = null

  validation {
    condition     = var.output_location == null || can(regex("^s3://", var.output_location))
    error_message = "output_location must be null or an s3:// URI."
  }
}

variable "kms_key_arn" {
  description = "KMS key ARN to encrypt query results (SSE_KMS). If null, SSE_S3 (AES256) is used instead."
  type        = string
  default     = null
}

variable "expected_bucket_owner" {
  description = "AWS account ID expected to own the results bucket, as a guard against bucket-sniping. Null to skip the check."
  type        = string
  default     = null
}

variable "results_bucket_owner_full_control" {
  description = "If true, write results with the BUCKET_OWNER_FULL_CONTROL canned ACL (useful for cross-account result buckets)."
  type        = bool
  default     = false
}

# --- Managed results bucket -----------------------------------------------

variable "create_results_bucket" {
  description = "Create and manage a dedicated, encrypted S3 bucket for query results."
  type        = bool
  default     = true
}

variable "results_prefix" {
  description = "Key prefix inside the managed results bucket where query output is written."
  type        = string
  default     = "query-results"
}

variable "results_expiration_days" {
  description = "Lifecycle expiration (in days) for objects under results_prefix in the managed bucket. Set 0 to disable expiration."
  type        = number
  default     = 30

  validation {
    condition     = var.results_expiration_days >= 0
    error_message = "results_expiration_days must be >= 0."
  }
}

variable "results_bucket_force_destroy" {
  description = "Allow Terraform to delete the managed results bucket even when it still contains objects."
  type        = bool
  default     = false
}

# --- Optional sub-resources -----------------------------------------------

variable "named_queries" {
  description = "Map of saved named queries to create, keyed by query name. Each value needs database and query; description is optional."
  type = map(object({
    database    = string
    query       = string
    description = optional(string)
  }))
  default = {}
}

variable "create_glue_data_catalog" {
  description = "Register a Glue-backed Athena data catalog (<name>-catalog) for this workgroup."
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags applied to all resources created by this module."
  type        = map(string)
  default     = {}
}

outputs.tf

output "workgroup_id" {
  description = "The ID (name) of the Athena workgroup."
  value       = aws_athena_workgroup.this.id
}

output "workgroup_name" {
  description = "The name of the Athena workgroup, to pass as WorkGroup in StartQueryExecution / the console."
  value       = aws_athena_workgroup.this.name
}

output "workgroup_arn" {
  description = "The ARN of the Athena workgroup, for use in IAM policy conditions."
  value       = aws_athena_workgroup.this.arn
}

output "output_location" {
  description = "Effective S3 URI where query results are written."
  value       = local.results_output_location
}

output "results_bucket_name" {
  description = "Name of the managed results bucket, or null if one was not created."
  value       = try(aws_s3_bucket.results[0].bucket, null)
}

output "results_bucket_arn" {
  description = "ARN of the managed results bucket, or null if one was not created."
  value       = try(aws_s3_bucket.results[0].arn, null)
}

output "named_query_ids" {
  description = "Map of named-query name to its Athena named-query ID."
  value       = { for k, q in aws_athena_named_query.this : k => q.id }
}

output "data_catalog_name" {
  description = "Name of the Glue data catalog created for this workgroup, or null if none."
  value       = try(aws_athena_data_catalog.this[0].name, null)
}

How to use it

module "athena" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-athena?ref=v1.0.0"

  name        = "analytics-prod"
  description = "Production analytics workgroup for the BI team"

  # Pin the modern engine and enforce server-side config on every query.
  selected_engine_version         = "Athena engine version 3"
  enforce_workgroup_configuration = true
  publish_cloudwatch_metrics      = true

  # Hard cost guardrail: cancel any single query that tries to scan > 500 GB.
  bytes_scanned_cutoff_per_query = 536870912000 # 500 GiB

  # Managed, KMS-encrypted results bucket with a 14-day retention.
  create_results_bucket   = true
  kms_key_arn             = aws_kms_key.athena.arn
  results_expiration_days = 14

  named_queries = {
    daily_active_users = {
      database    = "analytics"
      description = "DAU rollup used by the executive dashboard"
      query       = "SELECT event_date, COUNT(DISTINCT user_id) AS dau FROM analytics.events GROUP BY event_date ORDER BY event_date;"
    }
  }

  tags = {
    Team        = "bi"
    Environment = "prod"
    CostCenter  = "data-platform"
  }
}

# Downstream: grant a BI role permission to run queries only in this workgroup,
# using the module's ARN output in the IAM policy condition.
data "aws_iam_policy_document" "bi_athena_access" {
  statement {
    sid    = "RunQueriesInWorkgroup"
    effect = "Allow"
    actions = [
      "athena:StartQueryExecution",
      "athena:GetQueryExecution",
      "athena:GetQueryResults",
      "athena:StopQueryExecution",
    ]
    resources = [module.athena.workgroup_arn]
  }
}

resource "aws_iam_role_policy" "bi_athena" {
  name   = "bi-athena-access"
  role   = aws_iam_role.bi_analyst.id
  policy = data.aws_iam_policy_document.bi_athena_access.json
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root configlive/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module configlive/prod/athena/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-athena?ref=v1.0.0"
}

inputs = {
  name = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/athena && terragrunt apply        # this module
terragrunt run-all apply                      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.

Inputs

Name Type Default Required Description
name string Yes Workgroup name; also derives the managed results bucket and catalog names.
description string "Managed Athena workgroup" No Human-readable workgroup description.
state string "ENABLED" No ENABLED or DISABLED; a disabled workgroup rejects new queries.
selected_engine_version string "Athena engine version 3" No Pinned Athena engine version (or AUTO).
enforce_workgroup_configuration bool true No Force server-side settings to override client-side settings.
publish_cloudwatch_metrics bool true No Publish per-query CloudWatch metrics.
bytes_scanned_cutoff_per_query number 1099511627776 No Max bytes scanned per query before cancellation (null disables; min 10 MB).
requester_pays_enabled bool false No Allow queries against requester-pays S3 buckets.
force_destroy bool false No Allow destroying the workgroup even with named queries present.
output_location string null No Explicit s3:// results URI; if null, a managed location is used.
kms_key_arn string null No KMS key ARN for SSE_KMS result encryption; null falls back to SSE_S3.
expected_bucket_owner string null No Expected account ID owning the results bucket (anti-sniping guard).
results_bucket_owner_full_control bool false No Write results with the BUCKET_OWNER_FULL_CONTROL ACL.
create_results_bucket bool true No Create a dedicated, encrypted results bucket.
results_prefix string "query-results" No Key prefix for results inside the managed bucket.
results_expiration_days number 30 No Lifecycle expiration (days) for results objects; 0 disables.
results_bucket_force_destroy bool false No Allow deleting the managed bucket even when non-empty.
named_queries map(object) {} No Saved named queries (database, query, optional description) keyed by name.
create_glue_data_catalog bool false No Register a Glue-backed Athena data catalog for the workgroup.
tags map(string) {} No Additional tags applied to all created resources.

Outputs

Name Description
workgroup_id The ID (name) of the Athena workgroup.
workgroup_name The workgroup name to pass as WorkGroup in StartQueryExecution / the console.
workgroup_arn The workgroup ARN, for IAM policy conditions.
output_location Effective S3 URI where query results are written.
results_bucket_name Name of the managed results bucket, or null if not created.
results_bucket_arn ARN of the managed results bucket, or null if not created.
named_query_ids Map of named-query name to its Athena named-query ID.
data_catalog_name Name of the Glue data catalog created, or null if none.

Enterprise scenario

A retail data platform team runs a central data lake in S3 and serves dozens of analysts across finance, marketing, and supply-chain. They instantiate this module once per business domain (finance-prod, marketing-prod, supplychain-prod), each with its own KMS-encrypted results bucket, a 500 GiB bytes_scanned_cutoff_per_query to stop accidental full-lake scans, and enforce_workgroup_configuration = true so analysts cannot redirect results to ungoverned buckets. CloudWatch metrics from each workgroup feed a per-domain cost dashboard and a budget alarm, so finance can chargeback Athena spend accurately and the platform team catches a misbehaving query before it scans terabytes.

Best practices

TerraformAWSAthenaModuleIaC
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading