Quick take — Wrap aws_datasync_task in a reusable Terraform module: locations, scheduled transfers, verification, bandwidth throttling, CloudWatch logging, and filter rules for AWS DataSync migrations. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "datasync" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
name = "..." # Task name; seeds default log group and resource policy …
source_s3_bucket_arn = "..." # ARN of the source S3 bucket.
source_bucket_access_role_arn = "..." # IAM role DataSync assumes to read the source bucket.
destination_s3_bucket_arn = "..." # ARN of the destination S3 bucket.
destination_bucket_access_role_arn = "..." # IAM role DataSync assumes to write the destination buck…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
AWS DataSync is a managed data-transfer service that moves files and objects between storage systems (on-premises NFS/SMB shares, self-managed object stores, and AWS storage such as S3, EFS, and FSx) and handles encryption in transit, integrity verification, retries, and incremental scans for you. An aws_datasync_task is the unit of work: it binds a source location to a destination location and carries the options that govern how the copy runs (verification mode, overwrite behaviour, file metadata preservation, bandwidth throttling) plus an optional schedule and include/exclude filters.
The trouble with raw DataSync is that a working transfer is rarely one resource. You need at least two aws_datasync_location_* resources, the task itself, a CloudWatch Log Group with a resource policy that lets the DataSync service write logs, and a coherent block of task options that teams often get wrong on the first pass (people forget OverwriteMode, leave VerifyMode on the expensive default, or never set LogLevel). This module wraps an S3-to-S3 task, including cross-account and cross-region copies, into one variable-driven unit so every transfer in your estate is created the same way: consistently named, logged, optionally scheduled, and with throttling and filters exposed as inputs instead of hand-edited per task.
When to use it
- One-time or recurring S3-to-S3 replication across accounts or regions where you want verification and a managed retry/incremental engine rather than a Lambda + aws s3 sync cron job.
- Lift-and-shift data seeding: pre-loading a target bucket (e.g. a new analytics lake or a DR copy) and then keeping it warm on a schedule.
- Governed, auditable transfers where security needs CloudWatch logs of every object transferred/verified and a stable, reviewable definition in Git.
- Standardising many similar tasks — when you have dozens of buckets to sync and want one module call each instead of copy-pasted location + task + log-group blocks.
If you need agent-based on-prem NFS/SMB transfers, you’d extend this pattern with aws_datasync_location_nfs/_smb and an aws_datasync_agent; the task wiring here is identical, only the location resources change.
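As a rough illustration of that extension, the sketch below swaps the S3 source for an NFS location served through an agent. The agent name, IP address, hostname, and export path are hypothetical placeholders, and the agent VM is assumed to be already deployed and reachable for activation; none of this is part of the module as published.

```hcl
# Hypothetical on-prem agent; ip_address is only used to activate the agent.
resource "aws_datasync_agent" "onprem" {
  name       = "onprem-nfs-agent" # assumed name
  ip_address = "10.0.0.10"        # assumed address of the deployed agent VM
}

# NFS source location reached through the agent above.
resource "aws_datasync_location_nfs" "source" {
  server_hostname = "nas.example.internal" # assumed NFS server
  subdirectory    = "/exports/data"        # assumed export path

  on_prem_config {
    agent_arns = [aws_datasync_agent.onprem.arn]
  }
}

# The task would then point at the NFS location instead of an S3 one:
#   source_location_arn = aws_datasync_location_nfs.source.arn
```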
Module structure
terraform-module-aws-datasync/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf
# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# main.tf
locals {
task_name = var.name
log_group_name = coalesce(var.cloudwatch_log_group_name, "/aws/datasync/${var.name}")
# DataSync only emits per-object/transfer logs when a log level is set AND a
# log group ARN is attached. We expose the level but always wire the group.
log_level = var.log_level
}
# ---------------------------------------------------------------------------
# CloudWatch Logs target + resource policy so the DataSync service can write
# ---------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "this" {
count = var.create_log_group ? 1 : 0
name = local.log_group_name
retention_in_days = var.log_retention_in_days
kms_key_id = var.log_kms_key_arn
tags = var.tags
}
data "aws_iam_policy_document" "log_resource_policy" {
count = var.create_log_group ? 1 : 0
statement {
sid = "DataSyncLogsToCloudWatch"
effect = "Allow"
principals {
type = "Service"
identifiers = ["datasync.amazonaws.com"]
}
actions = [
"logs:PutLogEvents",
"logs:CreateLogStream",
]
resources = ["${aws_cloudwatch_log_group.this[0].arn}:*"]
}
}
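# CloudWatch Logs caps resource policies at 10 per account per region, so a
# large fleet of tasks may need create_log_group = false plus one shared,
# centrally managed policy instead of one policy per module call.
resource "aws_cloudwatch_log_resource_policy" "this" {
count = var.create_log_group ? 1 : 0
policy_name = "${var.name}-datasync-logs"
policy_document = data.aws_iam_policy_document.log_resource_policy[0].json
}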
# ---------------------------------------------------------------------------
# Source + destination S3 locations
# ---------------------------------------------------------------------------
resource "aws_datasync_location_s3" "source" {
s3_bucket_arn = var.source_s3_bucket_arn
subdirectory = var.source_subdirectory
s3_config {
bucket_access_role_arn = var.source_bucket_access_role_arn
}
s3_storage_class = var.source_s3_storage_class
tags = var.tags
}
resource "aws_datasync_location_s3" "destination" {
s3_bucket_arn = var.destination_s3_bucket_arn
subdirectory = var.destination_subdirectory
s3_config {
bucket_access_role_arn = var.destination_bucket_access_role_arn
}
s3_storage_class = var.destination_s3_storage_class
tags = var.tags
}
# ---------------------------------------------------------------------------
# The transfer task
# ---------------------------------------------------------------------------
resource "aws_datasync_task" "this" {
name = local.task_name
source_location_arn = aws_datasync_location_s3.source.arn
destination_location_arn = aws_datasync_location_s3.destination.arn
cloudwatch_log_group_arn = var.create_log_group ? aws_cloudwatch_log_group.this[0].arn : var.cloudwatch_log_group_arn
options {
verify_mode = var.verify_mode
overwrite_mode = var.overwrite_mode
transfer_mode = var.transfer_mode
preserve_deleted_files = var.preserve_deleted_files
bytes_per_second = var.bytes_per_second
task_queueing = var.task_queueing
log_level = local.log_level
posix_permissions = "NONE"
uid = "NONE"
gid = "NONE"
atime = "NONE"
mtime = "NONE"
}
# Optional include/exclude filters. AWS DataSync joins multiple patterns in a
# single rule with "|", so we collapse the provided list accordingly.
dynamic "includes" {
for_each = length(var.includes) > 0 ? [1] : []
content {
filter_type = "SIMPLE_PATTERN"
value = join("|", var.includes)
}
}
dynamic "excludes" {
for_each = length(var.excludes) > 0 ? [1] : []
content {
filter_type = "SIMPLE_PATTERN"
value = join("|", var.excludes)
}
}
# Run on a cron schedule only when one is supplied.
dynamic "schedule" {
for_each = var.schedule_expression == null ? [] : [var.schedule_expression]
content {
schedule_expression = schedule.value
}
}
tags = var.tags
}
# variables.tf
variable "name" {
description = "Name for the DataSync task; also seeds the default log group and resource policy names."
type = string
validation {
condition = can(regex("^[A-Za-z0-9_.-]{1,256}$", var.name))
error_message = "name may contain only letters, numbers, and the characters _ . - and must be 1-256 chars."
}
}
# --- Source location -------------------------------------------------------
variable "source_s3_bucket_arn" {
description = "ARN of the source S3 bucket."
type = string
validation {
condition = can(regex("^arn:aws[a-z-]*:s3:::", var.source_s3_bucket_arn))
error_message = "source_s3_bucket_arn must be a valid S3 bucket ARN (arn:aws:s3:::bucket)."
}
}
variable "source_subdirectory" {
description = "Prefix within the source bucket to read from (must start with /)."
type = string
default = "/"
validation {
condition = startswith(var.source_subdirectory, "/")
error_message = "source_subdirectory must start with a forward slash."
}
}
variable "source_bucket_access_role_arn" {
description = "IAM role ARN DataSync assumes to read the source bucket (must trust datasync.amazonaws.com)."
type = string
}
variable "source_s3_storage_class" {
description = "Storage class DataSync treats the source as. Usually STANDARD; set INTELLIGENT_TIERING/GLACIER only when reading restored objects."
type = string
default = "STANDARD"
}
# --- Destination location --------------------------------------------------
variable "destination_s3_bucket_arn" {
description = "ARN of the destination S3 bucket."
type = string
validation {
condition = can(regex("^arn:aws[a-z-]*:s3:::", var.destination_s3_bucket_arn))
error_message = "destination_s3_bucket_arn must be a valid S3 bucket ARN (arn:aws:s3:::bucket)."
}
}
variable "destination_subdirectory" {
description = "Prefix within the destination bucket to write to (must start with /)."
type = string
default = "/"
validation {
condition = startswith(var.destination_subdirectory, "/")
error_message = "destination_subdirectory must start with a forward slash."
}
}
variable "destination_bucket_access_role_arn" {
description = "IAM role ARN DataSync assumes to write the destination bucket (must trust datasync.amazonaws.com)."
type = string
}
variable "destination_s3_storage_class" {
description = "Storage class for objects written to the destination bucket."
type = string
default = "STANDARD"
validation {
condition = contains([
"STANDARD", "STANDARD_IA", "ONEZONE_IA", "INTELLIGENT_TIERING",
"GLACIER", "GLACIER_INSTANT_RETRIEVAL", "DEEP_ARCHIVE", "OUTPOSTS"
], var.destination_s3_storage_class)
error_message = "destination_s3_storage_class is not a valid S3 DataSync storage class."
}
}
# --- Task options ----------------------------------------------------------
variable "verify_mode" {
description = "Integrity check: ONLY_FILES_TRANSFERRED (cheap, recommended), POINT_IN_TIME_CONSISTENT (full scan), or NONE."
type = string
default = "ONLY_FILES_TRANSFERRED"
validation {
condition = contains(["ONLY_FILES_TRANSFERRED", "POINT_IN_TIME_CONSISTENT", "NONE"], var.verify_mode)
error_message = "verify_mode must be ONLY_FILES_TRANSFERRED, POINT_IN_TIME_CONSISTENT, or NONE."
}
}
variable "overwrite_mode" {
description = "ALWAYS overwrites changed destination objects; NEVER skips objects that already exist."
type = string
default = "ALWAYS"
validation {
condition = contains(["ALWAYS", "NEVER"], var.overwrite_mode)
error_message = "overwrite_mode must be ALWAYS or NEVER."
}
}
variable "transfer_mode" {
description = "CHANGED copies only files that differ (incremental); ALL copies every file each run."
type = string
default = "CHANGED"
validation {
condition = contains(["CHANGED", "ALL"], var.transfer_mode)
error_message = "transfer_mode must be CHANGED or ALL."
}
}
variable "preserve_deleted_files" {
description = "PRESERVE keeps files in the destination that no longer exist in the source; REMOVE deletes them (true mirror)."
type = string
default = "PRESERVE"
validation {
condition = contains(["PRESERVE", "REMOVE"], var.preserve_deleted_files)
error_message = "preserve_deleted_files must be PRESERVE or REMOVE."
}
}
variable "bytes_per_second" {
description = "Bandwidth throttle in bytes/sec. -1 means unlimited (use the full link)."
type = number
default = -1
validation {
condition = var.bytes_per_second == -1 || var.bytes_per_second >= 1048576
error_message = "bytes_per_second must be -1 (unlimited) or at least 1048576 (1 MiB/s)."
}
}
variable "task_queueing" {
description = "ENABLED queues a new execution if one is already running instead of failing."
type = string
default = "ENABLED"
validation {
condition = contains(["ENABLED", "DISABLED"], var.task_queueing)
error_message = "task_queueing must be ENABLED or DISABLED."
}
}
variable "log_level" {
description = "DataSync log verbosity to CloudWatch: OFF, BASIC, or TRANSFER (per-object)."
type = string
default = "TRANSFER"
validation {
condition = contains(["OFF", "BASIC", "TRANSFER"], var.log_level)
error_message = "log_level must be OFF, BASIC, or TRANSFER."
}
}
# --- Filters & schedule ----------------------------------------------------
variable "includes" {
description = "List of SIMPLE_PATTERN globs to include (e.g. [\"/reports/*\", \"/2026/*\"]). Empty = include everything."
type = list(string)
default = []
}
variable "excludes" {
description = "List of SIMPLE_PATTERN globs to exclude (e.g. [\"*/.snapshot/*\", \"*/_temp/*\"]). Empty = exclude nothing."
type = list(string)
default = []
}
variable "schedule_expression" {
description = "Cron/rate expression to run the task automatically (e.g. \"cron(0 2 * * ? *)\"). null = no schedule (run on demand)."
type = string
default = null
}
# --- Logging plumbing ------------------------------------------------------
variable "create_log_group" {
description = "Create and wire a CloudWatch Log Group (with the required resource policy). Set false to reuse an existing one via cloudwatch_log_group_arn."
type = bool
default = true
}
variable "cloudwatch_log_group_name" {
description = "Override the managed log group name. Defaults to /aws/datasync/<name>."
type = string
default = null
}
variable "cloudwatch_log_group_arn" {
description = "ARN of an existing log group to attach when create_log_group = false."
type = string
default = null
}
variable "log_retention_in_days" {
description = "Retention for the managed log group."
type = number
default = 30
}
variable "log_kms_key_arn" {
description = "Optional KMS key ARN to encrypt the managed log group."
type = string
default = null
}
variable "tags" {
description = "Tags applied to all created resources."
type = map(string)
default = {}
}
# outputs.tf
output "task_arn" {
description = "ARN of the DataSync task (use to start executions or build alarms)."
value = aws_datasync_task.this.arn
}
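# aws_datasync_task.this.id returns the full ARN, so the short ID is derived
# from the last ARN segment (arn:...:task/task-xxxxxxxx).
output "task_id" {
description = "Short ID of the DataSync task (the task-... suffix of the ARN), as expected by the TaskId CloudWatch dimension."
value = element(split("/", aws_datasync_task.this.arn), 1)
}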
output "task_name" {
description = "Name of the DataSync task."
value = aws_datasync_task.this.name
}
output "source_location_arn" {
description = "ARN of the source S3 location."
value = aws_datasync_location_s3.source.arn
}
output "destination_location_arn" {
description = "ARN of the destination S3 location."
value = aws_datasync_location_s3.destination.arn
}
output "cloudwatch_log_group_arn" {
description = "ARN of the CloudWatch Log Group receiving transfer logs."
value = var.create_log_group ? aws_cloudwatch_log_group.this[0].arn : var.cloudwatch_log_group_arn
}
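One gap in the listing above: with create_log_group = false and no cloudwatch_log_group_arn, the task is created without any log destination. A minimal sketch of a guard follows, assuming Terraform 1.5+ (which versions.tf already requires); the check block name is chosen here for illustration. Check blocks report a warning rather than blocking the apply.

```hcl
# Warns at plan/apply time when logging has been switched off by accident.
check "log_destination_present" {
  assert {
    condition     = var.create_log_group || var.cloudwatch_log_group_arn != null
    error_message = "Set cloudwatch_log_group_arn when create_log_group = false, otherwise the task has no log group."
  }
}
```

If a hard failure is preferred, the same condition could sit in a lifecycle precondition block on aws_datasync_task.this.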
How to use it
module "datasync" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
name = "prod-lake-to-dr-nightly"
# Source: production data lake (this account)
source_s3_bucket_arn = aws_s3_bucket.prod_lake.arn
source_subdirectory = "/curated/"
source_bucket_access_role_arn = aws_iam_role.datasync_source.arn
# Destination: DR bucket in another region/account
destination_s3_bucket_arn = "arn:aws:s3:::acme-dr-lake-apse2"
destination_subdirectory = "/curated/"
destination_bucket_access_role_arn = aws_iam_role.datasync_destination.arn
destination_s3_storage_class = "STANDARD_IA"
# Incremental nightly mirror, throttled to 100 MiB/s during the window
transfer_mode = "CHANGED"
verify_mode = "ONLY_FILES_TRANSFERRED"
overwrite_mode = "ALWAYS"
preserve_deleted_files = "REMOVE" # true mirror for DR
bytes_per_second = 104857600 # 100 MiB/s
schedule_expression = "cron(0 18 * * ? *)" # 18:00 UTC daily
excludes = ["*/_temp/*", "*/.spark-staging/*"]
log_level = "TRANSFER"
log_retention_in_days = 90
tags = {
Environment = "prod"
Workload = "dr"
ManagedBy = "terraform"
}
}
# Downstream: alarm on a failed task execution using the task_id output.
resource "aws_cloudwatch_metric_alarm" "datasync_failures" {
alarm_name = "datasync-${module.datasync.task_name}-failures"
namespace = "AWS/DataSync"
metric_name = "FilesFailed"
statistic = "Sum"
comparison_operator = "GreaterThanThreshold"
threshold = 0
period = 3600
evaluation_periods = 1
treat_missing_data = "notBreaching"
dimensions = {
TaskId = module.datasync.task_id
}
alarm_actions = [aws_sns_topic.ops_alerts.arn]
}
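To see what the module sends to DataSync for the excludes list in the call above, the join can be reproduced in isolation. This is a sketch for a scratch configuration or terraform console; the local and output names are made up for the illustration.

```hcl
locals {
  # Same list as the excludes input in the example call.
  example_excludes = ["*/_temp/*", "*/.spark-staging/*"]
}

output "example_exclude_filter" {
  # Mirrors the join("|", var.excludes) expression in main.tf.
  # Evaluates to "*/_temp/*|*/.spark-staging/*"
  value = join("|", local.example_excludes)
}
```

Because the list collapses into one string, a pattern that itself contains a "|" character would be read by DataSync as two patterns.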
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/datasync/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
}
inputs = {
name = "..."
source_s3_bucket_arn = "..."
source_bucket_access_role_arn = "..."
destination_s3_bucket_arn = "..."
destination_bucket_access_role_arn = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/datasync && terragrunt apply # this module
cd live/prod && terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
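For example, a dev environment might reuse the same module with a lighter configuration. The sketch below assumes a live/dev/datasync/terragrunt.hcl path that mirrors the prod layout; the task name is hypothetical and the "..." values are placeholders to fill in.

```hcl
# live/dev/datasync/terragrunt.hcl (assumed path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
}

inputs = {
  name                               = "dev-lake-to-dr-adhoc" # hypothetical name
  source_s3_bucket_arn               = "..."
  source_bucket_access_role_arn      = "..."
  destination_s3_bucket_arn          = "..."
  destination_bucket_access_role_arn = "..."

  # Differences from prod: run on demand, lighter logging, shorter retention.
  schedule_expression   = null
  log_level             = "BASIC"
  log_retention_in_days = 7
}
```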
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | — | yes | Task name; seeds default log group and resource policy names. |
| source_s3_bucket_arn | string | — | yes | ARN of the source S3 bucket. |
| source_subdirectory | string | "/" | no | Prefix within the source bucket (must start with /). |
| source_bucket_access_role_arn | string | — | yes | IAM role DataSync assumes to read the source bucket. |
| source_s3_storage_class | string | "STANDARD" | no | Storage class DataSync treats the source as. |
| destination_s3_bucket_arn | string | — | yes | ARN of the destination S3 bucket. |
| destination_subdirectory | string | "/" | no | Prefix within the destination bucket (must start with /). |
| destination_bucket_access_role_arn | string | — | yes | IAM role DataSync assumes to write the destination bucket. |
| destination_s3_storage_class | string | "STANDARD" | no | Storage class for written objects (validated against the allowed set). |
| verify_mode | string | "ONLY_FILES_TRANSFERRED" | no | Integrity check mode. |
| overwrite_mode | string | "ALWAYS" | no | ALWAYS or NEVER for changed destination objects. |
| transfer_mode | string | "CHANGED" | no | CHANGED (incremental) or ALL. |
| preserve_deleted_files | string | "PRESERVE" | no | PRESERVE or REMOVE (true mirror). |
| bytes_per_second | number | -1 | no | Bandwidth throttle; -1 = unlimited, else ≥ 1 MiB/s. |
| task_queueing | string | "ENABLED" | no | Queue overlapping executions instead of failing. |
| log_level | string | "TRANSFER" | no | OFF, BASIC, or TRANSFER. |
| includes | list(string) | [] | no | SIMPLE_PATTERN globs to include. |
| excludes | list(string) | [] | no | SIMPLE_PATTERN globs to exclude. |
| schedule_expression | string | null | no | Cron/rate expression; null = run on demand. |
| create_log_group | bool | true | no | Create + wire the log group and resource policy. |
| cloudwatch_log_group_name | string | null | no | Override managed log group name. |
| cloudwatch_log_group_arn | string | null | no | Existing log group ARN when create_log_group = false. |
| log_retention_in_days | number | 30 | no | Retention for the managed log group. |
| log_kms_key_arn | string | null | no | Optional KMS key for log encryption. |
| tags | map(string) | {} | no | Tags applied to all resources. |
Outputs
| Name | Description |
|---|---|
| task_arn | ARN of the DataSync task (start executions, build alarms). |
| task_id | ID of the DataSync task (the TaskId CloudWatch dimension). |
| task_name | Name of the DataSync task. |
| source_location_arn | ARN of the source S3 location. |
| destination_location_arn | ARN of the destination S3 location. |
| cloudwatch_log_group_arn | ARN of the log group receiving transfer logs. |
Enterprise scenario
A financial-services firm runs its regulated data lake in eu-west-1 and must hold an immutable, region-isolated DR copy in eu-central-1 with a 24-hour RPO. They instantiate this module once per lake zone (raw, curated, published), each with preserve_deleted_files = "REMOVE" for a faithful mirror, schedule_expression = "cron(0 18 * * ? *)", and bytes_per_second capped to 100 MiB/s so nightly replication never starves their batch ETL. Because every task writes per-object TRANSFER logs to a KMS-encrypted CloudWatch group retained for 90 days, audit can prove exactly which objects moved and were verified — and the FilesFailed alarm wired to the task_id output pages the on-call team before the RPO is breached.
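A sketch of how the three zones could be stamped out from a single module block with for_each. The bucket names, the two IAM role references, and the KMS key reference are hypothetical and assumed to be defined elsewhere; only the module inputs come from this article.

```hcl
locals {
  lake_zones = toset(["raw", "curated", "published"])
}

module "datasync_zone" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
  for_each = local.lake_zones

  name = "prod-${each.key}-to-dr-nightly"

  # Hypothetical bucket naming; substitute the real ARNs.
  source_s3_bucket_arn               = "arn:aws:s3:::example-lake-${each.key}-euw1"
  source_bucket_access_role_arn      = aws_iam_role.datasync_source.arn      # assumed to exist
  destination_s3_bucket_arn          = "arn:aws:s3:::example-dr-${each.key}-euc1"
  destination_bucket_access_role_arn = aws_iam_role.datasync_destination.arn # assumed to exist

  preserve_deleted_files = "REMOVE"             # faithful mirror
  schedule_expression    = "cron(0 18 * * ? *)" # 18:00 UTC daily
  bytes_per_second       = 104857600            # 100 MiB/s

  log_level             = "TRANSFER"
  log_retention_in_days = 90
  log_kms_key_arn       = aws_kms_key.datasync_logs.arn # hypothetical key
}
```

Each instance is then addressable by zone, for example module.datasync_zone["curated"].task_arn.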
Best practices
- Scope the access roles tightly, per direction. The source role needs s3:GetObject/s3:ListBucket (and kms:Decrypt for SSE-KMS source buckets); the destination role needs s3:PutObject/s3:GetObject/s3:ListBucket/s3:DeleteObject (and kms:GenerateDataKey). Both trust policies must allow datasync.amazonaws.com; never reuse one over-broad role for both ends.
- Keep verify_mode = ONLY_FILES_TRANSFERRED unless you genuinely need a full check. POINT_IN_TIME_CONSISTENT re-scans the entire source and destination at the end of every run, inflating cost and runtime on large buckets; reserve it for compliance snapshots, not nightly incrementals.
- Throttle scheduled tasks with bytes_per_second. DataSync will saturate the link by default. Setting an explicit cap (e.g. 100 MiB/s) protects production traffic and, on cross-region copies, keeps egress charges predictable.
- Choose preserve_deleted_files deliberately. REMOVE gives a true mirror (right for DR) but propagates accidental source deletions, so pair it with destination bucket versioning/Object Lock. Use PRESERVE for append-only archival landing zones.
- Cut cost with filters and storage class. Using excludes for staging/temp prefixes avoids paying to copy throwaway data, and setting destination_s3_storage_class = "STANDARD_IA" or GLACIER_INSTANT_RETRIEVAL on cold DR copies trims storage spend; just remember DataSync bills per GB transferred regardless of class.
- Name and tag for fleet operability. A scheme like <env>-<source>-to-<dest>-<cadence> plus consistent tags makes tasks, their /aws/datasync/<name> log groups, and FilesTransferred/FilesFailed metrics line up across dozens of transfers when you triage at 3 a.m.
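A minimal sketch of the read-only source role described in the first bullet, reusing the aws_iam_role.datasync_source and aws_s3_bucket.prod_lake names from the usage example. The role and policy names are hypothetical, s3:GetBucketLocation is included as an assumption about what DataSync needs at the bucket level, and KMS permissions are left out; check the action list against the DataSync documentation before relying on it.

```hcl
# Trust policy: only the DataSync service may assume the role.
data "aws_iam_policy_document" "datasync_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["datasync.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "datasync_source" {
  name               = "datasync-source-read" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.datasync_trust.json
}

# Read-only permissions, scoped to the one source bucket.
data "aws_iam_policy_document" "datasync_source_read" {
  statement {
    sid       = "ListSourceBucket"
    actions   = ["s3:ListBucket", "s3:GetBucketLocation"]
    resources = [aws_s3_bucket.prod_lake.arn]
  }

  statement {
    sid       = "ReadSourceObjects"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.prod_lake.arn}/*"]
  }
}

resource "aws_iam_role_policy" "datasync_source_read" {
  name   = "datasync-source-read" # hypothetical name
  role   = aws_iam_role.datasync_source.id
  policy = data.aws_iam_policy_document.datasync_source_read.json
}
```

The destination role would follow the same shape with the write and delete actions from the bullet, attached to a separate role so that neither end can both read the source and write the target.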