Terraform Module: AWS DataSync — repeatable, scheduled data transfers without bespoke scripts

Quick take — Wrap aws_datasync_task in a reusable Terraform module: locations, scheduled transfers, verification, bandwidth throttling, CloudWatch logging, and filter rules for AWS DataSync migrations. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "datasync" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name                               = "..."  # Task name; seeds default log group and resource policy …
  source_s3_bucket_arn               = "..."  # ARN of the source S3 bucket.
  source_bucket_access_role_arn      = "..."  # IAM role DataSync assumes to read the source bucket.
  destination_s3_bucket_arn          = "..."  # ARN of the destination S3 bucket.
  destination_bucket_access_role_arn = "..."  # IAM role DataSync assumes to write the destination buck…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS DataSync is a managed data-transfer service that moves files and objects between storage systems (on-premises NFS/SMB shares, self-managed object stores, and AWS storage such as S3, EFS, and FSx), handling encryption in transit, integrity verification, retries, and incremental scans for you. An aws_datasync_task is the unit of work: it binds a source location to a destination location and carries the options that govern how the copy runs (verification mode, overwrite behaviour, file metadata preservation, bandwidth throttling) plus an optional schedule and include/exclude filters.

The trouble with raw DataSync is that a working transfer is rarely one resource. You need at least two aws_datasync_location_* resources, the task itself, a CloudWatch Log Group with the right resource policy so the DataSync service can write logs, and a coherent block of task options that production teams almost always get wrong on the first pass (people forget OverwriteMode, leave VerifyMode on the expensive default, or never set LogLevel). This module wraps an S3-to-S3 (cross-account / cross-region) task, the most common managed-transfer shape, into one var-driven unit so every transfer in your estate is created the same way: consistently named, logged, optionally scheduled, and with throttling and filters exposed as inputs instead of hand-edited per task.

When to use it

One-time or recurring S3-to-S3 replication across accounts or regions where you want verification and a managed retry/incremental engine rather than a Lambda + aws s3 sync cron job.
Lift-and-shift data seeding — pre-loading a target bucket (e.g. a new analytics lake or a DR copy) and then keeping it warm on a schedule.
Governed, auditable transfers where security needs CloudWatch logs of every object transferred/verified and a stable, reviewable definition in Git.
Standardising many similar tasks — when you have dozens of buckets to sync and want one module call each instead of copy-pasted location + task + log-group blocks.

If you need agent-based on-prem NFS/SMB transfers, you’d extend this pattern with aws_datasync_location_nfs/_smb and an aws_datasync_agent; the task wiring here is identical, only the location resources change.
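
As a rough illustration of that extension, the sketch below swaps the S3 source for an on-premises NFS export. It is not part of this module: the variable agent_activation_key, the resource names, the hostname, and the export path are all hypothetical placeholders.

```hcl
# Hypothetical extension, not part of the module as published.
variable "agent_activation_key" {
  description = "Activation key obtained from the deployed DataSync agent (placeholder input)."
  type        = string
  sensitive   = true
}

# Registers the on-premises agent with DataSync.
resource "aws_datasync_agent" "onprem" {
  name           = "onprem-agent" # placeholder name
  activation_key = var.agent_activation_key
}

# NFS source location reached through the agent.
resource "aws_datasync_location_nfs" "source" {
  server_hostname = "nfs.example.internal" # placeholder hostname
  subdirectory    = "/exports/data"        # placeholder export path

  on_prem_config {
    agent_arns = [aws_datasync_agent.onprem.arn]
  }
}
```

The task would then take source_location_arn = aws_datasync_location_nfs.source.arn. For file-system sources you would probably also revisit the metadata options that main.tf pins to "NONE" (posix_permissions, uid, gid, atime, mtime), since those defaults are chosen for object storage.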

Module structure

terraform-module-aws-datasync/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  task_name      = var.name
  log_group_name = coalesce(var.cloudwatch_log_group_name, "/aws/datasync/${var.name}")

  # DataSync only emits per-object/transfer logs when a log level is set AND a
  # log group ARN is attached. We expose the level but always wire the group.
  log_level = var.log_level
}

# ---------------------------------------------------------------------------
# CloudWatch Logs target + resource policy so the DataSync service can write
# ---------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "this" {
  count = var.create_log_group ? 1 : 0

  name              = local.log_group_name
  retention_in_days = var.log_retention_in_days
  kms_key_id        = var.log_kms_key_arn
  tags              = var.tags
}

data "aws_iam_policy_document" "log_resource_policy" {
  count = var.create_log_group ? 1 : 0

  statement {
    sid    = "DataSyncLogsToCloudWatch"
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["datasync.amazonaws.com"]
    }

    actions = [
      "logs:PutLogEvents",
      "logs:CreateLogStream",
    ]

    resources = ["${aws_cloudwatch_log_group.this[0].arn}:*"]
  }
}

resource "aws_cloudwatch_log_resource_policy" "this" {
  count = var.create_log_group ? 1 : 0

  policy_name     = "${var.name}-datasync-logs"
  policy_document = data.aws_iam_policy_document.log_resource_policy[0].json
}

# ---------------------------------------------------------------------------
# Source + destination S3 locations
# ---------------------------------------------------------------------------
resource "aws_datasync_location_s3" "source" {
  s3_bucket_arn = var.source_s3_bucket_arn
  subdirectory  = var.source_subdirectory

  s3_config {
    bucket_access_role_arn = var.source_bucket_access_role_arn
  }

  s3_storage_class = var.source_s3_storage_class
  tags             = var.tags
}

resource "aws_datasync_location_s3" "destination" {
  s3_bucket_arn = var.destination_s3_bucket_arn
  subdirectory  = var.destination_subdirectory

  s3_config {
    bucket_access_role_arn = var.destination_bucket_access_role_arn
  }

  s3_storage_class = var.destination_s3_storage_class
  tags             = var.tags
}

# ---------------------------------------------------------------------------
# The transfer task
# ---------------------------------------------------------------------------
resource "aws_datasync_task" "this" {
  name                     = local.task_name
  source_location_arn      = aws_datasync_location_s3.source.arn
  destination_location_arn = aws_datasync_location_s3.destination.arn

  cloudwatch_log_group_arn = var.create_log_group ? aws_cloudwatch_log_group.this[0].arn : var.cloudwatch_log_group_arn

  options {
    verify_mode            = var.verify_mode
    overwrite_mode         = var.overwrite_mode
    transfer_mode          = var.transfer_mode
    preserve_deleted_files = var.preserve_deleted_files
    bytes_per_second       = var.bytes_per_second
    task_queueing          = var.task_queueing
    log_level              = local.log_level
    posix_permissions      = "NONE"
    uid                    = "NONE"
    gid                    = "NONE"
    atime                  = "NONE"
    mtime                  = "NONE"
  }

  # Optional include/exclude filters. AWS DataSync joins multiple patterns in a
  # single rule with "|", so we collapse the provided list accordingly.
  dynamic "includes" {
    for_each = length(var.includes) > 0 ? [1] : []
    content {
      filter_type = "SIMPLE_PATTERN"
      value       = join("|", var.includes)
    }
  }

  dynamic "excludes" {
    for_each = length(var.excludes) > 0 ? [1] : []
    content {
      filter_type = "SIMPLE_PATTERN"
      value       = join("|", var.excludes)
    }
  }

  # Run on a cron schedule only when one is supplied.
  dynamic "schedule" {
    for_each = var.schedule_expression == null ? [] : [var.schedule_expression]
    content {
      schedule_expression = schedule.value
    }
  }

  tags = var.tags
}

# variables.tf

variable "name" {
  description = "Name for the DataSync task; also seeds the default log group and resource policy names."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9_.-]{1,256}$", var.name))
    error_message = "name may contain only letters, numbers, and the characters _ . - and must be 1-256 chars."
  }
}

# --- Source location -------------------------------------------------------
variable "source_s3_bucket_arn" {
  description = "ARN of the source S3 bucket."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:s3:::", var.source_s3_bucket_arn))
    error_message = "source_s3_bucket_arn must be a valid S3 bucket ARN (arn:aws:s3:::bucket)."
  }
}

variable "source_subdirectory" {
  description = "Prefix within the source bucket to read from (must start with /)."
  type        = string
  default     = "/"

  validation {
    condition     = startswith(var.source_subdirectory, "/")
    error_message = "source_subdirectory must start with a forward slash."
  }
}

variable "source_bucket_access_role_arn" {
  description = "IAM role ARN DataSync assumes to read the source bucket (must trust datasync.amazonaws.com)."
  type        = string
}

variable "source_s3_storage_class" {
  description = "Storage class set on the source location. DataSync uses a location's storage class only when writing to it as a destination, so the default STANDARD is normally fine for a source."
  type        = string
  default     = "STANDARD"
}

# --- Destination location --------------------------------------------------
variable "destination_s3_bucket_arn" {
  description = "ARN of the destination S3 bucket."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:s3:::", var.destination_s3_bucket_arn))
    error_message = "destination_s3_bucket_arn must be a valid S3 bucket ARN (arn:aws:s3:::bucket)."
  }
}

variable "destination_subdirectory" {
  description = "Prefix within the destination bucket to write to (must start with /)."
  type        = string
  default     = "/"

  validation {
    condition     = startswith(var.destination_subdirectory, "/")
    error_message = "destination_subdirectory must start with a forward slash."
  }
}

variable "destination_bucket_access_role_arn" {
  description = "IAM role ARN DataSync assumes to write the destination bucket (must trust datasync.amazonaws.com)."
  type        = string
}

variable "destination_s3_storage_class" {
  description = "Storage class for objects written to the destination bucket."
  type        = string
  default     = "STANDARD"

  validation {
    condition = contains([
      "STANDARD", "STANDARD_IA", "ONEZONE_IA", "INTELLIGENT_TIERING",
      "GLACIER", "GLACIER_INSTANT_RETRIEVAL", "DEEP_ARCHIVE", "OUTPOSTS"
    ], var.destination_s3_storage_class)
    error_message = "destination_s3_storage_class is not a valid S3 DataSync storage class."
  }
}

# --- Task options ----------------------------------------------------------
variable "verify_mode" {
  description = "Integrity check: ONLY_FILES_TRANSFERRED (cheap, recommended), POINT_IN_TIME_CONSISTENT (full scan), or NONE."
  type        = string
  default     = "ONLY_FILES_TRANSFERRED"

  validation {
    condition     = contains(["ONLY_FILES_TRANSFERRED", "POINT_IN_TIME_CONSISTENT", "NONE"], var.verify_mode)
    error_message = "verify_mode must be ONLY_FILES_TRANSFERRED, POINT_IN_TIME_CONSISTENT, or NONE."
  }
}

variable "overwrite_mode" {
  description = "ALWAYS overwrites changed destination objects; NEVER skips objects that already exist."
  type        = string
  default     = "ALWAYS"

  validation {
    condition     = contains(["ALWAYS", "NEVER"], var.overwrite_mode)
    error_message = "overwrite_mode must be ALWAYS or NEVER."
  }
}

variable "transfer_mode" {
  description = "CHANGED copies only files that differ (incremental); ALL copies every file each run."
  type        = string
  default     = "CHANGED"

  validation {
    condition     = contains(["CHANGED", "ALL"], var.transfer_mode)
    error_message = "transfer_mode must be CHANGED or ALL."
  }
}

variable "preserve_deleted_files" {
  description = "PRESERVE keeps files in the destination that no longer exist in the source; REMOVE deletes them (true mirror)."
  type        = string
  default     = "PRESERVE"

  validation {
    condition     = contains(["PRESERVE", "REMOVE"], var.preserve_deleted_files)
    error_message = "preserve_deleted_files must be PRESERVE or REMOVE."
  }
}

variable "bytes_per_second" {
  description = "Bandwidth throttle in bytes/sec. -1 means unlimited (use the full link)."
  type        = number
  default     = -1

  validation {
    condition     = var.bytes_per_second == -1 || var.bytes_per_second >= 1048576
    error_message = "bytes_per_second must be -1 (unlimited) or at least 1048576 (1 MiB/s)."
  }
}

variable "task_queueing" {
  description = "ENABLED queues a new execution if one is already running instead of failing."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "DISABLED"], var.task_queueing)
    error_message = "task_queueing must be ENABLED or DISABLED."
  }
}

variable "log_level" {
  description = "DataSync log verbosity to CloudWatch: OFF, BASIC, or TRANSFER (per-object)."
  type        = string
  default     = "TRANSFER"

  validation {
    condition     = contains(["OFF", "BASIC", "TRANSFER"], var.log_level)
    error_message = "log_level must be OFF, BASIC, or TRANSFER."
  }
}

# --- Filters & schedule ----------------------------------------------------
variable "includes" {
  description = "List of SIMPLE_PATTERN globs to include (e.g. [\"/reports/*\", \"/2026/*\"]). Empty = include everything."
  type        = list(string)
  default     = []
}

variable "excludes" {
  description = "List of SIMPLE_PATTERN globs to exclude (e.g. [\"*/.snapshot/*\", \"*/_temp/*\"]). Empty = exclude nothing."
  type        = list(string)
  default     = []
}

variable "schedule_expression" {
  description = "Cron/rate expression to run the task automatically (e.g. \"cron(0 2 * * ? *)\"). null = no schedule (run on demand)."
  type        = string
  default     = null
}

# --- Logging plumbing ------------------------------------------------------
variable "create_log_group" {
  description = "Create and wire a CloudWatch Log Group (with the required resource policy). Set false to reuse an existing one via cloudwatch_log_group_arn."
  type        = bool
  default     = true
}

variable "cloudwatch_log_group_name" {
  description = "Override the managed log group name. Defaults to /aws/datasync/<name>."
  type        = string
  default     = null
}

variable "cloudwatch_log_group_arn" {
  description = "ARN of an existing log group to attach when create_log_group = false."
  type        = string
  default     = null
}

variable "log_retention_in_days" {
  description = "Retention for the managed log group."
  type        = number
  default     = 30
}

variable "log_kms_key_arn" {
  description = "Optional KMS key ARN to encrypt the managed log group."
  type        = string
  default     = null
}

variable "tags" {
  description = "Tags applied to all created resources."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "task_arn" {
  description = "ARN of the DataSync task (use to start executions or build alarms)."
  value       = aws_datasync_task.this.arn
}

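output "task_id" {
  description = "ID of the DataSync task (the task-... segment of the ARN, as used by the TaskId CloudWatch dimension)."
  # The provider's id attribute for aws_datasync_task is the full ARN,
  # so take the segment after "task/" instead.
  value       = element(split("/", aws_datasync_task.this.arn), 1)
}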

output "task_name" {
  description = "Name of the DataSync task."
  value       = aws_datasync_task.this.name
}

output "source_location_arn" {
  description = "ARN of the source S3 location."
  value       = aws_datasync_location_s3.source.arn
}

output "destination_location_arn" {
  description = "ARN of the destination S3 location."
  value       = aws_datasync_location_s3.destination.arn
}

output "cloudwatch_log_group_arn" {
  description = "ARN of the CloudWatch Log Group receiving transfer logs."
  value       = var.create_log_group ? aws_cloudwatch_log_group.this[0].arn : var.cloudwatch_log_group_arn
}
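
The per-variable validations above cannot see each other, so combinations that are individually valid can still be rejected or silently misbehave. One possible guard, sketched here as an optional addition to main.tf rather than something the module ships, uses a terraform_data resource with preconditions. The resource name option_guard is hypothetical, and the second rule reflects an assumption about how you want logging to behave.

```hcl
# Sketch only: fail the plan early on option combinations that do not make sense together.
resource "terraform_data" "option_guard" {
  lifecycle {
    # DataSync does not allow deleting destination files when every file is
    # transferred on each run, because it skips the destination scan in that mode.
    precondition {
      condition     = !(var.preserve_deleted_files == "REMOVE" && var.transfer_mode == "ALL")
      error_message = "preserve_deleted_files = \"REMOVE\" cannot be combined with transfer_mode = \"ALL\"."
    }

    # Assumption: if logging is on and the module is not creating the log group,
    # the caller is expected to pass an existing group ARN.
    precondition {
      condition     = var.create_log_group || var.log_level == "OFF" || var.cloudwatch_log_group_arn != null
      error_message = "Set cloudwatch_log_group_arn (or log_level = \"OFF\") when create_log_group = false."
    }
  }
}
```

terraform_data is built into Terraform 1.4 and later, so it fits within the required_version constraint in versions.tf without adding a provider.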

How to use it

module "datasync" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name = "prod-lake-to-dr-nightly"

  # Source: production data lake (this account)
  source_s3_bucket_arn          = aws_s3_bucket.prod_lake.arn
  source_subdirectory           = "/curated/"
  source_bucket_access_role_arn = aws_iam_role.datasync_source.arn

  # Destination: DR bucket in another region/account
  destination_s3_bucket_arn          = "arn:aws:s3:::acme-dr-lake-apse2"
  destination_subdirectory           = "/curated/"
  destination_bucket_access_role_arn = aws_iam_role.datasync_destination.arn
  destination_s3_storage_class       = "STANDARD_IA"

  # Incremental nightly mirror, throttled to 100 MiB/s during the window
  transfer_mode          = "CHANGED"
  verify_mode            = "ONLY_FILES_TRANSFERRED"
  overwrite_mode         = "ALWAYS"
  preserve_deleted_files = "REMOVE"          # true mirror for DR
  bytes_per_second       = 104857600         # 100 MiB/s
  schedule_expression    = "cron(0 18 * * ? *)" # 18:00 UTC daily

  excludes = ["*/_temp/*", "*/.spark-staging/*"]

  log_level             = "TRANSFER"
  log_retention_in_days = 90

  tags = {
    Environment = "prod"
    Workload    = "dr"
    ManagedBy   = "terraform"
  }
}

# Downstream: alarm on failed files using the task_name and task_id outputs.
resource "aws_cloudwatch_metric_alarm" "datasync_failures" {
  alarm_name          = "datasync-${module.datasync.task_name}-failures"
  namespace           = "AWS/DataSync"
  metric_name         = "FilesFailed"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  period              = 3600
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching"

  dimensions = {
    TaskId = module.datasync.task_id
  }

  alarm_actions = [aws_sns_topic.ops_alerts.arn]
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}
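
The root config above only generates the backend. To also define the provider once, a generate block along these lines could sit in the same file. This is a sketch: the region is a placeholder, and the block name "provider" and file name provider.tf are arbitrary choices.

```hcl
# live/terragrunt.hcl (continued): write a shared provider.tf into every module directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1" # placeholder; set per environment as needed
}
EOF
}
```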

2. Module config — live/prod/datasync/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
}

inputs = {
  name = "..."
  source_s3_bucket_arn = "..."
  source_bucket_access_role_arn = "..."
  destination_s3_bucket_arn = "..."
  destination_bucket_access_role_arn = "..."
}

3. Deploy one environment, or roll out all modules together:

# From the repository root, either:
cd live/prod/datasync && terragrunt apply   # this module only
# or:
cd live/prod && terragrunt run-all apply    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
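
To make the dependency point concrete, here is a minimal sketch of how the DataSync module config might consume role ARNs from a sibling module. The path ../iam-datasync and its output names are hypothetical; nothing in this article defines them.

```hcl
# live/prod/datasync/terragrunt.hcl (sketch)
dependency "iam" {
  config_path = "../iam-datasync" # hypothetical sibling module

  # Placeholder values so validate/plan work before the IAM module is applied.
  mock_outputs = {
    source_role_arn      = "arn:aws:iam::111111111111:role/mock-source"
    destination_role_arn = "arn:aws:iam::111111111111:role/mock-destination"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # Merge these with the other required inputs shown in step 2.
  source_bucket_access_role_arn      = dependency.iam.outputs.source_role_arn
  destination_bucket_access_role_arn = dependency.iam.outputs.destination_role_arn
}
```

With a dependency block in place, run-all applies the IAM module before this one.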

Inputs

Name	Type	Default	Required	Description
name	string	—	yes	Task name; seeds default log group and resource policy names.
source_s3_bucket_arn	string	—	yes	ARN of the source S3 bucket.
source_subdirectory	string	`"/"`	no	Prefix within the source bucket (must start with `/`).
source_bucket_access_role_arn	string	—	yes	IAM role DataSync assumes to read the source bucket.
source_s3_storage_class	string	`"STANDARD"`	no	Storage class set on the source location (only takes effect when a location is written to as a destination).
destination_s3_bucket_arn	string	—	yes	ARN of the destination S3 bucket.
destination_subdirectory	string	`"/"`	no	Prefix within the destination bucket (must start with `/`).
destination_bucket_access_role_arn	string	—	yes	IAM role DataSync assumes to write the destination bucket.
destination_s3_storage_class	string	`"STANDARD"`	no	Storage class for written objects (validated against the allowed set).
verify_mode	string	`"ONLY_FILES_TRANSFERRED"`	no	Integrity check mode.
overwrite_mode	string	`"ALWAYS"`	no	`ALWAYS` or `NEVER` for changed destination objects.
transfer_mode	string	`"CHANGED"`	no	`CHANGED` (incremental) or `ALL`.
preserve_deleted_files	string	`"PRESERVE"`	no	`PRESERVE` or `REMOVE` (true mirror).
bytes_per_second	number	`-1`	no	Bandwidth throttle; `-1` = unlimited, else ≥ 1 MiB/s.
task_queueing	string	`"ENABLED"`	no	Queue overlapping executions instead of failing.
log_level	string	`"TRANSFER"`	no	`OFF`, `BASIC`, or `TRANSFER`.
includes	list(string)	`[]`	no	SIMPLE_PATTERN globs to include.
excludes	list(string)	`[]`	no	SIMPLE_PATTERN globs to exclude.
schedule_expression	string	`null`	no	Cron/rate expression; `null` = run on demand.
create_log_group	bool	`true`	no	Create + wire the log group and resource policy.
cloudwatch_log_group_name	string	`null`	no	Override managed log group name.
cloudwatch_log_group_arn	string	`null`	no	Existing log group ARN when `create_log_group = false`.
log_retention_in_days	number	`30`	no	Retention for the managed log group.
log_kms_key_arn	string	`null`	no	Optional KMS key for log encryption.
tags	map(string)	`{}`	no	Tags applied to all resources.
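
As an example of the logging inputs working together, this sketch attaches the task to an existing log group instead of creating one. The module label datasync_shared_logs is arbitrary and the "..." values are placeholders to fill in.

```hcl
module "datasync_shared_logs" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name                               = "..."
  source_s3_bucket_arn               = "..."
  source_bucket_access_role_arn      = "..."
  destination_s3_bucket_arn          = "..."
  destination_bucket_access_role_arn = "..."

  # Reuse a log group managed elsewhere; the module then creates neither
  # the group nor the CloudWatch Logs resource policy.
  create_log_group         = false
  cloudwatch_log_group_arn = "..." # ARN of the existing log group
  log_level                = "BASIC"
}
```

Because main.tf only creates the resource policy when create_log_group is true, the existing group must already allow datasync.amazonaws.com to create log streams and put log events.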

Outputs

Name	Description
task_arn	ARN of the DataSync task (start executions, build alarms).
task_id	ID of the DataSync task (the `TaskId` CloudWatch dimension).
task_name	Name of the DataSync task.
source_location_arn	ARN of the source S3 location.
destination_location_arn	ARN of the destination S3 location.
cloudwatch_log_group_arn	ARN of the log group receiving transfer logs.

Enterprise scenario

A financial-services firm runs its regulated data lake in eu-west-1 and must hold an immutable, region-isolated DR copy in eu-central-1 with a 24-hour RPO. They instantiate this module once per lake zone (raw, curated, published), each with preserve_deleted_files = "REMOVE" for a faithful mirror, schedule_expression = "cron(0 18 * * ? *)", and bytes_per_second capped to 100 MiB/s so nightly replication never starves their batch ETL. Because every task writes per-object TRANSFER logs to a KMS-encrypted CloudWatch group retained for 90 days, audit can prove exactly which objects moved and were verified — and the FilesFailed alarm wired to the task_id output pages the on-call team before the RPO is breached.
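
One way that scenario might be expressed is a single module block with for_each over the zones. This is a sketch under the assumption that each zone is a prefix in one source bucket and one DR bucket; the "..." values, the name pattern, and the module label are placeholders.

```hcl
locals {
  lake_zones = toset(["raw", "curated", "published"])
}

module "datasync_dr" {
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
  for_each = local.lake_zones

  name = "prod-lake-${each.key}-to-dr-nightly"

  # Assumption: zones are prefixes within the same pair of buckets.
  source_s3_bucket_arn               = "..."
  source_subdirectory                = "/${each.key}/"
  source_bucket_access_role_arn      = "..."
  destination_s3_bucket_arn          = "..."
  destination_subdirectory           = "/${each.key}/"
  destination_bucket_access_role_arn = "..."

  preserve_deleted_files = "REMOVE"             # faithful mirror
  schedule_expression    = "cron(0 18 * * ? *)" # 18:00 UTC daily
  bytes_per_second       = 104857600            # 100 MiB/s
  log_level              = "TRANSFER"
  log_retention_in_days  = 90
  log_kms_key_arn        = "..."
}
```

Each instance creates its own log group and its own CloudWatch Logs resource policy. CloudWatch Logs allows only a small number of resource policies per Region, so a larger fleet may be better served by create_log_group = false and a shared, pre-authorised log group.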

Best practices

Scope the access roles tightly, per direction. The source role needs s3:GetObject/s3:ListBucket (and kms:Decrypt for SSE-KMS source buckets); the destination role needs s3:PutObject/s3:GetObject/s3:ListBucket/s3:DeleteObject (and kms:GenerateDataKey). Both trust policies must allow datasync.amazonaws.com — never reuse one over-broad role for both ends.
Keep verify_mode = ONLY_FILES_TRANSFERRED unless you genuinely need a full-dataset check. POINT_IN_TIME_CONSISTENT verifies the entire source and destination at the end of every run, inflating cost and runtime on large buckets; reserve it for compliance snapshots, not nightly incrementals.
Throttle scheduled tasks with bytes_per_second. DataSync will saturate the link by default. Setting an explicit cap (e.g. 100 MiB/s) protects production traffic and, on cross-region copies, keeps egress charges predictable.
Choose preserve_deleted_files deliberately. REMOVE gives a true mirror (right for DR) but propagates accidental source deletions — pair it with destination bucket versioning/Object Lock. Use PRESERVE for append-only archival landing zones.
Cut cost with filters and storage class. excludes for staging/temp prefixes avoids paying to copy throwaway data, and setting destination_s3_storage_class = "STANDARD_IA" or GLACIER_INSTANT_RETRIEVAL on cold DR copies trims storage spend — just remember DataSync bills per GB transferred regardless of class.
Name and tag for fleet operability. A scheme like <env>-<source>-to-<dest>-<cadence> plus consistent tags makes tasks, their /aws/datasync/<name> log groups, and FilesTransferred/FilesFailed metrics line up across dozens of transfers when you triage at 3 a.m.
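
For the first practice, a source-side role could look roughly like the sketch below. The variable source_bucket_arn, the role name, and the policy name are hypothetical, and s3:GetBucketLocation is an addition beyond the actions listed above that DataSync S3 locations typically need.

```hcl
variable "source_bucket_arn" {
  description = "ARN of the source bucket (hypothetical input for this sketch)."
  type        = string
}

# Trust policy: only the DataSync service may assume the role.
data "aws_iam_policy_document" "datasync_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["datasync.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "datasync_source" {
  name               = "datasync-source-read" # placeholder name
  assume_role_policy = data.aws_iam_policy_document.datasync_trust.json
}

# Read-only permissions scoped to the one source bucket.
data "aws_iam_policy_document" "source_read" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetBucketLocation", "s3:ListBucket"]
    resources = [var.source_bucket_arn]
  }

  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${var.source_bucket_arn}/*"]
  }
}

resource "aws_iam_role_policy" "source_read" {
  name   = "source-read"
  role   = aws_iam_role.datasync_source.id
  policy = data.aws_iam_policy_document.source_read.json
}
```

A destination role would follow the same shape with the write and delete actions added, plus the KMS permissions when the buckets use SSE-KMS.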

Written by Vinod
