
Terraform Module: AWS DataSync — repeatable, scheduled data transfers without bespoke scripts

Quick take — Wrap aws_datasync_task in a reusable Terraform module: locations, scheduled transfers, verification, bandwidth throttling, CloudWatch logging, and filter rules for AWS DataSync migrations. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "datasync" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name                               = "..."  # Task name; seeds default log group and resource policy …
  source_s3_bucket_arn               = "..."  # ARN of the source S3 bucket.
  source_bucket_access_role_arn      = "..."  # IAM role DataSync assumes to read the source bucket.
  destination_s3_bucket_arn          = "..."  # ARN of the destination S3 bucket.
  destination_bucket_access_role_arn = "..."  # IAM role DataSync assumes to write the destination buck…
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS DataSync is a managed data-transfer service that moves files and objects between storage systems (on-premises NFS/SMB shares, self-managed object stores, and AWS storage such as S3, EFS, and FSx), handling encryption in transit, integrity verification, retries, and incremental scans for you. An aws_datasync_task is the unit of work: it binds a source location to a destination location and carries the options that govern how the copy runs (verification mode, overwrite behaviour, file metadata preservation, bandwidth throttling) plus an optional schedule and include/exclude filters.

The trouble with raw DataSync is that a working transfer is rarely one resource. You need at least two aws_datasync_location_* resources, the task itself, a CloudWatch Log Group with the right resource policy so the DataSync service can write logs, and a coherent block of task options that teams often get wrong on the first pass (people forget OverwriteMode, leave VerifyMode on the expensive POINT_IN_TIME_CONSISTENT default, or never set LogLevel). This module wraps an S3-to-S3 task (including cross-account and cross-region transfers), the most common managed-transfer shape, into one variable-driven unit so every transfer in your estate is created the same way: consistently named, logged, optionally scheduled, and with throttling and filters exposed as inputs instead of hand-edited per task.

When to use it

If you need agent-based on-prem NFS/SMB transfers, you’d extend this pattern with aws_datasync_location_nfs/_smb and an aws_datasync_agent; the task wiring here is identical, only the location resources change.
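
As a rough sketch of that extension, the source location could be swapped for an NFS one along these lines. The variable names, the agent name, and the export path are hypothetical and not part of the module as published; only the resource types and their arguments come from the AWS provider.

```hcl
# Hypothetical inputs, not part of the published module.
variable "agent_activation_key" {
  description = "Activation key obtained from the deployed DataSync agent."
  type        = string
  sensitive   = true
}

variable "nfs_server_hostname" {
  description = "DNS name or IP address of the on-premises NFS server."
  type        = string
}

# Registers the on-premises agent with DataSync.
resource "aws_datasync_agent" "onprem" {
  name           = "onprem-agent" # hypothetical name
  activation_key = var.agent_activation_key
}

# NFS export read through the agent; replaces the S3 source location.
resource "aws_datasync_location_nfs" "source" {
  server_hostname = var.nfs_server_hostname
  subdirectory    = "/exports/data" # hypothetical export path

  on_prem_config {
    agent_arns = [aws_datasync_agent.onprem.arn]
  }
}

# The task would then set
#   source_location_arn = aws_datasync_location_nfs.source.arn
# instead of pointing at aws_datasync_location_s3.source.
```

For file-system sources you would probably also want to revisit the POSIX metadata options (posix_permissions, uid, gid, atime, mtime), which this module pins to "NONE" for the S3-to-S3 case.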

Module structure

terraform-module-aws-datasync/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# main.tf

locals {
  task_name      = var.name
  log_group_name = coalesce(var.cloudwatch_log_group_name, "/aws/datasync/${var.name}")

  # DataSync only emits per-object/transfer logs when a log level is set AND a
  # log group ARN is attached. We expose the level but always wire the group.
  log_level = var.log_level
}

# ---------------------------------------------------------------------------
# CloudWatch Logs target + resource policy so the DataSync service can write
# ---------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "this" {
  count = var.create_log_group ? 1 : 0

  name              = local.log_group_name
  retention_in_days = var.log_retention_in_days
  kms_key_id        = var.log_kms_key_arn
  tags              = var.tags
}

data "aws_iam_policy_document" "log_resource_policy" {
  count = var.create_log_group ? 1 : 0

  statement {
    sid    = "DataSyncLogsToCloudWatch"
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["datasync.amazonaws.com"]
    }

    actions = [
      "logs:PutLogEvents",
      "logs:CreateLogStream",
    ]

    resources = ["${aws_cloudwatch_log_group.this[0].arn}:*"]
  }
}

resource "aws_cloudwatch_log_resource_policy" "this" {
  count = var.create_log_group ? 1 : 0

  policy_name     = "${var.name}-datasync-logs"
  policy_document = data.aws_iam_policy_document.log_resource_policy[0].json
}

# ---------------------------------------------------------------------------
# Source + destination S3 locations
# ---------------------------------------------------------------------------
resource "aws_datasync_location_s3" "source" {
  s3_bucket_arn = var.source_s3_bucket_arn
  subdirectory  = var.source_subdirectory

  s3_config {
    bucket_access_role_arn = var.source_bucket_access_role_arn
  }

  s3_storage_class = var.source_s3_storage_class
  tags             = var.tags
}

resource "aws_datasync_location_s3" "destination" {
  s3_bucket_arn = var.destination_s3_bucket_arn
  subdirectory  = var.destination_subdirectory

  s3_config {
    bucket_access_role_arn = var.destination_bucket_access_role_arn
  }

  s3_storage_class = var.destination_s3_storage_class
  tags             = var.tags
}

# ---------------------------------------------------------------------------
# The transfer task
# ---------------------------------------------------------------------------
resource "aws_datasync_task" "this" {
  name                     = local.task_name
  source_location_arn      = aws_datasync_location_s3.source.arn
  destination_location_arn = aws_datasync_location_s3.destination.arn

  cloudwatch_log_group_arn = var.create_log_group ? aws_cloudwatch_log_group.this[0].arn : var.cloudwatch_log_group_arn

  options {
    verify_mode            = var.verify_mode
    overwrite_mode         = var.overwrite_mode
    transfer_mode          = var.transfer_mode
    preserve_deleted_files = var.preserve_deleted_files
    bytes_per_second       = var.bytes_per_second
    task_queueing          = var.task_queueing
    log_level              = local.log_level
    posix_permissions      = "NONE"
    uid                    = "NONE"
    gid                    = "NONE"
    atime                  = "NONE"
    mtime                  = "NONE"
  }

  # Optional include/exclude filters. AWS DataSync joins multiple patterns in a
  # single rule with "|", so we collapse the provided list accordingly.
  dynamic "includes" {
    for_each = length(var.includes) > 0 ? [1] : []
    content {
      filter_type = "SIMPLE_PATTERN"
      value       = join("|", var.includes)
    }
  }

  dynamic "excludes" {
    for_each = length(var.excludes) > 0 ? [1] : []
    content {
      filter_type = "SIMPLE_PATTERN"
      value       = join("|", var.excludes)
    }
  }

  # Run on a cron schedule only when one is supplied.
  dynamic "schedule" {
    for_each = var.schedule_expression == null ? [] : [var.schedule_expression]
    content {
      schedule_expression = schedule.value
    }
  }

  tags = var.tags
}

# variables.tf

variable "name" {
  description = "Name for the DataSync task; also seeds the default log group and resource policy names."
  type        = string

  validation {
    condition     = can(regex("^[A-Za-z0-9_.-]{1,256}$", var.name))
    error_message = "name may contain only letters, numbers, and the characters _ . - and must be 1-256 chars."
  }
}

# --- Source location -------------------------------------------------------
variable "source_s3_bucket_arn" {
  description = "ARN of the source S3 bucket."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:s3:::", var.source_s3_bucket_arn))
    error_message = "source_s3_bucket_arn must be a valid S3 bucket ARN (arn:aws:s3:::bucket)."
  }
}

variable "source_subdirectory" {
  description = "Prefix within the source bucket to read from (must start with /)."
  type        = string
  default     = "/"

  validation {
    condition     = startswith(var.source_subdirectory, "/")
    error_message = "source_subdirectory must start with a forward slash."
  }
}

variable "source_bucket_access_role_arn" {
  description = "IAM role ARN DataSync assumes to read the source bucket (must trust datasync.amazonaws.com)."
  type        = string
}

variable "source_s3_storage_class" {
  description = "Storage class recorded on the source location. DataSync applies a location's storage class only when writing to it, so the default is normally fine for a source."
  type        = string
  default     = "STANDARD"
}

# --- Destination location --------------------------------------------------
variable "destination_s3_bucket_arn" {
  description = "ARN of the destination S3 bucket."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:s3:::", var.destination_s3_bucket_arn))
    error_message = "destination_s3_bucket_arn must be a valid S3 bucket ARN (arn:aws:s3:::bucket)."
  }
}

variable "destination_subdirectory" {
  description = "Prefix within the destination bucket to write to (must start with /)."
  type        = string
  default     = "/"

  validation {
    condition     = startswith(var.destination_subdirectory, "/")
    error_message = "destination_subdirectory must start with a forward slash."
  }
}

variable "destination_bucket_access_role_arn" {
  description = "IAM role ARN DataSync assumes to write the destination bucket (must trust datasync.amazonaws.com)."
  type        = string
}

variable "destination_s3_storage_class" {
  description = "Storage class for objects written to the destination bucket."
  type        = string
  default     = "STANDARD"

  validation {
    condition = contains([
      "STANDARD", "STANDARD_IA", "ONEZONE_IA", "INTELLIGENT_TIERING",
      "GLACIER", "GLACIER_INSTANT_RETRIEVAL", "DEEP_ARCHIVE", "OUTPOSTS"
    ], var.destination_s3_storage_class)
    error_message = "destination_s3_storage_class is not a valid S3 DataSync storage class."
  }
}

# --- Task options ----------------------------------------------------------
variable "verify_mode" {
  description = "Integrity check: ONLY_FILES_TRANSFERRED (cheap, recommended), POINT_IN_TIME_CONSISTENT (full scan), or NONE."
  type        = string
  default     = "ONLY_FILES_TRANSFERRED"

  validation {
    condition     = contains(["ONLY_FILES_TRANSFERRED", "POINT_IN_TIME_CONSISTENT", "NONE"], var.verify_mode)
    error_message = "verify_mode must be ONLY_FILES_TRANSFERRED, POINT_IN_TIME_CONSISTENT, or NONE."
  }
}

variable "overwrite_mode" {
  description = "ALWAYS overwrites changed destination objects; NEVER skips objects that already exist."
  type        = string
  default     = "ALWAYS"

  validation {
    condition     = contains(["ALWAYS", "NEVER"], var.overwrite_mode)
    error_message = "overwrite_mode must be ALWAYS or NEVER."
  }
}

variable "transfer_mode" {
  description = "CHANGED copies only files that differ (incremental); ALL copies every file each run."
  type        = string
  default     = "CHANGED"

  validation {
    condition     = contains(["CHANGED", "ALL"], var.transfer_mode)
    error_message = "transfer_mode must be CHANGED or ALL."
  }
}

variable "preserve_deleted_files" {
  description = "PRESERVE keeps files in the destination that no longer exist in the source; REMOVE deletes them (true mirror)."
  type        = string
  default     = "PRESERVE"

  validation {
    condition     = contains(["PRESERVE", "REMOVE"], var.preserve_deleted_files)
    error_message = "preserve_deleted_files must be PRESERVE or REMOVE."
  }
}

variable "bytes_per_second" {
  description = "Bandwidth throttle in bytes/sec. -1 means unlimited (use the full link)."
  type        = number
  default     = -1

  validation {
    condition     = var.bytes_per_second == -1 || var.bytes_per_second >= 1048576
    error_message = "bytes_per_second must be -1 (unlimited) or at least 1048576 (1 MiB/s)."
  }
}

variable "task_queueing" {
  description = "ENABLED queues a new execution if one is already running instead of failing."
  type        = string
  default     = "ENABLED"

  validation {
    condition     = contains(["ENABLED", "DISABLED"], var.task_queueing)
    error_message = "task_queueing must be ENABLED or DISABLED."
  }
}

variable "log_level" {
  description = "DataSync log verbosity to CloudWatch: OFF, BASIC, or TRANSFER (per-object)."
  type        = string
  default     = "TRANSFER"

  validation {
    condition     = contains(["OFF", "BASIC", "TRANSFER"], var.log_level)
    error_message = "log_level must be OFF, BASIC, or TRANSFER."
  }
}

# --- Filters & schedule ----------------------------------------------------
variable "includes" {
  description = "List of SIMPLE_PATTERN globs to include (e.g. [\"/reports/*\", \"/2026/*\"]). Empty = include everything."
  type        = list(string)
  default     = []
}

variable "excludes" {
  description = "List of SIMPLE_PATTERN globs to exclude (e.g. [\"*/.snapshot/*\", \"*/_temp/*\"]). Empty = exclude nothing."
  type        = list(string)
  default     = []
}

variable "schedule_expression" {
  description = "Cron/rate expression to run the task automatically (e.g. \"cron(0 2 * * ? *)\"). null = no schedule (run on demand)."
  type        = string
  default     = null
}

# --- Logging plumbing ------------------------------------------------------
variable "create_log_group" {
  description = "Create and wire a CloudWatch Log Group (with the required resource policy). Set false to reuse an existing one via cloudwatch_log_group_arn."
  type        = bool
  default     = true
}

variable "cloudwatch_log_group_name" {
  description = "Override the managed log group name. Defaults to /aws/datasync/<name>."
  type        = string
  default     = null
}

variable "cloudwatch_log_group_arn" {
  description = "ARN of an existing log group to attach when create_log_group = false."
  type        = string
  default     = null
}

variable "log_retention_in_days" {
  description = "Retention for the managed log group."
  type        = number
  default     = 30
}

variable "log_kms_key_arn" {
  description = "Optional KMS key ARN to encrypt the managed log group."
  type        = string
  default     = null
}

variable "tags" {
  description = "Tags applied to all created resources."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "task_arn" {
  description = "ARN of the DataSync task (use to start executions or build alarms)."
  value       = aws_datasync_task.this.arn
}

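output "task_id" {
  description = "ID of the DataSync task (task-...), as used by the TaskId CloudWatch dimension."
  # aws_datasync_task.this.id is the full task ARN, so take the final ARN segment.
  value       = element(split("/", aws_datasync_task.this.arn), 1)
}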

output "task_name" {
  description = "Name of the DataSync task."
  value       = aws_datasync_task.this.name
}

output "source_location_arn" {
  description = "ARN of the source S3 location."
  value       = aws_datasync_location_s3.source.arn
}

output "destination_location_arn" {
  description = "ARN of the destination S3 location."
  value       = aws_datasync_location_s3.destination.arn
}

output "cloudwatch_log_group_arn" {
  description = "ARN of the CloudWatch Log Group receiving transfer logs."
  value       = var.create_log_group ? aws_cloudwatch_log_group.this[0].arn : var.cloudwatch_log_group_arn
}
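
To make the filter handling in main.tf concrete, here is a small standalone sketch of what the join produces for a two-pattern list. The local and output names are made up for illustration; the join expression mirrors the one in the includes/excludes blocks.

```hcl
locals {
  # Example patterns, the same shape as the excludes input expects.
  example_excludes = ["*/_temp/*", "*/.spark-staging/*"]

  # Mirrors join("|", var.excludes) in main.tf: one filter rule whose
  # value carries every pattern, separated by "|".
  example_filter_value = join("|", local.example_excludes)
}

output "example_filter_value" {
  value = local.example_filter_value
  # Evaluates to "*/_temp/*|*/.spark-staging/*"
}
```

Because the patterns are joined with "|", an individual pattern should not contain that character itself.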

How to use it

module "datasync" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name = "prod-lake-to-dr-nightly"

  # Source: production data lake (this account)
  source_s3_bucket_arn          = aws_s3_bucket.prod_lake.arn
  source_subdirectory           = "/curated/"
  source_bucket_access_role_arn = aws_iam_role.datasync_source.arn

  # Destination: DR bucket in another region/account
  destination_s3_bucket_arn          = "arn:aws:s3:::acme-dr-lake-apse2"
  destination_subdirectory           = "/curated/"
  destination_bucket_access_role_arn = aws_iam_role.datasync_destination.arn
  destination_s3_storage_class       = "STANDARD_IA"

  # Incremental nightly mirror, throttled to 100 MiB/s during the window
  transfer_mode          = "CHANGED"
  verify_mode            = "ONLY_FILES_TRANSFERRED"
  overwrite_mode         = "ALWAYS"
  preserve_deleted_files = "REMOVE"          # true mirror for DR
  bytes_per_second       = 104857600         # 100 MiB/s
  schedule_expression    = "cron(0 18 * * ? *)" # 18:00 UTC daily

  excludes = ["*/_temp/*", "*/.spark-staging/*"]

  log_level             = "TRANSFER"
  log_retention_in_days = 90

  tags = {
    Environment = "prod"
    Workload    = "dr"
    ManagedBy   = "terraform"
  }
}

# Downstream: alarm on failed files using the task_name and task_id outputs.
resource "aws_cloudwatch_metric_alarm" "datasync_failures" {
  alarm_name          = "datasync-${module.datasync.task_name}-failures"
  namespace           = "AWS/DataSync"
  metric_name         = "FilesFailed"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  period              = 3600
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching"

  dimensions = {
    TaskId = module.datasync.task_id
  }

  alarm_actions = [aws_sns_topic.ops_alerts.arn]
}
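
The example above references aws_iam_role.datasync_source without showing it. A minimal sketch of what that role might look like follows, assuming aws_s3_bucket.prod_lake is defined elsewhere as in the example. The role and policy names are hypothetical, and the action list is an assumed read-only minimum rather than an authoritative one: versioned or KMS-encrypted buckets need additional permissions, so check it against the DataSync documentation before relying on it.

```hcl
# Trust policy: lets the DataSync service assume the role.
data "aws_iam_policy_document" "datasync_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["datasync.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "datasync_source" {
  name               = "datasync-source-read" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.datasync_assume.json
}

# Assumed read-only permissions on the source bucket.
data "aws_iam_policy_document" "datasync_source_read" {
  statement {
    sid       = "ListSourceBucket"
    effect    = "Allow"
    actions   = ["s3:GetBucketLocation", "s3:ListBucket"]
    resources = [aws_s3_bucket.prod_lake.arn]
  }

  statement {
    sid       = "ReadSourceObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:GetObjectTagging"]
    resources = ["${aws_s3_bucket.prod_lake.arn}/*"]
  }
}

resource "aws_iam_role_policy" "datasync_source_read" {
  name   = "datasync-source-read" # hypothetical name
  role   = aws_iam_role.datasync_source.id
  policy = data.aws_iam_policy_document.datasync_source_read.json
}
```

The destination role follows the same shape with write actions (for example s3:PutObject and s3:DeleteObject) on the DR bucket. For a cross-account destination, that bucket's own policy also has to allow the role.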

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config: live/prod/datasync/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
}

inputs = {
  name = "..."
  source_s3_bucket_arn = "..."
  source_bucket_access_role_arn = "..."
  destination_s3_bucket_arn = "..."
  destination_bucket_access_role_arn = "..."
}

3. Deploy this one module, or roll out every module in the environment together:

# Run either command from the repository root.
cd live/prod/datasync && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
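
To illustrate the per-environment override, a dev counterpart to the prod config might look like the sketch below. The live/dev path and the specific override values are assumptions chosen for illustration; the input names are the module's own.

```hcl
# live/dev/datasync/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"
}

inputs = {
  name                               = "..."
  source_s3_bucket_arn               = "..."
  source_bucket_access_role_arn      = "..."
  destination_s3_bucket_arn          = "..."
  destination_bucket_access_role_arn = "..."

  # Dev-only overrides. schedule_expression is left unset, so the task
  # keeps the module default (null) and runs on demand only.
  log_level             = "BASIC"
  log_retention_in_days = 7
  bytes_per_second      = 10485760 # 10 MiB/s
}
```

Only the inputs block differs between environments; the module source and version pin stay identical.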

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | | yes | Task name; seeds default log group and resource policy names. |
| source_s3_bucket_arn | string | | yes | ARN of the source S3 bucket. |
| source_subdirectory | string | "/" | no | Prefix within the source bucket (must start with /). |
| source_bucket_access_role_arn | string | | yes | IAM role DataSync assumes to read the source bucket. |
| source_s3_storage_class | string | "STANDARD" | no | Storage class set on the source location. |
| destination_s3_bucket_arn | string | | yes | ARN of the destination S3 bucket. |
| destination_subdirectory | string | "/" | no | Prefix within the destination bucket (must start with /). |
| destination_bucket_access_role_arn | string | | yes | IAM role DataSync assumes to write the destination bucket. |
| destination_s3_storage_class | string | "STANDARD" | no | Storage class for written objects (validated against the allowed set). |
| verify_mode | string | "ONLY_FILES_TRANSFERRED" | no | Integrity check mode. |
| overwrite_mode | string | "ALWAYS" | no | ALWAYS or NEVER for changed destination objects. |
| transfer_mode | string | "CHANGED" | no | CHANGED (incremental) or ALL. |
| preserve_deleted_files | string | "PRESERVE" | no | PRESERVE or REMOVE (true mirror). |
| bytes_per_second | number | -1 | no | Bandwidth throttle; -1 = unlimited, else ≥ 1 MiB/s. |
| task_queueing | string | "ENABLED" | no | Queue overlapping executions instead of failing. |
| log_level | string | "TRANSFER" | no | OFF, BASIC, or TRANSFER. |
| includes | list(string) | [] | no | SIMPLE_PATTERN globs to include. |
| excludes | list(string) | [] | no | SIMPLE_PATTERN globs to exclude. |
| schedule_expression | string | null | no | Cron/rate expression; null = run on demand. |
| create_log_group | bool | true | no | Create + wire the log group and resource policy. |
| cloudwatch_log_group_name | string | null | no | Override managed log group name. |
| cloudwatch_log_group_arn | string | null | no | Existing log group ARN when create_log_group = false. |
| log_retention_in_days | number | 30 | no | Retention for the managed log group. |
| log_kms_key_arn | string | null | no | Optional KMS key for log encryption. |
| tags | map(string) | {} | no | Tags applied to all resources. |
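
The create_log_group and cloudwatch_log_group_arn inputs work as a pair. A sketch of reusing an existing log group is shown below; the log group name and the module label are hypothetical, and the group is assumed to exist already with a resource policy that lets datasync.amazonaws.com write to it, since the module creates that policy only when it also creates the group.

```hcl
# Looks up a pre-existing, shared log group (hypothetical name).
data "aws_cloudwatch_log_group" "shared" {
  name = "/aws/datasync/shared"
}

module "datasync_shared_logs" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name                               = "..."
  source_s3_bucket_arn               = "..."
  source_bucket_access_role_arn      = "..."
  destination_s3_bucket_arn          = "..."
  destination_bucket_access_role_arn = "..."

  # Skip the managed log group and resource policy; attach the shared one.
  create_log_group         = false
  cloudwatch_log_group_arn = data.aws_cloudwatch_log_group.shared.arn
}
```

Sharing one group across tasks is worth considering when you run many transfers, because CloudWatch Logs limits how many resource policies an account can hold per Region.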

Outputs

| Name | Description |
| --- | --- |
| task_arn | ARN of the DataSync task (start executions, build alarms). |
| task_id | ID of the DataSync task (the TaskId CloudWatch dimension). |
| task_name | Name of the DataSync task. |
| source_location_arn | ARN of the source S3 location. |
| destination_location_arn | ARN of the destination S3 location. |
| cloudwatch_log_group_arn | ARN of the log group receiving transfer logs. |
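
As one more use of these outputs, task_arn can scope an EventBridge rule to a single task's executions. The sketch below assumes the module "datasync" call and the aws_sns_topic.ops_alerts topic from the usage example. The event source, detail-type string, and State value are written as I understand the DataSync event format, so verify them against the AWS documentation before depending on this.

```hcl
# Fires when an execution of this task ends in the ERROR state.
resource "aws_cloudwatch_event_rule" "datasync_error" {
  name        = "datasync-${module.datasync.task_name}-error"
  description = "DataSync task execution finished with ERROR."

  event_pattern = jsonencode({
    source        = ["aws.datasync"]
    "detail-type" = ["DataSync Task Execution State Change"]
    # Execution ARNs extend the task ARN, so a prefix match limits the
    # rule to executions of this one task.
    resources = [{ prefix = "${module.datasync.task_arn}/execution/" }]
    detail    = { State = ["ERROR"] }
  })
}

# Sends matching events to the ops topic.
resource "aws_cloudwatch_event_target" "datasync_error_sns" {
  rule = aws_cloudwatch_event_rule.datasync_error.name
  arn  = aws_sns_topic.ops_alerts.arn
}
```

The SNS topic policy also has to allow events.amazonaws.com to publish, which is not shown here.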

Enterprise scenario

A financial-services firm runs its regulated data lake in eu-west-1 and must hold an immutable, region-isolated DR copy in eu-central-1 with a 24-hour RPO. They instantiate this module once per lake zone (raw, curated, published), each with preserve_deleted_files = "REMOVE" for a faithful mirror, schedule_expression = "cron(0 18 * * ? *)", and bytes_per_second capped to 100 MiB/s so nightly replication never starves their batch ETL. Because every task writes per-object TRANSFER logs to a KMS-encrypted CloudWatch group retained for 90 days, audit can prove exactly which objects moved and were verified — and the FilesFailed alarm wired to the task_id output pages the on-call team before the RPO is breached.
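
One way to express "once per lake zone" is a for_each over the module call. The sketch below is illustrative only: it assumes the three zones are prefixes inside a single source bucket and a single DR bucket, and every variable name in it is hypothetical.

```hcl
# Hypothetical inputs describing the lake and its DR copy.
variable "lake_bucket_arn" { type = string }
variable "dr_bucket_arn" { type = string }
variable "source_role_arn" { type = string }
variable "destination_role_arn" { type = string }
variable "log_kms_key_arn" { type = string }

module "datasync_zone" {
  # One module instance, and therefore one task, per zone.
  for_each = toset(["raw", "curated", "published"])

  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-datasync?ref=v1.0.0"

  name = "lake-${each.key}-to-dr"

  source_s3_bucket_arn          = var.lake_bucket_arn
  source_subdirectory           = "/${each.key}/"
  source_bucket_access_role_arn = var.source_role_arn

  destination_s3_bucket_arn          = var.dr_bucket_arn
  destination_subdirectory           = "/${each.key}/"
  destination_bucket_access_role_arn = var.destination_role_arn

  preserve_deleted_files = "REMOVE"             # faithful mirror
  schedule_expression    = "cron(0 18 * * ? *)" # 18:00 UTC daily
  bytes_per_second       = 104857600            # 100 MiB/s

  log_level             = "TRANSFER"
  log_retention_in_days = 90
  log_kms_key_arn       = var.log_kms_key_arn
}
```

Each instance is then addressable by zone, for example module.datasync_zone["curated"].task_arn, which keeps per-zone alarms straightforward to wire.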

Best practices
