
Terraform Module: AWS SageMaker — a governed, VPC-only ML Studio domain in one block

Quick take — Provision a locked-down Amazon SageMaker Studio domain with Terraform: VPC-only networking, KMS-encrypted EFS, IAM execution roles, and a default user profile — reusable across teams and environments. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "sagemaker" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-sagemaker?ref=v1.0.0"

  name_prefix = "..."           # Prefix for all resource names; lowercase, 2-31 chars, s…
  environment = "..."           # One of `dev`, `staging`, `prod`; used in naming and tag…
  vpc_id      = "..."           # VPC for the domain's ENIs.
  subnet_ids  = ["...", "..."]  # Private subnet IDs (multi-AZ recommended).
  kms_key_id  = "..."           # Customer-managed KMS key ARN encrypting the home EFS.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
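To see what the Quickstart produced, it can help to re-export a few of the module's outputs from the root configuration. The sketch below assumes the module block is named sagemaker as above; the root-level output names (studio_domain_id and so on) are illustrative and not part of the module.

```hcl
# Root-level outputs that surface values from the module.
# Output names here are hypothetical; the module.sagemaker.* attributes
# are the ones the module declares in outputs.tf.
output "studio_domain_id" {
  description = "ID of the Studio domain created by the module."
  value       = module.sagemaker.domain_id
}

output "studio_domain_url" {
  description = "Login URL for Studio."
  value       = module.sagemaker.domain_url
}

output "studio_execution_role_arn" {
  description = "Execution role ARN, for reuse in jobs and pipelines."
  value       = module.sagemaker.execution_role_arn
}
```

After apply, terraform output studio_domain_url prints the value.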

What this module is

Amazon SageMaker Studio is AWS’s managed, web-based IDE for the full machine-learning lifecycle — data prep, notebooks, training jobs, experiments, and model deployment. The unit you actually provision is a SageMaker domain (aws_sagemaker_domain): a long-lived, account- and region-scoped construct that owns the shared Amazon EFS file system backing every notebook, the networking mode, the default execution role, and the default settings every user profile inherits.

Creating a domain by hand is deceptively involved. You have to pick VpcOnly vs PublicInternetOnly, attach the right subnets and security groups, wire a KMS key onto the EFS volume, choose the auth mode (IAM vs IAM Identity Center), and supply a JupyterServer/KernelGateway app configuration with sane default instance types. Some of these choices (the auth mode and the KMS key, for example) cannot be changed after creation, so a wrong choice on day one forces a destroy-and-recreate that can orphan EFS data.

This module wraps aws_sagemaker_domain together with the sub-resources teams almost always need in production — a scoped IAM execution role, a VpcOnly network posture with a dedicated security group, KMS encryption of the home EFS, and a default aws_sagemaker_user_profile — behind a small, validated variable surface. You get a repeatable, policy-compliant ML workspace per environment instead of a hand-clicked one-off in the console.

When to use it

Module structure

terraform-module-aws-sagemaker/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # IAM role, security group, domain, default user profile
├── variables.tf     # validated input surface
└── outputs.tf       # ids, ARNs, EFS id, domain URL

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.name_prefix}-${var.environment}"

  tags = merge(
    {
      Module      = "terraform-module-aws-sagemaker"
      Environment = var.environment
      ManagedBy   = "Terraform"
    },
    var.tags,
  )
}

# ----------------------------------------------------------------------------
# Execution role assumed by SageMaker Studio apps, training & processing jobs
# ----------------------------------------------------------------------------
data "aws_iam_policy_document" "assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "execution" {
  name                 = "${local.name}-sagemaker-exec"
  assume_role_policy   = data.aws_iam_policy_document.assume.json
  permissions_boundary = var.permissions_boundary_arn
  tags                 = local.tags
}

# Broad managed policy is convenient for getting started; scope it down in prod
# by setting attach_full_access = false and supplying managed_policy_arns.
resource "aws_iam_role_policy_attachment" "full_access" {
  count      = var.attach_full_access ? 1 : 0
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

resource "aws_iam_role_policy_attachment" "extra" {
  for_each   = toset(var.managed_policy_arns)
  role       = aws_iam_role.execution.name
  policy_arn = each.value
}

# ----------------------------------------------------------------------------
# Dedicated security group for the domain's ENIs (VpcOnly mode)
# ----------------------------------------------------------------------------
resource "aws_security_group" "domain" {
  name        = "${local.name}-sagemaker-sg"
  description = "SageMaker Studio domain ENIs (${local.name})"
  vpc_id      = var.vpc_id
  tags        = local.tags
}

# NFS within the SG so JupyterServer/KernelGateway apps can reach the home EFS.
resource "aws_vpc_security_group_ingress_rule" "efs_self" {
  security_group_id            = aws_security_group.domain.id
  description                  = "EFS (NFS) between Studio apps"
  from_port                    = 2049
  to_port                      = 2049
  ip_protocol                  = "tcp"
  referenced_security_group_id = aws_security_group.domain.id
}

# Egress only to var.egress_cidr_blocks (default 10.0.0.0/8). Pass the VPC CIDR
# to restrict it further; VpcOnly already routes traffic via VPC endpoints/NAT.
resource "aws_vpc_security_group_egress_rule" "vpc_egress" {
  for_each = toset(var.egress_cidr_blocks)

  security_group_id = aws_security_group.domain.id
  description       = "Egress to approved CIDR"
  ip_protocol       = "-1"
  cidr_ipv4         = each.value
}

# ----------------------------------------------------------------------------
# The SageMaker Studio domain
# ----------------------------------------------------------------------------
resource "aws_sagemaker_domain" "this" {
  domain_name             = local.name
  auth_mode               = var.auth_mode
  vpc_id                  = var.vpc_id
  subnet_ids              = var.subnet_ids
  app_network_access_type = "VpcOnly"
  kms_key_id              = var.kms_key_id

  default_user_settings {
    execution_role  = aws_iam_role.execution.arn
    security_groups = concat([aws_security_group.domain.id], var.extra_security_group_ids)

    jupyter_server_app_settings {
      default_resource_spec {
        instance_type = var.jupyter_instance_type
      }
    }

    kernel_gateway_app_settings {
      default_resource_spec {
        instance_type = var.kernel_gateway_instance_type
      }
    }
  }

  retention_policy {
    home_efs_file_system = var.home_efs_retention
  }

  tags = local.tags
}

# ----------------------------------------------------------------------------
# Optional default user profile so the domain is usable immediately
# ----------------------------------------------------------------------------
resource "aws_sagemaker_user_profile" "default" {
  count             = var.create_default_user_profile ? 1 : 0
  domain_id         = aws_sagemaker_domain.this.id
  user_profile_name = var.default_user_profile_name

  user_settings {
    execution_role = aws_iam_role.execution.arn
  }

  tags = local.tags
}

variables.tf

variable "name_prefix" {
  description = "Prefix for all resource names (e.g. team or product)."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,30}$", var.name_prefix))
    error_message = "name_prefix must be lowercase alphanumeric/hyphens, 2-31 chars, starting with a letter."
  }
}

variable "environment" {
  description = "Deployment environment, used in naming and tags."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

variable "vpc_id" {
  description = "VPC in which the Studio domain ENIs are placed."
  type        = string
}

variable "subnet_ids" {
  description = "Private subnet IDs for the domain (multi-AZ recommended)."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) >= 1
    error_message = "Provide at least one subnet_id."
  }
}

variable "kms_key_id" {
  description = "ARN of a customer-managed KMS key encrypting the home EFS volume."
  type        = string

  validation {
    condition     = can(regex("^arn:aws[a-z-]*:kms:", var.kms_key_id))
    error_message = "kms_key_id must be a KMS key ARN (arn:aws:kms:...)."
  }
}

variable "auth_mode" {
  description = "Domain auth mode: IAM or SSO (IAM Identity Center)."
  type        = string
  default     = "IAM"

  validation {
    condition     = contains(["IAM", "SSO"], var.auth_mode)
    error_message = "auth_mode must be IAM or SSO."
  }
}

variable "jupyter_instance_type" {
  description = "Default instance type for the JupyterServer app."
  type        = string
  default     = "system"
}

variable "kernel_gateway_instance_type" {
  description = "Default instance type for KernelGateway notebook kernels."
  type        = string
  default     = "ml.t3.medium"

  validation {
    condition     = startswith(var.kernel_gateway_instance_type, "ml.")
    error_message = "kernel_gateway_instance_type must be a SageMaker ml.* instance type."
  }
}

variable "home_efs_retention" {
  description = "What to do with the home EFS on domain delete: Retain or Delete."
  type        = string
  default     = "Retain"

  validation {
    condition     = contains(["Retain", "Delete"], var.home_efs_retention)
    error_message = "home_efs_retention must be Retain or Delete."
  }
}

variable "attach_full_access" {
  description = "Attach AmazonSageMakerFullAccess to the execution role. Disable and scope down for prod."
  type        = bool
  default     = true
}

variable "managed_policy_arns" {
  description = "Additional managed policy ARNs to attach to the execution role."
  type        = list(string)
  default     = []
}

variable "permissions_boundary_arn" {
  description = "Optional IAM permissions boundary for the execution role."
  type        = string
  default     = null
}

variable "extra_security_group_ids" {
  description = "Additional security groups to attach to Studio apps."
  type        = list(string)
  default     = []
}

variable "egress_cidr_blocks" {
  description = "CIDR blocks the domain SG may egress to (e.g. VPC CIDR for endpoints/NAT)."
  type        = list(string)
  default     = ["10.0.0.0/8"]
}

variable "create_default_user_profile" {
  description = "Create a starter user profile so the domain is usable immediately."
  type        = bool
  default     = true
}

variable "default_user_profile_name" {
  description = "Name of the default user profile."
  type        = string
  default     = "default-user"
}

variable "tags" {
  description = "Extra tags merged onto all resources."
  type        = map(string)
  default     = {}
}

outputs.tf

output "domain_id" {
  description = "The SageMaker Studio domain ID."
  value       = aws_sagemaker_domain.this.id
}

output "domain_arn" {
  description = "ARN of the SageMaker domain."
  value       = aws_sagemaker_domain.this.arn
}

output "domain_url" {
  description = "Single-sign-on / IAM login URL for Studio."
  value       = aws_sagemaker_domain.this.url
}

output "home_efs_file_system_id" {
  description = "EFS file system backing the domain's user storage."
  value       = aws_sagemaker_domain.this.home_efs_file_system_id
}

output "execution_role_arn" {
  description = "ARN of the SageMaker execution role (reuse for jobs/pipelines)."
  value       = aws_iam_role.execution.arn
}

output "security_group_id" {
  description = "Security group attached to the domain's ENIs."
  value       = aws_security_group.domain.id
}

output "default_user_profile_name" {
  description = "Name of the default user profile, if created."
  value       = try(aws_sagemaker_user_profile.default[0].user_profile_name, null)
}
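As a rough illustration of how these outputs can be consumed, the sketch below adds further user profiles outside the module. The profile names are made up, and it assumes auth_mode = "IAM"; with SSO, each profile would also need the single sign-on identifier arguments.

```hcl
# Hypothetical extra Studio users, created next to the module call.
locals {
  studio_users = toset(["analyst-a", "analyst-b"]) # illustrative names
}

resource "aws_sagemaker_user_profile" "team" {
  for_each = local.studio_users

  # Both values come from the module's outputs.
  domain_id         = module.sagemaker.domain_id
  user_profile_name = each.key

  user_settings {
    execution_role = module.sagemaker.execution_role_arn
  }
}
```

Using for_each over a set keeps each profile addressable by name, so removing one user from the set destroys only that profile.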

How to use it

module "sagemaker" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-sagemaker?ref=v1.0.0"

  name_prefix = "ml-platform"
  environment = "prod"

  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.private_subnet_ids
  kms_key_id = aws_kms_key.sagemaker.arn

  auth_mode                    = "SSO"
  kernel_gateway_instance_type = "ml.m5.xlarge"
  home_efs_retention           = "Retain"

  # Lock IAM down in prod: drop the broad managed policy, attach scoped ones.
  attach_full_access  = false
  managed_policy_arns = [aws_iam_policy.sagemaker_scoped.arn]

  # Only the VPC CIDR is reachable; egress flows via interface endpoints.
  egress_cidr_blocks = [module.network.vpc_cidr]

  tags = {
    CostCenter = "data-science"
    Owner      = "ml-platform-team"
  }
}

# Downstream: grant the module's execution role read access to the
# curated-data bucket (the same role can be reused by training pipelines).
resource "aws_s3_bucket_policy" "training_data_access" {
  bucket = aws_s3_bucket.curated.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "SageMakerExecRead"
      Effect    = "Allow"
      Principal = { AWS = module.sagemaker.execution_role_arn }
      Action    = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        aws_s3_bucket.curated.arn,
        "${aws_s3_bucket.curated.arn}/*",
      ]
    }]
  })
}
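The example above references aws_kms_key.sagemaker and aws_iam_policy.sagemaker_scoped without defining them. A minimal sketch of what they might look like follows; the S3 prefix, policy name, and statement contents are assumptions for illustration. A real scoped policy would almost certainly need more (SageMaker, ECR, logging, and KMS permissions, depending on the workload).

```hcl
# Customer-managed key for the home EFS (settings are illustrative).
resource "aws_kms_key" "sagemaker" {
  description         = "CMK for the SageMaker Studio home EFS"
  enable_key_rotation = true
}

# Hypothetical scoped policy: access limited to one prefix of the
# curated bucket. "ml-platform/" is an assumed prefix.
data "aws_iam_policy_document" "sagemaker_scoped" {
  statement {
    sid       = "ListCuratedPrefix"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.curated.arn]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["ml-platform/*"]
    }
  }

  statement {
    sid       = "ReadWriteCuratedPrefix"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.curated.arn}/ml-platform/*"]
  }
}

resource "aws_iam_policy" "sagemaker_scoped" {
  name   = "ml-platform-prod-sagemaker-scoped" # assumed name
  policy = data.aws_iam_policy_document.sagemaker_scoped.json
}
```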

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config (live/terragrunt.hcl), inherited by every module:

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}
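The root config shown above covers only the backend. If the provider is also defined once at the root, a Terragrunt generate block is one way to do it. This is a sketch, not the author's configuration; the region simply mirrors the Quickstart.

```hcl
# Would live in live/terragrunt.hcl next to remote_state.
# Writes provider.tf into each module's working directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}
```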

2. Module config (live/prod/sagemaker/terragrunt.hcl):

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-sagemaker?ref=v1.0.0"
}

inputs = {
  name_prefix = "..."
  environment = "..."
  vpc_id = "..."
  subnet_ids = ["...", "..."]
  kms_key_id = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/sagemaker && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)     # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
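For run-all to order modules correctly, Terragrunt needs to know which module feeds which. One common way is a dependency block. The sketch below assumes a sibling live/prod/network module that exposes vpc_id and private_subnet_ids outputs; that path and those output names are assumptions, not something this module defines.

```hcl
# In live/prod/sagemaker/terragrunt.hcl
dependency "network" {
  config_path = "../network" # hypothetical sibling module

  # Placeholder values so validate/plan work before the network exists.
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # These replace the "..." placeholders for the two network inputs;
  # the remaining inputs stay as in the module config above.
  vpc_id     = dependency.network.outputs.vpc_id
  subnet_ids = dependency.network.outputs.private_subnet_ids
}
```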

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | (none) | Yes | Prefix for all resource names; lowercase, 2-31 chars, starts with a letter. |
| environment | string | (none) | Yes | One of dev, staging, prod; used in naming and tags. |
| vpc_id | string | (none) | Yes | VPC for the domain’s ENIs. |
| subnet_ids | list(string) | (none) | Yes | Private subnet IDs (multi-AZ recommended). |
| kms_key_id | string | (none) | Yes | Customer-managed KMS key ARN encrypting the home EFS. |
| auth_mode | string | "IAM" | No | IAM or SSO (IAM Identity Center). |
| jupyter_instance_type | string | "system" | No | Default instance type for the JupyterServer app. |
| kernel_gateway_instance_type | string | "ml.t3.medium" | No | Default kernel instance type; must be an ml.* type. |
| home_efs_retention | string | "Retain" | No | Retain or Delete the home EFS on domain deletion. |
| attach_full_access | bool | true | No | Attach AmazonSageMakerFullAccess; disable to scope IAM in prod. |
| managed_policy_arns | list(string) | [] | No | Extra managed policy ARNs for the execution role. |
| permissions_boundary_arn | string | null | No | Optional permissions boundary for the execution role. |
| extra_security_group_ids | list(string) | [] | No | Additional security groups attached to Studio apps. |
| egress_cidr_blocks | list(string) | ["10.0.0.0/8"] | No | CIDRs the domain SG may egress to. |
| create_default_user_profile | bool | true | No | Create a starter user profile. |
| default_user_profile_name | string | "default-user" | No | Name of the default user profile. |
| tags | map(string) | {} | No | Extra tags merged onto all resources. |

Outputs

| Name | Description |
|---|---|
| domain_id | The SageMaker Studio domain ID. |
| domain_arn | ARN of the domain. |
| domain_url | SSO / IAM login URL for Studio. |
| home_efs_file_system_id | EFS file system backing user storage. |
| execution_role_arn | Execution role ARN, reusable for jobs and pipelines. |
| security_group_id | Security group attached to the domain’s ENIs. |
| default_user_profile_name | Name of the default user profile, or null if not created. |

Enterprise scenario

A regulated fintech runs a central ML platform team that vends a SageMaker Studio domain to each product squad. They call this module once per squad-environment from a landing-zone repo, passing the squad’s private subnets and a per-squad KMS key, with attach_full_access = false and a scoped policy that only permits the squad’s S3 prefix and ECR repos. Because app_network_access_type is hard-pinned to VpcOnly and egress is limited to the VPC CIDR, notebooks reach S3, ECR, and CloudWatch exclusively through interface VPC endpoints — satisfying the auditor’s “no direct internet from data-science workloads” control while keeping every domain identical and reproducible in CI.
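The scenario depends on interface VPC endpoints that this module does not create. A minimal sketch of how they might be provisioned alongside it is shown below. The service list, resource names, and module.network outputs are assumptions. S3 is left out because its interface endpoint has additional private-DNS requirements, and interface endpoints accept at most one subnet per Availability Zone, so the subnet list may need filtering.

```hcl
data "aws_region" "current" {}

locals {
  # Illustrative set of services; adjust to what the notebooks call.
  studio_endpoint_services = toset([
    "sagemaker.api",
    "sagemaker.runtime",
    "ecr.api",
    "ecr.dkr",
    "logs",
    "sts",
  ])
}

# Security group for the endpoint ENIs (hypothetical name).
resource "aws_security_group" "endpoints" {
  name        = "studio-vpc-endpoints"
  description = "Interface endpoints used by SageMaker Studio"
  vpc_id      = module.network.vpc_id
}

# Allow HTTPS from the Studio domain's security group only.
resource "aws_vpc_security_group_ingress_rule" "endpoints_https" {
  security_group_id            = aws_security_group.endpoints.id
  description                  = "HTTPS from Studio apps"
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
  referenced_security_group_id = module.sagemaker.security_group_id
}

resource "aws_vpc_endpoint" "studio" {
  for_each = local.studio_endpoint_services

  vpc_id              = module.network.vpc_id
  service_name        = "com.amazonaws.${data.aws_region.current.name}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.network.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

With private DNS enabled, the standard service hostnames resolve to the endpoint ENIs inside the VPC, which is what lets an SG egress rule limited to the VPC CIDR still reach these services.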

Best practices
