Quick take — Provision a locked-down Amazon SageMaker Studio domain with Terraform: VPC-only networking, KMS-encrypted EFS, IAM execution roles, and a default user profile — reusable across teams and environments. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "sagemaker" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-sagemaker?ref=v1.0.0"
name_prefix = "..." # Prefix for all resource names; lowercase, 2-31 chars, s…
environment = "..." # One of `dev`, `staging`, `prod`; used in naming and tag…
vpc_id = "..." # VPC for the domain's ENIs.
subnet_ids = ["...", "..."] # Private subnet IDs (multi-AZ recommended).
kms_key_id = "..." # Customer-managed KMS key ARN encrypting the home EFS.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
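The kms_key_id placeholder expects the ARN of a customer-managed key. If you do not have one yet, a minimal sketch might look like the following. The resource name sagemaker, the alias, and the 30-day deletion window are illustrative assumptions, not part of the module, and the key relies on the default key policy, which may not meet your organization's requirements.

```hcl
# Hypothetical customer-managed key for the Studio home EFS (names are illustrative).
resource "aws_kms_key" "sagemaker" {
  description             = "Home EFS encryption for SageMaker Studio"
  enable_key_rotation     = true # annual automatic rotation
  deletion_window_in_days = 30   # grace period before the key is destroyed
}

# Optional alias so the key is easy to find in the console.
resource "aws_kms_alias" "sagemaker" {
  name          = "alias/sagemaker-studio-efs"
  target_key_id = aws_kms_key.sagemaker.key_id
}
```

With this in place, the Quickstart input would become kms_key_id = aws_kms_key.sagemaker.arn.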
What this module is
Amazon SageMaker Studio is AWS’s managed, web-based IDE for the full machine-learning lifecycle — data prep, notebooks, training jobs, experiments, and model deployment. The unit you actually provision is a SageMaker domain (aws_sagemaker_domain): a long-lived, account- and region-scoped construct that owns the shared Amazon EFS file system backing every notebook, the networking mode, the default execution role, and the default settings every user profile inherits.
Creating a domain by hand is deceptively involved. You have to pick VpcOnly vs PublicInternetOnly, attach the right subnets and security groups, wire a KMS key onto the EFS volume, choose the auth mode (IAM vs IAM Identity Center), and supply a JupyterServer/KernelGateway app configuration with sane default instance types. A wrong choice on day one (say, leaving the domain on PublicInternetOnly) is effectively immutable, forcing a destroy-and-recreate that orphans EFS data.
This module wraps aws_sagemaker_domain together with the sub-resources teams almost always need in production — a scoped IAM execution role, a VpcOnly network posture with a dedicated security group, KMS encryption of the home EFS, and a default aws_sagemaker_user_profile — behind a small, validated variable surface. You get a repeatable, policy-compliant ML workspace per environment instead of a hand-clicked one-off in the console.
When to use it
- You need a standardized SageMaker Studio domain per environment (dev / staging / prod) or per data-science team, with identical guardrails each time.
- Your security posture mandates no direct internet egress from notebooks (VpcOnly), with traffic forced through VPC endpoints or a NAT/proxy.
- You want encryption-at-rest with a customer-managed KMS key on the notebook EFS volume, not the AWS-owned default.
- You’re standing up a landing zone / platform where ML workspaces are vended to teams and must be auditable and reproducible.
- You do not need this for ad-hoc, throwaway experimentation where a console-created domain you delete the same day is fine — the value here is governance and repeatability.
Module structure
terraform-module-aws-sagemaker/
├── versions.tf # provider + Terraform version pins
├── main.tf # IAM role, security group, domain, default user profile
├── variables.tf # validated input surface
└── outputs.tf # ids, ARNs, EFS id, domain URL
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
name = "${var.name_prefix}-${var.environment}"
tags = merge(
{
Module = "terraform-module-aws-sagemaker"
Environment = var.environment
ManagedBy = "Terraform"
},
var.tags,
)
}
# ----------------------------------------------------------------------------
# Execution role assumed by SageMaker Studio apps, training & processing jobs
# ----------------------------------------------------------------------------
data "aws_iam_policy_document" "assume" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["sagemaker.amazonaws.com"]
}
}
}
resource "aws_iam_role" "execution" {
name = "${local.name}-sagemaker-exec"
assume_role_policy = data.aws_iam_policy_document.assume.json
permissions_boundary = var.permissions_boundary_arn
tags = local.tags
}
# Broad managed policy is convenient for getting started; scope it down in prod
# by setting attach_full_access = false and supplying managed_policy_arns.
resource "aws_iam_role_policy_attachment" "full_access" {
count = var.attach_full_access ? 1 : 0
role = aws_iam_role.execution.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
resource "aws_iam_role_policy_attachment" "extra" {
for_each = toset(var.managed_policy_arns)
role = aws_iam_role.execution.name
policy_arn = each.value
}
# ----------------------------------------------------------------------------
# Dedicated security group for the domain's ENIs (VpcOnly mode)
# ----------------------------------------------------------------------------
resource "aws_security_group" "domain" {
name = "${local.name}-sagemaker-sg"
description = "SageMaker Studio domain ENIs (${local.name})"
vpc_id = var.vpc_id
tags = local.tags
}
# NFS within the SG so JupyterServer/KernelGateway apps can reach the home EFS.
resource "aws_vpc_security_group_ingress_rule" "efs_self" {
security_group_id = aws_security_group.domain.id
description = "EFS (NFS) between Studio apps"
from_port = 2049
to_port = 2049
ip_protocol = "tcp"
referenced_security_group_id = aws_security_group.domain.id
}
# Egress only to the approved CIDRs in var.egress_cidr_blocks (pass the VPC CIDR so VpcOnly traffic leaves via VPC endpoints/NAT).
resource "aws_vpc_security_group_egress_rule" "vpc_egress" {
for_each = toset(var.egress_cidr_blocks)
security_group_id = aws_security_group.domain.id
description = "Egress to approved CIDR"
ip_protocol = "-1"
cidr_ipv4 = each.value
}
# ----------------------------------------------------------------------------
# The SageMaker Studio domain
# ----------------------------------------------------------------------------
resource "aws_sagemaker_domain" "this" {
domain_name = local.name
auth_mode = var.auth_mode
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
app_network_access_type = "VpcOnly"
kms_key_id = var.kms_key_id
default_user_settings {
execution_role = aws_iam_role.execution.arn
security_groups = concat([aws_security_group.domain.id], var.extra_security_group_ids)
jupyter_server_app_settings {
default_resource_spec {
instance_type = var.jupyter_instance_type
}
}
kernel_gateway_app_settings {
default_resource_spec {
instance_type = var.kernel_gateway_instance_type
}
}
}
retention_policy {
home_efs_file_system = var.home_efs_retention
}
tags = local.tags
}
# ----------------------------------------------------------------------------
# Optional default user profile so the domain is usable immediately
# ----------------------------------------------------------------------------
resource "aws_sagemaker_user_profile" "default" {
count = var.create_default_user_profile ? 1 : 0
domain_id = aws_sagemaker_domain.this.id
user_profile_name = var.default_user_profile_name
user_settings {
execution_role = aws_iam_role.execution.arn
}
tags = local.tags
}
variables.tf
variable "name_prefix" {
description = "Prefix for all resource names (e.g. team or product)."
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,30}$", var.name_prefix))
error_message = "name_prefix must be lowercase alphanumeric/hyphens, 2-31 chars, starting with a letter."
}
}
variable "environment" {
description = "Deployment environment, used in naming and tags."
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "environment must be one of: dev, staging, prod."
}
}
variable "vpc_id" {
description = "VPC in which the Studio domain ENIs are placed."
type = string
}
variable "subnet_ids" {
description = "Private subnet IDs for the domain (multi-AZ recommended)."
type = list(string)
validation {
condition = length(var.subnet_ids) >= 1
error_message = "Provide at least one subnet_id."
}
}
variable "kms_key_id" {
description = "ARN of a customer-managed KMS key encrypting the home EFS volume."
type = string
validation {
condition = can(regex("^arn:aws[a-z-]*:kms:", var.kms_key_id))
error_message = "kms_key_id must be a KMS key ARN (arn:aws:kms:...)."
}
}
variable "auth_mode" {
description = "Domain auth mode: IAM or SSO (IAM Identity Center)."
type = string
default = "IAM"
validation {
condition = contains(["IAM", "SSO"], var.auth_mode)
error_message = "auth_mode must be IAM or SSO."
}
}
variable "jupyter_instance_type" {
description = "Default instance type for the JupyterServer app."
type = string
default = "system"
}
variable "kernel_gateway_instance_type" {
description = "Default instance type for KernelGateway notebook kernels."
type = string
default = "ml.t3.medium"
validation {
condition = startswith(var.kernel_gateway_instance_type, "ml.")
error_message = "kernel_gateway_instance_type must be a SageMaker ml.* instance type."
}
}
variable "home_efs_retention" {
description = "What to do with the home EFS on domain delete: Retain or Delete."
type = string
default = "Retain"
validation {
condition = contains(["Retain", "Delete"], var.home_efs_retention)
error_message = "home_efs_retention must be Retain or Delete."
}
}
variable "attach_full_access" {
description = "Attach AmazonSageMakerFullAccess to the execution role. Disable and scope down for prod."
type = bool
default = true
}
variable "managed_policy_arns" {
description = "Additional managed policy ARNs to attach to the execution role."
type = list(string)
default = []
}
variable "permissions_boundary_arn" {
description = "Optional IAM permissions boundary for the execution role."
type = string
default = null
}
variable "extra_security_group_ids" {
description = "Additional security groups to attach to Studio apps."
type = list(string)
default = []
}
variable "egress_cidr_blocks" {
description = "CIDR blocks the domain SG may egress to (e.g. VPC CIDR for endpoints/NAT)."
type = list(string)
default = ["10.0.0.0/8"]
}
variable "create_default_user_profile" {
description = "Create a starter user profile so the domain is usable immediately."
type = bool
default = true
}
variable "default_user_profile_name" {
description = "Name of the default user profile."
type = string
default = "default-user"
}
variable "tags" {
description = "Extra tags merged onto all resources."
type = map(string)
default = {}
}
outputs.tf
output "domain_id" {
description = "The SageMaker Studio domain ID."
value = aws_sagemaker_domain.this.id
}
output "domain_arn" {
description = "ARN of the SageMaker domain."
value = aws_sagemaker_domain.this.arn
}
output "domain_url" {
description = "Single-sign-on / IAM login URL for Studio."
value = aws_sagemaker_domain.this.url
}
output "home_efs_file_system_id" {
description = "EFS file system backing the domain's user storage."
value = aws_sagemaker_domain.this.home_efs_file_system_id
}
output "execution_role_arn" {
description = "ARN of the SageMaker execution role (reuse for jobs/pipelines)."
value = aws_iam_role.execution.arn
}
output "security_group_id" {
description = "Security group attached to the domain's ENIs."
value = aws_security_group.domain.id
}
output "default_user_profile_name" {
description = "Name of the default user profile, if created."
value = try(aws_sagemaker_user_profile.default[0].user_profile_name, null)
}
How to use it
module "sagemaker" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-sagemaker?ref=v1.0.0"
name_prefix = "ml-platform"
environment = "prod"
vpc_id = module.network.vpc_id
subnet_ids = module.network.private_subnet_ids
kms_key_id = aws_kms_key.sagemaker.arn
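auth_mode = "SSO"
# User profiles in an SSO domain need single_sign_on_user_identifier/value,
# which this module does not set, so skip the starter profile here.
create_default_user_profile = false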
kernel_gateway_instance_type = "ml.m5.xlarge"
home_efs_retention = "Retain"
# Lock IAM down in prod: drop the broad managed policy, attach scoped ones.
attach_full_access = false
managed_policy_arns = [aws_iam_policy.sagemaker_scoped.arn]
# Only the VPC CIDR is reachable; egress flows via interface endpoints.
egress_cidr_blocks = [module.network.vpc_cidr]
tags = {
CostCenter = "data-science"
Owner = "ml-platform-team"
}
}
# Downstream: grant the execution role read access to the curated-data bucket.
# The same role ARN can also be reused for training jobs and pipelines.
resource "aws_s3_bucket_policy" "training_data_access" {
bucket = aws_s3_bucket.curated.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Sid = "SageMakerExecRead"
Effect = "Allow"
Principal = { AWS = module.sagemaker.execution_role_arn }
Action = ["s3:GetObject", "s3:ListBucket"]
Resource = [
aws_s3_bucket.curated.arn,
"${aws_s3_bucket.curated.arn}/*",
]
}]
})
}
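The module outputs can also be used to add further user profiles outside the module. The sketch below assumes a domain created with auth_mode = "IAM"; the resource label analyst and the profile name analyst-1 are hypothetical. In an SSO domain, each profile would additionally need single_sign_on_user_identifier and single_sign_on_user_value.

```hcl
# Hypothetical extra profile attached to the module's domain (IAM auth mode assumed).
resource "aws_sagemaker_user_profile" "analyst" {
  domain_id         = module.sagemaker.domain_id
  user_profile_name = "analyst-1" # illustrative name

  user_settings {
    # Reuse the module's execution role; swap in a per-user role if you need isolation.
    execution_role = module.sagemaker.execution_role_arn
  }
}
```

Settings not listed under user_settings fall back to the domain's default_user_settings, so the instance-type defaults chosen in the module still apply.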
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/sagemaker/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-sagemaker?ref=v1.0.0"
}
inputs = {
name_prefix = "..."
environment = "..."
vpc_id = "..."
subnet_ids = ["...", "..."]
kms_key_id = "..."
}
3. Deploy this module on its own, or roll out every module in the environment together (both commands run from the repo root):
(cd live/prod/sagemaker && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; the inputs block is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
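To make the dependency orchestration concrete, here is a sketch of how the sagemaker terragrunt.hcl could pull its network inputs from a sibling stack. The ../network path and the output names vpc_id and private_subnet_ids are assumptions about your repo layout, not something this module defines.

```hcl
# live/prod/sagemaker/terragrunt.hcl (excerpt): hypothetical sibling "network" stack.
dependency "network" {
  config_path = "../network"

  # Placeholder values so validate/plan can run before the network stack is applied.
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id     = dependency.network.outputs.vpc_id
  subnet_ids = dependency.network.outputs.private_subnet_ids
  # ...remaining inputs (name_prefix, environment, kms_key_id) as in step 2...
}
```

With a dependency block in place, run-all applies the network stack before this one, because Terragrunt orders modules by their declared dependencies.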
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | - | Yes | Prefix for all resource names; lowercase, 2-31 chars, starts with a letter. |
| environment | string | - | Yes | One of dev, staging, prod; used in naming and tags. |
| vpc_id | string | - | Yes | VPC for the domain’s ENIs. |
| subnet_ids | list(string) | - | Yes | Private subnet IDs (multi-AZ recommended). |
| kms_key_id | string | - | Yes | Customer-managed KMS key ARN encrypting the home EFS. |
| auth_mode | string | "IAM" | No | IAM or SSO (IAM Identity Center). |
| jupyter_instance_type | string | "system" | No | Default instance type for the JupyterServer app. |
| kernel_gateway_instance_type | string | "ml.t3.medium" | No | Default kernel instance type; must be an ml.* type. |
| home_efs_retention | string | "Retain" | No | Retain or Delete the home EFS on domain deletion. |
| attach_full_access | bool | true | No | Attach AmazonSageMakerFullAccess; disable to scope IAM in prod. |
| managed_policy_arns | list(string) | [] | No | Extra managed policy ARNs for the execution role. |
| permissions_boundary_arn | string | null | No | Optional permissions boundary for the execution role. |
| extra_security_group_ids | list(string) | [] | No | Additional security groups attached to Studio apps. |
| egress_cidr_blocks | list(string) | ["10.0.0.0/8"] | No | CIDRs the domain SG may egress to. |
| create_default_user_profile | bool | true | No | Create a starter user profile. |
| default_user_profile_name | string | "default-user" | No | Name of the default user profile. |
| tags | map(string) | {} | No | Extra tags merged onto all resources. |
Outputs
| Name | Description |
|---|---|
| domain_id | The SageMaker Studio domain ID. |
| domain_arn | ARN of the domain. |
| domain_url | SSO / IAM login URL for Studio. |
| home_efs_file_system_id | EFS file system backing user storage. |
| execution_role_arn | Execution role ARN, reusable for jobs and pipelines. |
| security_group_id | Security group attached to the domain’s ENIs. |
| default_user_profile_name | Name of the default user profile, or null if not created. |
Enterprise scenario
A regulated fintech runs a central ML platform team that vends a SageMaker Studio domain to each product squad. They call this module once per squad-environment from a landing-zone repo, passing the squad’s private subnets and a per-squad KMS key, with attach_full_access = false and a scoped policy that only permits the squad’s S3 prefix and ECR repos. Because app_network_access_type is hard-pinned to VpcOnly and egress is limited to the VPC CIDR, notebooks reach S3, ECR, and CloudWatch exclusively through interface VPC endpoints — satisfying the auditor’s “no direct internet from data-science workloads” control while keeping every domain identical and reproducible in CI.
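The module as listed does not create the interface endpoints this scenario depends on, so they have to exist in the VPC already or be provisioned alongside it. A minimal sketch, assuming the module.network outputs from the usage example, a hypothetical endpoint security group, and an illustrative service list; S3 is left out because the choice between a gateway and an interface endpoint interacts with the egress CIDR rules:

```hcl
data "aws_region" "current" {}

locals {
  # Illustrative selection; adjust to the services your notebooks actually call.
  interface_services = ["sagemaker.api", "sagemaker.runtime", "sts", "ecr.api", "ecr.dkr", "logs"]
}

# Hypothetical security group for the endpoint ENIs.
resource "aws_security_group" "endpoints" {
  name        = "ml-platform-prod-vpce-sg"
  description = "Interface endpoints reachable from Studio apps"
  vpc_id      = module.network.vpc_id
}

# Allow HTTPS from the Studio domain security group only.
resource "aws_vpc_security_group_ingress_rule" "endpoints_https" {
  security_group_id            = aws_security_group.endpoints.id
  description                  = "HTTPS from the Studio domain security group"
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
  referenced_security_group_id = module.sagemaker.security_group_id
}

# One interface endpoint per service, with private DNS so SDK calls resolve to it.
resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(local.interface_services)
  vpc_id              = module.network.vpc_id
  service_name        = "com.amazonaws.${data.aws_region.current.name}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.network.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

Because the endpoint ENIs sit inside the VPC CIDR, the module's egress rule (egress_cidr_blocks set to the VPC CIDR) already covers traffic from notebooks to them.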
Best practices
- Pin VpcOnly and never relax it. The network access type is effectively immutable; recreating a domain orphans the EFS and disrupts every user. This module fixes it to VpcOnly; pair it with interface VPC endpoints (S3, STS, SageMaker API/Runtime, ECR, CloudWatch), which the module itself does not create, so notebooks have no internet path.
- Use a customer-managed KMS key for the home EFS. Setting kms_key_id keeps notebook contents under a key you control and can audit; the default AWS-owned key can’t be rotated or restricted per team. Mirror that key on training/processing job output too.
- Scope the execution role for production. AmazonSageMakerFullAccess is broad: set attach_full_access = false and pass least-privilege managed_policy_arns plus a permissions_boundary_arn, granting only the specific S3 prefixes, ECR repos, and KMS keys each team needs.
- Set home_efs_retention = "Retain" outside dev. A terraform destroy with Delete permanently removes every user’s notebooks and data; retaining the EFS protects against accidental teardown of shared work.
- Control cost via default instance types and right-sizing. Pick a modest kernel_gateway_instance_type (e.g. ml.t3.medium) as the default and reserve large GPU types for explicit job specs. Idle KernelGateway apps bill per running hour, so encourage users to shut down apps and consider lifecycle automation.
- Name and tag consistently. The name_prefix/environment convention plus merged tags (CostCenter, Owner, Environment) make domains, roles, and EFS volumes attributable in Cost Explorer and easy to find when a squad’s workspace needs attention.
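As an illustration of the scoped-role practice, the aws_iam_policy.sagemaker_scoped referenced in the usage example could be sketched roughly as below. The squad-a/ prefix and the policy name are hypothetical, aws_s3_bucket.curated and aws_kms_key.sagemaker are assumed to exist elsewhere in your configuration, and the statement set is deliberately incomplete: a working Studio role also needs the SageMaker, ECR, and CloudWatch Logs actions your workloads use.

```hcl
data "aws_iam_policy_document" "sagemaker_scoped" {
  # List only the squad's prefix in the curated bucket.
  statement {
    sid       = "ListTeamPrefix"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.curated.arn]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["squad-a/*"] # hypothetical prefix
    }
  }

  # Read and write objects under that prefix only.
  statement {
    sid       = "ReadWriteTeamPrefix"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.curated.arn}/squad-a/*"]
  }

  # Use, but not administer, the squad's KMS key.
  statement {
    sid       = "UseTeamKey"
    actions   = ["kms:Decrypt", "kms:GenerateDataKey"]
    resources = [aws_kms_key.sagemaker.arn]
  }
}

resource "aws_iam_policy" "sagemaker_scoped" {
  name   = "ml-platform-prod-sagemaker-scoped" # illustrative name
  policy = data.aws_iam_policy_document.sagemaker_scoped.json
}
```

Passing this policy's ARN in managed_policy_arns with attach_full_access = false leaves the execution role with only these grants plus whatever the permissions boundary allows.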