
AWS Capstone: Build a Well-Architected Multi-Account Landing Zone + 3-Tier App

This is the capstone of the AWS Zero-to-Hero course. Everything you have learned — the global infrastructure and account model, IAM and the policy-evaluation logic, VPC networking, compute and databases, security, observability and troubleshooting — now converges into one project that proves end-to-end skill: you will build a governed multi-account landing zone and deploy a production 3-tier application onto it, then review the entire result against the six pillars of the AWS Well-Architected Framework. A landing zone is the pre-built, secured, multi-account environment that workloads “land” in — networking, identity, guardrails, logging and cost controls wired up first, so that when a team arrives they inherit security and consistency on day one instead of reinventing it (badly) on every project.

We will work exactly the way a real platform team works: start from a business brief, make explicit design decisions you can defend in a review, then build in staged phases — each phase validated before the next, and each one pointing at a deeper KloudVin lesson for production detail beyond what a single capstone can hold. You will finish with a small but genuinely real environment you can run in your own account, a set of acceptance criteria to prove it works, a Well-Architected review scored pillar by pillar, and a project you can put on your CV and talk through in an interview with total confidence.

Learning objectives

By the end of this capstone you can:

Prerequisites

This is the final, Advanced lesson of the AWS Zero-to-Hero course and it assumes the whole course. You should be comfortable with the account and Organizations model, the IAM policy-evaluation logic (explicit deny beats allow beats implicit deny), VPC fundamentals (subnets, route tables, internet and NAT gateways, security groups), the core compute and database services, and driving AWS from the CloudShell/aws CLI. If any of those feel shaky, work the earlier lessons first — this capstone links back to them at each phase rather than re-teaching them. For the hands-on lab you need one AWS account with administrator access and the AWS CLI v2 configured; the full multi-account build is described throughout and modelled in a single account where personal-account limits apply. Everything in the lab stays within or close to the Free Tier if you clean up the same day.

Core concept: what a landing zone is and why six pillars

A landing zone is not a product — it is a governed starting point. The mistake juniors make is to treat “build the app” as the whole job; the architect’s job is to build the environment the app is safe to run in first. Concretely a landing zone gives you four things before any workload exists: a multi-account structure so blast radius and billing are bounded; a central identity plane so humans get short-lived, least-privilege access instead of long-lived keys; guardrails (preventive and detective) that make the wrong thing hard and the right thing automatic; and a shared network, logging and cost baseline so every workload inherits connectivity, an audit trail and cost attribution by default.

The Well-Architected Framework is the lens we review it through. It is AWS’s distilled body of architectural best practice, organised into six pillars. Knowing them cold is both an exam requirement and the language design reviews are conducted in.

| Pillar | Core question it answers | What it looks like in this capstone |
|---|---|---|
| Operational Excellence | Can you run, observe and improve the system? | IaC for everything, CloudWatch dashboards/alarms, CloudTrail, runbooks |
| Security | Is access least-privilege and is data protected? | Multi-account, Identity Center, SCPs, GuardDuty, KMS, Secrets Manager, private subnets |
| Reliability | Does it survive failure and recover? | Multi-AZ RDS, Fargate across AZs behind an ALB, backups, health checks |
| Performance Efficiency | Are resources right-sized and elastic? | Fargate auto scaling, Graviton, right-sized RDS, ALB |
| Cost Optimization | Are you paying only for what you need? | Budgets, tags, Fargate Spot for non-prod, Savings Plans, lifecycle |
| Sustainability | Are you minimising the resources consumed? | Graviton, scaling to demand, Spot, efficient storage tiers |

We design for the pillars as we build, then review against them at the end — the same loop a real Well-Architected Review (WAR) follows.

The brief

Our fictional company is Meridian Retail, a mid-size e-commerce firm moving from a single hand-built AWS account (one engineer clicked it together, the root user has an access key, everything shares one VPC) to a governed multi-account foundation hosting their order-management web application. Leadership wants, in their words:

  1. “Stop the wild west.” One account for everything is over. Production must be isolated from development, the root user must never be used day to day, and every resource must be owned, tagged and logged.
  2. “Ship the order app safely.” A standard internet-facing 3-tier web app (load balancer → application → database) that survives the loss of a data centre, keeps the database private, and rolls out new versions without dropping customer traffic.
  3. “No surprises on the bill or in an audit.” Finance wants spend attributable per team and environment with alerts before a budget blows; Security wants a complete audit trail and automatic threat detection across every account.

Translated into platform language, Meridian needs: a multi-account structure under AWS Organizations with an OU hierarchy; Control Tower to stand up and govern it; IAM Identity Center for federated, least-privilege human access; a hub-and-spoke network with centralised egress via Transit Gateway; a workload of ALB + ECS Fargate + RDS Multi-AZ in private subnets; preventive guardrails (SCPs) and detective controls (GuardDuty, CloudTrail) plus encryption (KMS) and secret management (Secrets Manager); a central observability baseline; DR through Multi-AZ and backups; and cost controls. That is precisely a Well-Architected landing zone with a workload on it.

Design decisions

A landing zone is mostly a set of decisions; the implementation is easy once they are explicit and defensible. Here are the eight that matter, each with the reasoning a reviewer expects and the deeper lesson that owns it.

1. Account structure and OUs

Decision: adopt a multi-account structure governed by AWS Organizations, laid out with the Control Tower reference OU hierarchy rather than a flat set of accounts. Accounts are the strongest isolation and billing boundary AWS offers — far stronger than VPCs or IAM within one account — so we separate by function and environment. A management account holds the Organization and billing and runs no workloads; a Security OU holds the Log Archive and Audit accounts; a Workloads OU splits into Prod and Non-Prod child OUs holding the application accounts.

Root
├── Management account            (Organization, billing — no workloads)
├── Security OU
│   ├── Log Archive account       (immutable central CloudTrail/Config logs)
│   └── Audit account             (security tooling, GuardDuty admin)
├── Infrastructure OU
│   └── Network account           (Transit Gateway, central egress, DNS)
└── Workloads OU
    ├── Prod OU
    │   └── meridian-prod account (the order app — production)
    └── Non-Prod OU
        └── meridian-dev account  (the order app — development)

A guardrail (SCP) attached to the Workloads OU flows to Prod, Non-Prod and every future account beneath them, so new teams inherit governance automatically. The management account is kept clean because anything granted there is hard to constrain — SCPs do not apply to it. Detail: Building a Multi-Account AWS Landing Zone with Control Tower.
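To make the inheritance rule concrete, here is a toy bash model of the OU tree above. It is only a sketch: the policy names and the choice to attach the audit-protection SCP at Root are assumptions for illustration, and it models deny-style SCPs only (it does not call AWS).

```shell
#!/usr/bin/env bash
# Toy model of SCP inheritance over the OU tree above (no AWS calls).
# parent[x] is the OU (or Root) that directly contains x.
declare -A parent=(
  [SecurityOU]=Root [InfrastructureOU]=Root [WorkloadsOU]=Root
  [ProdOU]=WorkloadsOU [NonProdOU]=WorkloadsOU
  [management]=Root [log-archive]=SecurityOU [audit]=SecurityOU
  [network]=InfrastructureOU
  [meridian-prod]=ProdOU [meridian-dev]=NonProdOU
)
# SCPs attached directly to a node (hypothetical policy names).
declare -A attached=(
  [Root]="protect-audit-layer"
  [WorkloadsOU]="deny-leave-org-and-root region-lock"
)

# Print every SCP that constrains an account: anything attached to the
# account itself plus everything attached to each ancestor up to Root.
scps_for() {
  local node=$1 out=""
  if [ "$node" = management ]; then
    echo "(none: SCPs do not apply to the management account)"
    return
  fi
  while [ -n "$node" ]; do
    if [ -n "${attached[$node]:-}" ]; then
      out="${attached[$node]} $out"
    fi
    node=${parent[$node]:-}
  done
  echo "${out% }"
}

scps_for meridian-prod
scps_for audit
scps_for management
```

The production account picks up both the Root-level and the Workloads-level policies without anyone attaching anything to it directly, which is the property that lets a newly vended account inherit governance.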

2. Identity: IAM Identity Center, not IAM users

Decision: humans never get IAM users or long-lived access keys. IAM Identity Center (the service formerly called AWS SSO) is the single front door: connect the corporate identity provider (Entra ID/Okta) over SAML and SCIM, define permission sets (reusable role templates such as AdministratorAccess, PowerUserAccess, ReadOnlyAccess, Billing), and assign groups to accounts with a permission set. Engineers run aws sso login and get short-lived credentials scoped to exactly the account and role they need.

| Principal | Access model | Used for |
|---|---|---|
| Root user | MFA, locked away, used almost never | Account recovery, a handful of root-only tasks |
| Workforce (humans) | IAM Identity Center + permission sets, short-lived | All day-to-day console/CLI access |
| Workloads (apps) | IAM roles (task roles, instance profiles) | EC2/ECS/Lambda assuming roles, no static keys |
| CI/CD pipelines | IAM roles via OIDC federation | GitHub Actions/CodePipeline, no stored keys |

This kills the two biggest real-world risks at once: leaked long-lived keys and standing admin. Detail: AWS IAM Identity Center at Scale and the foundations in AWS IAM least privilege & permission boundaries.
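On the engineer's side, the front door is an AWS CLI v2 profile that points at Identity Center rather than at stored keys. The sketch below writes an example profile to a scratch file so it cannot clobber a real config. The start URL, account ID, session name and profile name are placeholders, and the Identity Center Region is assumed to be eu-west-1.

```shell
#!/usr/bin/env bash
# Write an example AWS CLI v2 Identity Center profile to a scratch file.
# Start URL, account ID, session and profile names are placeholders.
CONFIG_FILE=$(mktemp)
cat > "$CONFIG_FILE" <<'EOF'
[sso-session meridian]
sso_start_url = https://example.awsapps.com/start
sso_region = eu-west-1
sso_registration_scopes = sso:account:access

[profile meridian-dev-readonly]
sso_session = meridian
sso_account_id = 111122223333
sso_role_name = ReadOnlyAccess
region = eu-west-1
EOF
echo "Wrote example profile to $CONFIG_FILE"

# To use it for real, merge the two sections into ~/.aws/config, then:
#   aws sso login --profile meridian-dev-readonly
#   aws sts get-caller-identity --profile meridian-dev-readonly
```

Note that the profile holds no secret at all: it names an account and a permission set, and the short-lived credentials are only minted after the browser sign-in.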

3. Network: hub-and-spoke with Transit Gateway

Decision: give each workload account its own VPC (the spoke) and connect them through a central Transit Gateway in the Network account (the hub), sharing the TGW across the Organization with Resource Access Manager (RAM). Centralise internet egress in the Network account (one set of NAT gateways and, optionally, a Network Firewall) so all outbound traffic is inspected and logged in one place, and so spokes stay small and disposable. The application VPC uses a standard three-tier subnet layout across two Availability Zones: public subnets for the ALB, private subnets for the Fargate tasks, and isolated private subnets for RDS.

| Subnet tier | AZ-a / AZ-b | Routes to | Holds |
|---|---|---|---|
| Public | 10.20.0.0/24 / 10.20.1.0/24 | Internet Gateway | ALB, NAT gateways |
| Private (app) | 10.20.10.0/24 / 10.20.11.0/24 | NAT gateway / TGW | ECS Fargate tasks |
| Isolated (data) | 10.20.20.0/24 / 10.20.21.0/24 | No internet route | RDS Multi-AZ, no NAT |

The alternative — VPC peering everywhere, or one flat shared VPC — does not scale (peering is not transitive, CIDRs collide, the security boundary erodes). Detail: Multi-Account VPC Connectivity with Transit Gateway.
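Because colliding CIDRs are one of the failure modes named above, it can be worth sanity-checking an address plan before creating anything. The following is a minimal pure-bash sketch (no AWS calls) that checks the six subnets from the table sit inside the VPC range and do not overlap. The helper function names are invented for this example, and it assumes each CIDR is written with its network address.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "first last" (as integers) for a CIDR block such as 10.20.0.0/24.
cidr_range() {
  local base=${1%/*} bits=${1#*/} start size
  start=$(ip_to_int "$base")
  size=$(( 1 << (32 - bits) ))
  echo "$start $(( start + size - 1 ))"
}

# Succeed if the second CIDR lies entirely inside the first.
cidr_contains() {
  local o_first o_last i_first i_last
  read -r o_first o_last <<< "$(cidr_range "$1")"
  read -r i_first i_last <<< "$(cidr_range "$2")"
  (( i_first >= o_first && i_last <= o_last ))
}

# Succeed if the two CIDRs share at least one address.
cidr_overlaps() {
  local a_first a_last b_first b_last
  read -r a_first a_last <<< "$(cidr_range "$1")"
  read -r b_first b_last <<< "$(cidr_range "$2")"
  (( a_first <= b_last && b_first <= a_last ))
}

VPC_CIDR=10.20.0.0/16
SUBNETS=(10.20.0.0/24 10.20.1.0/24 10.20.10.0/24 10.20.11.0/24 10.20.20.0/24 10.20.21.0/24)

status=ok
for s in "${SUBNETS[@]}"; do
  if ! cidr_contains "$VPC_CIDR" "$s"; then
    echo "$s is outside $VPC_CIDR"; status=bad
  fi
done
for (( i = 0; i < ${#SUBNETS[@]}; i++ )); do
  for (( j = i + 1; j < ${#SUBNETS[@]}; j++ )); do
    if cidr_overlaps "${SUBNETS[i]}" "${SUBNETS[j]}"; then
      echo "${SUBNETS[i]} overlaps ${SUBNETS[j]}"; status=bad
    fi
  done
done
echo "CIDR plan: $status"
```

The same `cidr_overlaps` check can be pointed at two spoke VPC ranges before attaching them to the Transit Gateway, since overlapping spokes cannot be routed between cleanly.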

4. Workload: ALB + ECS Fargate + RDS Multi-AZ

Decision: run the order app as a classic, robust 3-tier architecture. An internet-facing Application Load Balancer in the public subnets terminates TLS (certificate from ACM) and routes HTTP/HTTPS to the application tier. The application tier is ECS on Fargate — serverless containers, no EC2 to patch — spread across both AZs, behind a target group, with service auto scaling on CPU/request count. The data tier is Amazon RDS (PostgreSQL) in Multi-AZ mode: a synchronous standby in the second AZ that takes over automatically on failure.

| Tier | Service | Why this choice | Spread |
|---|---|---|---|
| Presentation / routing | Application Load Balancer | L7 routing, TLS termination, health checks, sticky sessions | Both public subnets |
| Application | ECS Fargate (Graviton) | No servers to patch, scales to demand, per-task IAM role | Both private (app) subnets |
| Data | RDS PostgreSQL Multi-AZ | Managed, automatic failover, backups, point-in-time recovery | Primary + standby across AZs |

Fargate over EC2 removes an entire patching and capacity-planning burden (Operational Excellence + Security); Multi-AZ RDS is the single most important reliability decision for a stateful app. Detail: Production Amazon ECS on Fargate.

5. Security guardrails: SCPs, GuardDuty, KMS, Secrets Manager

Decision: defence in depth, preventive and detective. Service Control Policies at the OU level set the outer boundary of what any principal in those accounts can do — even an account admin cannot exceed them. GuardDuty is enabled organisation-wide from the Audit account for continuous threat detection. KMS customer-managed keys encrypt RDS, EBS, S3 and secrets, with key policies granting least-privilege use. Secrets Manager holds the database credentials with automatic rotation; nothing sensitive is ever baked into a task definition or image.

| Control | Type | Scope | Guards against |
|---|---|---|---|
| SCP: deny leaving Org, deny root actions, region lock | Preventive | Workloads OU | Account takeover, drift, data residency breach |
| SCP: deny disabling CloudTrail/GuardDuty/Config | Preventive | All OUs | Tampering with the audit/detection layer |
| GuardDuty | Detective | Org-wide (Audit account) | Compromised credentials, crypto-mining, recon |
| KMS CMKs | Protective | Per account/service | Plaintext data at rest |
| Secrets Manager + rotation | Protective | Workload accounts | Hard-coded/long-lived DB credentials |

Detail: SCP guardrails & delegated admin, KMS multi-Region keys & envelope encryption, and Secrets Manager automatic rotation.
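As an illustration of the second row of the table, the sketch below writes a deny-style SCP that blocks the most direct ways of switching off the audit and detection layer. The action list is a representative subset rather than a complete policy, and the policy name in the comments is hypothetical. A real deployment would usually also exempt the platform team's automation role with a condition.

```shell
#!/usr/bin/env bash
# Write a deny-style SCP protecting CloudTrail, GuardDuty and Config.
# The action list is a representative subset, not an exhaustive policy.
SCP_FILE=$(mktemp)
cat > "$SCP_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProtectAuditAndDetection",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "guardduty:DeleteDetector",
        "config:StopConfigurationRecorder",
        "config:DeleteConfigurationRecorder"
      ],
      "Resource": "*"
    }
  ]
}
EOF
echo "SCP written to $SCP_FILE"

# From the management account you could then create and attach it
# (policy name is hypothetical; fill in the IDs from your Organization):
#   aws organizations create-policy --type SERVICE_CONTROL_POLICY \
#     --name protect-audit-layer \
#     --description "Deny disabling CloudTrail, GuardDuty and Config" \
#     --content "file://$SCP_FILE"
#   aws organizations attach-policy --policy-id <policy-id> --target-id <ou-or-root-id>
```

Because the statement is a Deny with no principal condition, it applies to every principal in the targeted accounts, administrators included, which is exactly what makes it a guardrail rather than a permission.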

6. Observability baseline

Decision: you cannot operate what you cannot see. CloudTrail is enabled as an organisation trail writing immutable logs to the Log Archive account, so every API call in every account is recorded centrally and out of reach of a compromised workload account. CloudWatch collects metrics, logs (container logs via the awslogs or FireLens log driver) and traces; a dashboard shows the golden signals (latency, error rate, request count, saturation) and alarms page on ALB 5xx, Fargate CPU, and RDS connections/free storage. AWS Config records resource configuration for compliance and drift.

| Signal | Source | Where it lands | Alarm on |
|---|---|---|---|
| API audit trail | CloudTrail org trail | Log Archive S3 (immutable) | Root usage, IAM changes |
| App/infra metrics | CloudWatch | Per-account + dashboard | ALB 5xx, target health, CPU |
| Container logs | ECS awslogs driver | CloudWatch Logs | Error-rate metric filters |
| Resource config/drift | AWS Config | Log Archive account | Non-compliant resources |

Detail builds on the troubleshooting lessons; for cross-account log/metric strategy see the landing-zone and Control Tower lessons.

7. Disaster recovery and resilience

Decision: the workload survives the loss of an Availability Zone with no human action (Multi-AZ RDS failover, Fargate tasks rescheduled in the surviving AZ, ALB removing the dead targets) and survives data loss or corruption through automated backups with point-in-time recovery and a periodic snapshot copied to a second Region for a regional disaster. We state an explicit RTO/RPO so the design is testable.

| Failure | Mechanism | Target |
|---|---|---|
| Single instance/task dies | ECS reschedules, ALB health check removes target | RTO seconds, RPO 0 |
| Availability Zone fails | RDS Multi-AZ failover + tasks already in 2nd AZ | RTO 1–2 min, RPO ~0 |
| Data corruption / bad deploy | RDS point-in-time recovery; ECS rollback | RTO minutes, RPO ≤5 min |
| Region fails | Restore cross-Region snapshot + redeploy IaC | RTO hours, RPO last copied snapshot |

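This posture combines Multi-AZ high availability within the Region with a backup-and-restore strategy across Regions (restore the copied snapshot, redeploy from IaC), which is appropriate for a mid-size retailer that can tolerate an RTO of hours for a regional disaster. To go further (active-passive or active-active multi-Region), see Enterprise multi-Region architecture on AWS and AWS DR strategies. Backups org-wide: AWS Backup with Organizations & Vault Lock.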

8. Cost controls

Decision: cost is engineered, not discovered on the bill. A mandatory tagging standard (CostCenter, Owner, Environment, Application) is enforced so Cost Explorer can slice spend by team and environment; AWS Budgets sets a monthly budget per account with alerts at 80% and 100% to the owner before the month closes; non-production runs Fargate Spot and is shut down out of hours; steady-state Fargate and RDS are covered by Compute Savings Plans and Reserved Instances once the baseline is known. This answers Meridian’s “no surprises on the bill” directly.
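A tagging standard only helps if something checks it. As a rough sketch of that check (pure bash, no AWS calls), the function below takes a resource's tags as a space-separated `Key=Value` list and reports which of the four mandatory keys are missing. The function name is invented for this example, and it assumes tag values contain no spaces.

```shell
#!/usr/bin/env bash
# Mandatory tag keys from the tagging standard above.
REQUIRED_TAGS=(CostCenter Owner Environment Application)

# missing_tags "Key=Value Key=Value ..." prints the required keys that are
# absent (space-separated), or an empty line if the resource is compliant.
# Assumes tag values contain no spaces.
missing_tags() {
  local present=" " pair key missing=""
  for pair in $1; do
    present+="${pair%%=*} "
  done
  for key in "${REQUIRED_TAGS[@]}"; do
    if [[ "$present" != *" $key "* ]]; then
      missing+="$key "
    fi
  done
  echo "${missing% }"
}

# Tags in the style of a lab VPC: Name, Environment and CostCenter only.
missing_tags "Name=meridian-lab Environment=lab CostCenter=retail"
# A fully tagged production resource (values are illustrative).
missing_tags "CostCenter=retail Owner=orders-team Environment=prod Application=order-app"
```

The first call reports `Owner Application` as missing, which shows how a resource that looks tagged can still fall outside Cost Explorer's per-team breakdown. In a real organisation the same rule is usually enforced with tag policies or a Config rule rather than a script.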

[Figure: AWS capstone: Well-Architected landing zone + 3-tier app]

The diagram above is the target state we are building toward: the Organizations OU hierarchy on the left (management, Security, Infrastructure, Workloads) with SCP guardrails inheriting downward; IAM Identity Center as the human front door; the Network account’s Transit Gateway hub attached to the application VPC; and inside that VPC the 3-tier app (ALB in public subnets, ECS Fargate in private subnets, RDS Multi-AZ in isolated subnets) wrapped by GuardDuty, KMS, Secrets Manager, and a CloudWatch/CloudTrail observability plane. Keep it open as a map while you build; each phase below fills in one part of it.

Staged build plan

You do not build a landing zone in one giant deployment. You build it in phases, validating each before the next, and you build it with infrastructure as code so it is reproducible and reviewable. The platform team uses CloudFormation/Terraform; Control Tower itself is largely click-or-blueprint-driven for the initial setup. Here is the plan; each phase names the deeper lesson to open if you need more than the snippet, and the hands-on lab that follows builds a free-tier slice of phases 3, 4 and 7 end to end.

| Phase | What you build | Pillar focus | Reuse lesson |
|---|---|---|---|
| 0. Foundations | Account, CLI, MFA on root, billing alerts | Operational Excellence | Earlier course lessons |
| 1. Account structure | Control Tower, Organization, OUs, accounts | Security, Cost | Control Tower landing zone |
| 2. Identity | Identity Center, permission sets, group assignments | Security | Identity Center at scale |
| 3. Network | VPC, 3-tier subnets, IGW/NAT, Transit Gateway hub | Reliability, Security | Transit Gateway architecture |
| 4. Workload | ALB + ECS Fargate + RDS Multi-AZ | Reliability, Performance | ECS on Fargate |
| 5. Security | SCPs, GuardDuty, KMS, Secrets Manager | Security | SCP guardrails |
| 6. Observability | CloudTrail org trail, CloudWatch dashboard + alarms | Operational Excellence | Troubleshooting lessons |
| 7. DR & cost | Multi-AZ, cross-Region backups, Budgets, tags | Reliability, Cost | AWS Backup |
| 8. Review | Well-Architected review across six pillars | All | Well-Architected reliability |

Representative IaC for the core pieces

You will mix tools in real life: Control Tower (and Account Factory) for account vending, CloudFormation StackSets to push baselines across accounts, and Terraform for the workload. Here are representative snippets for the load-bearing pieces.

An OU and an SCP (CloudFormation, Organizations):

Resources:
  WorkloadsOU:
    Type: AWS::Organizations::OrganizationalUnit
    Properties:
      Name: Workloads
      ParentId: !Ref RootId

  DenyLeaveOrgSCP:
    Type: AWS::Organizations::Policy
    Properties:
      Name: deny-leave-org-and-root
      Type: SERVICE_CONTROL_POLICY
      TargetIds: [!Ref WorkloadsOU]
      Content:
        Version: "2012-10-17"
        Statement:
          - Sid: DenyLeaveOrganization
            Effect: Deny
            Action: organizations:LeaveOrganization
            Resource: "*"
          - Sid: DenyRootUser
            Effect: Deny
            Action: "*"
            Resource: "*"
            Condition:
              StringLike:
                aws:PrincipalArn: "arn:aws:iam::*:root"

The application VPC with 3-tier subnets (Terraform):

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  name    = "meridian-prod"
  cidr    = "10.20.0.0/16"

  azs              = ["eu-west-1a", "eu-west-1b"]
  public_subnets   = ["10.20.0.0/24", "10.20.1.0/24"]   # ALB
  private_subnets  = ["10.20.10.0/24", "10.20.11.0/24"]  # Fargate
  database_subnets = ["10.20.20.0/24", "10.20.21.0/24"]  # RDS (isolated)

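  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true   # one NAT per AZ for reliability
  enable_dns_hostnames   = true

  # Give the database subnets their own route table with no NAT or IGW route;
  # by default the module associates them with the private route tables.
  create_database_subnet_route_table = true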
}

An ECS Fargate service behind a target group (Terraform, abridged):

resource "aws_ecs_service" "order_app" {
  name            = "order-app"
  cluster         = aws_ecs_cluster.this.id
  task_definition = aws_ecs_task_definition.order_app.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets       # private only
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "order-app"
    container_port   = 8080
  }

  # Roll back automatically if the new tasks fail to become healthy
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

An RDS Multi-AZ instance, encrypted, private (Terraform):

resource "aws_db_instance" "orders" {
  identifier              = "meridian-orders"
  engine                  = "postgres"
  instance_class          = "db.t4g.micro"   # Graviton
  allocated_storage       = 20
  multi_az                = true              # synchronous standby in 2nd AZ
  db_subnet_group_name    = aws_db_subnet_group.isolated.name
  vpc_security_group_ids  = [aws_security_group.db.id]
  storage_encrypted       = true
  kms_key_id              = aws_kms_key.data.arn
  backup_retention_period = 7
  username                = "orders_admin"    # hypothetical master user name (required on create)
  manage_master_user_password = true         # credentials go to Secrets Manager
  publicly_accessible     = false
}

Hands-on lab — build a free-tier landing-zone slice

You will build a real, working slice of the landing zone and workload using the AWS CLI in CloudShell, with nothing to install. To stay Free-Tier-friendly and avoid needing a multi-account Organization on a personal account, the lab builds the network, workload and cost guardrail (phases 3, 4 and 7) inside a single account; the commands are identical in shape to the per-account version, and the multi-account specifics are exactly as described in the design above. Everything goes into resources you delete at the end.

Note on scope: a real Control Tower landing zone provisions Organization accounts that a personal account may not be enrolled for, so the lab models the structure with one VPC and tags. The networking, Fargate and RDS commands are production-shaped.

1. Set context. Open CloudShell and confirm where you are:

aws sts get-caller-identity --output table
export AWS_REGION=eu-west-1
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "Working in account: $ACCOUNT_ID, region: $AWS_REGION"

2. Create the tiered VPC (phase 3). Create the VPC and one public + one private subnet (the lab uses a single AZ and skips the isolated data tier to stay small; production uses two AZs and all three tiers):

VPC_ID=$(aws ec2 create-vpc --cidr-block 10.20.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=meridian-lab},{Key=Environment,Value=lab},{Key=CostCenter,Value=retail}]' \
  --query Vpc.VpcId --output text)

PUB_SUBNET=$(aws ec2 create-subnet --vpc-id "$VPC_ID" \
  --cidr-block 10.20.0.0/24 --availability-zone ${AWS_REGION}a \
  --query Subnet.SubnetId --output text)

PRIV_SUBNET=$(aws ec2 create-subnet --vpc-id "$VPC_ID" \
  --cidr-block 10.20.10.0/24 --availability-zone ${AWS_REGION}a \
  --query Subnet.SubnetId --output text)

echo "VPC=$VPC_ID  public=$PUB_SUBNET  private=$PRIV_SUBNET"

3. Add an internet gateway and a public route so the public (ALB) subnet can send and receive internet traffic:

IGW_ID=$(aws ec2 create-internet-gateway --query InternetGateway.InternetGatewayId --output text)
aws ec2 attach-internet-gateway --internet-gateway-id "$IGW_ID" --vpc-id "$VPC_ID"

RTB_ID=$(aws ec2 create-route-table --vpc-id "$VPC_ID" --query RouteTable.RouteTableId --output text)
aws ec2 create-route --route-table-id "$RTB_ID" \
  --destination-cidr-block 0.0.0.0/0 --gateway-id "$IGW_ID"
aws ec2 associate-route-table --route-table-id "$RTB_ID" --subnet-id "$PUB_SUBNET"

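4. Create an ECS Fargate cluster and a tiny task definition (phase 4). We register a minimal task definition (a public sample container) targeting Graviton Fargate. The lab stops short of running the task, because a running task is billable and the lab VPC has no NAT gateway or load balancer; registering it is enough to validate the definition: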

aws ecs create-cluster --cluster-name meridian-lab \
  --capacity-providers FARGATE FARGATE_SPOT \
  --tags key=Environment,value=lab key=CostCenter,value=retail

# Execution role lets Fargate pull images and write logs
aws iam create-role --role-name meridianTaskExecRole \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name meridianTaskExecRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

cat > task.json <<EOF
{
  "family": "order-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256", "memory": "512",
  "runtimePlatform": { "cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/meridianTaskExecRole",
  "containerDefinitions": [{
    "name": "order-app",
    "image": "public.ecr.aws/nginx/nginx:stable",
    "portMappings": [{ "containerPort": 80 }],
    "essential": true
  }]
}
EOF
aws ecs register-task-definition --cli-input-json file://task.json

5. Create a cost guardrail (phase 7). Set a small monthly Budget with an alert so you are warned before spend climbs:

cat > budget.json <<EOF
{ "BudgetName": "meridian-lab-monthly", "BudgetLimit": { "Amount": "10", "Unit": "USD" },
  "TimeUnit": "MONTHLY", "BudgetType": "COST" }
EOF
cat > notify.json <<EOF
[ { "Notification": { "NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80, "ThresholdType": "PERCENTAGE" },
    "Subscribers": [ { "SubscriptionType": "EMAIL", "Address": "you@example.com" } ] } ]
EOF
aws budgets create-budget --account-id "$ACCOUNT_ID" \
  --budget file://budget.json --notifications-with-subscribers file://notify.json

6. Validate. Prove the slice exists and is wired correctly:

# VPC and subnets present and tagged
aws ec2 describe-vpcs --vpc-ids "$VPC_ID" \
  --query "Vpcs[0].{cidr:CidrBlock,tags:Tags}" --output json

# Public subnet has a route to the internet gateway
aws ec2 describe-route-tables --route-table-ids "$RTB_ID" \
  --query "RouteTables[0].Routes[?GatewayId=='$IGW_ID']" --output table

# Cluster is ACTIVE and the task definition registered
aws ecs describe-clusters --clusters meridian-lab \
  --query "clusters[0].{name:clusterName,status:status}" --output table
aws ecs describe-task-definition --task-definition order-app \
  --query "taskDefinition.{family:family,cpu:cpu,arch:runtimePlatform.cpuArchitecture}" --output table

# Budget present
aws budgets describe-budget --account-id "$ACCOUNT_ID" \
  --budget-name meridian-lab-monthly --query "Budget.BudgetName" --output text

Expected: the VPC shows your CIDR and tags, the route table lists a 0.0.0.0/0 route to the IGW, the cluster status is ACTIVE, the task definition reports ARM64, and the budget name prints. You now have, in miniature, the load-bearing pillars: a tiered network, a Graviton Fargate application tier, and a cost guardrail with tagging.

7. Cleanup. Remove everything to stay in Free Tier:

aws budgets delete-budget --account-id "$ACCOUNT_ID" --budget-name meridian-lab-monthly
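# deregister-task-definition needs family:revision (or an ARN), so loop over every revision
for TD_ARN in $(aws ecs list-task-definitions --family-prefix order-app \
    --query 'taskDefinitionArns[]' --output text); do
  aws ecs deregister-task-definition --task-definition "$TD_ARN" >/dev/null
done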
aws ecs delete-cluster --cluster meridian-lab
aws iam detach-role-policy --role-name meridianTaskExecRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam delete-role --role-name meridianTaskExecRole
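# Disassociate the route table from the public subnet before deleting it
ASSOC_ID=$(aws ec2 describe-route-tables --route-table-ids "$RTB_ID" \
  --query "RouteTables[0].Associations[?SubnetId=='$PUB_SUBNET'].RouteTableAssociationId" \
  --output text)
aws ec2 disassociate-route-table --association-id "$ASSOC_ID"
aws ec2 delete-route-table --route-table-id "$RTB_ID"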
aws ec2 detach-internet-gateway --internet-gateway-id "$IGW_ID" --vpc-id "$VPC_ID"
aws ec2 delete-internet-gateway --internet-gateway-id "$IGW_ID"
aws ec2 delete-subnet --subnet-id "$PUB_SUBNET"
aws ec2 delete-subnet --subnet-id "$PRIV_SUBNET"
aws ec2 delete-vpc --vpc-id "$VPC_ID"

Cost note: an empty VPC, subnets, an internet gateway, an ECS cluster with no running tasks, a task definition and a Budget are all free; the only thing that would cost money is leaving a NAT gateway, a running Fargate task, an ALB or an RDS instance up — so this lab, cleaned up the same day, stays comfortably in Free-Tier territory. If you extend it to run a task or an ALB/RDS, expect a few cents to a few dollars and delete promptly.

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| Fargate task stuck in PENDING then fails | No route to pull the image (private subnet, no NAT/endpoint) or missing execution role | Give the subnet a NAT route or add ECR/S3/CloudWatch VPC endpoints; attach AmazonECSTaskExecutionRolePolicy |
| ALB targets show unhealthy | Health-check path/port wrong, or the task’s security group does not allow the ALB | Point the health check at a real path/port; allow the ALB SG inbound on the container port in the task SG |
| RDS unreachable from the app | DB in isolated subnet with no SG rule from the app tier | Allow the app SG inbound on 5432 in the DB SG; never make RDS publicly accessible |
| “Access Denied” after switching to Identity Center | Permission set too narrow, or an SCP at the OU denies the action | Read the denied action; widen the permission set or check the SCP. An explicit deny in an SCP cannot be overridden by any IAM allow |
| New account ignores your guardrails | SCP attached to the wrong OU, or account sits outside the governed OU | Move the account under the correct OU; SCPs only flow to accounts in the targeted OU subtree (and never to the management account) |
| CloudTrail “stops” for one account | Someone disabled the local trail | Use an organisation trail and an SCP denying cloudtrail:StopLogging so it cannot be turned off |
| Budget alert never arrives | Email subscription not confirmed, or alert threshold above spend | Confirm the SNS/email subscription; lower the threshold to test; remember Budgets data lags a few hours |
| Multi-AZ failover took longer than expected | App holds stale DNS or long-lived DB connections | Use the RDS endpoint (not the IP), set sane connection-pool TTLs; consider RDS Proxy for faster failover |

Best practices

Security notes

The landing zone is your security baseline, so treat it that way. Use multi-account isolation as the primary blast-radius control — a compromise in development must not reach production. Front all human access with IAM Identity Center issuing short-lived credentials, grant least-privilege permission sets to groups, and keep the root user behind MFA and effectively unused. Set the outer boundary with SCPs (deny leaving the Org, deny disabling CloudTrail/GuardDuty/Config, deny risky regions) — remember an SCP is a guardrail, not a grant: it can only take permissions away, and an explicit deny anywhere wins. Turn on GuardDuty organisation-wide from the Audit account and CloudTrail as an organisation trail to the immutable Log Archive account so detection and audit cannot be tampered with from a workload account. Encrypt everything at rest with KMS customer-managed keys and in transit with TLS (ACM on the ALB). Keep database credentials in Secrets Manager with rotation — never in a task definition, environment variable or image. Keep the data tier in isolated subnets with security-group rules referencing the application tier’s group, not CIDR ranges. And give CI/CD pipelines OIDC-federated roles, not stored access keys.

Interview & exam questions

Q1. Why multiple AWS accounts instead of one account with multiple VPCs? Accounts are the strongest isolation and billing boundary AWS offers. Separate accounts give you a hard blast-radius boundary (a breach or runaway cost in one cannot touch another), clean per-team/per-environment billing, independent service quotas, and the ability to apply different guardrails (SCPs) per environment. VPCs and IAM within one account share a single failure and trust domain.

Q2. What is the difference between an SCP and an IAM policy? An SCP is an Organizations guardrail attached to an OU/account that sets the maximum permissions available to every principal in those accounts — it can only deny or limit, never grant. An IAM policy grants permissions to a specific principal within an account. The effective permission is the intersection: an action must be allowed by IAM and not denied by any SCP. SCPs do not apply to the management account.

Q3. Walk me through the policy-evaluation order when an action is requested. Explicit deny anywhere (SCP, resource policy, identity policy, permission boundary, session policy) wins outright. Otherwise the action must be explicitly allowed by an identity or resource policy and permitted by every applicable boundary (SCP, permission boundary, session policy). With no explicit allow, the default is an implicit deny. So: explicit deny > explicit allow > implicit deny, with all guardrails intersected.
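The ordering in Q3 can be written down as a small decision function. This is a deliberately simplified, same-account sketch in bash: each input is a yes/no summary of what the relevant policies say, the function name is invented for this example, and it ignores cross-account access and the special cases where a resource policy names a principal directly.

```shell
#!/usr/bin/env bash
# Simplified same-account model of IAM policy evaluation.
# Usage: evaluate EXPLICIT_DENY SCP BOUNDARY SESSION ALLOW   (each "yes" or "no")
#   EXPLICIT_DENY  some applicable policy contains a matching Deny
#   SCP            the SCPs on the account's OU path permit the action
#   BOUNDARY       the permission boundary permits it ("yes" if none is set)
#   SESSION        the session policy permits it ("yes" if none is passed)
#   ALLOW          an identity or resource policy has a matching Allow
evaluate() {
  local explicit_deny=$1 scp=$2 boundary=$3 session=$4 allow=$5
  if [ "$explicit_deny" = yes ]; then echo "DENY (explicit deny)"; return; fi
  if [ "$scp" != yes ];      then echo "DENY (implicit: outside SCP)"; return; fi
  if [ "$boundary" != yes ]; then echo "DENY (implicit: outside permission boundary)"; return; fi
  if [ "$session" != yes ];  then echo "DENY (implicit: outside session policy)"; return; fi
  if [ "$allow" != yes ];    then echo "DENY (implicit: no matching allow)"; return; fi
  echo "ALLOW"
}

evaluate no  yes yes yes yes   # everything lines up
evaluate yes yes yes yes yes   # an explicit deny beats the allow
evaluate no  no  yes yes yes   # IAM allows it, but the SCP does not
evaluate no  yes yes yes no    # nobody granted it
```

Only the first call is allowed. The third is the case that surprises people after a move to a governed Organization: the permission set grants the action, yet the request fails because the SCP never permitted it.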

Q4. Why ECS Fargate over EC2 for the application tier, and what is the trade-off? Fargate removes the EC2 layer — no instances to patch, scale or right-size — which improves Operational Excellence and Security (smaller attack surface, no SSH) and lets you scale to demand per task. The trade-offs are slightly higher per-vCPU cost at steady high utilisation, less control over the host, and no daemonset-style access. For spiky or modest workloads and teams that do not want to run servers, Fargate usually wins; for very large steady fleets, EC2 (or EKS on EC2) can be cheaper.

Q5. How does RDS Multi-AZ work, and how is it different from a read replica? A Multi-AZ DB instance deployment maintains a synchronous standby in a second AZ; on primary failure RDS fails over automatically by repointing the DNS endpoint, typically in 60–120 seconds, with no loss of committed data (RPO ≈ 0). It is a reliability feature, and the standby serves no traffic until failover. A read replica is asynchronous and exists to scale reads (or for cross-Region DR); it can lag and is promoted manually. They solve different problems and are often used together.

Q6. The database is in an isolated subnet. How does the application reach it, and why this way? The DB security group allows inbound on the database port (e.g. 5432) from the application tier’s security group (a security-group reference, not a CIDR). The DB subnet has no route to a NAT or internet gateway, so it cannot reach or be reached from the internet. This is least-privilege networking: only the app tier can talk to the database, and the database is unreachable from outside the VPC even if a rule is misconfigured.
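A security-group reference looks like this when expressed as the parameters of an EC2 ingress rule. The sketch only builds the parameter dictionary; the two group IDs are made-up placeholders, and port 5432 is the example port from the answer above.

```python
def db_ingress_from_app_tier(db_sg_id, app_sg_id, port=5432):
    """Build ingress parameters that admit only the application tier's security group."""
    return {
        "GroupId": db_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                # The source is another security group, not a CIDR range,
                # so there is deliberately no "IpRanges" key in this rule.
                "UserIdGroupPairs": [
                    {"GroupId": app_sg_id, "Description": "App tier to database"}
                ],
            }
        ],
    }


# Placeholder IDs for illustration only.
params = db_ingress_from_app_tier("sg-0dbexample", "sg-0appexample")
print(params)
```

With boto3 installed and credentials configured, a dictionary of this shape can be passed to `authorize_security_group_ingress` on an EC2 client as keyword arguments. Because the rule names the app tier's group, new Fargate tasks are admitted automatically whatever private IP they receive.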

Q7. How do you give humans access without IAM users or access keys? Connect the corporate IdP to IAM Identity Center over SAML/SCIM, define reusable permission sets, and assign groups to accounts. Engineers authenticate with aws sso login and receive short-lived credentials scoped to the chosen account and role. There are no long-lived keys to leak and no standing admin — access is centrally managed and fully audited in CloudTrail.

Q8. What does a CloudTrail “organisation trail” to a separate Log Archive account buy you? A single trail capturing API activity across every account, written to an S3 bucket in a dedicated, locked-down Log Archive account that workload-account admins cannot reach. That gives you a complete, centralised, tamper-resistant audit trail; pairing it with an SCP that denies cloudtrail:StopLogging means even a compromised account cannot blind your auditing.

Q9. Map this architecture to the six Well-Architected pillars in one line each. Operational Excellence: IaC + dashboards/alarms + org CloudTrail. Security: multi-account + Identity Center + SCPs + GuardDuty + KMS + private subnets. Reliability: Multi-AZ RDS + Fargate across AZs + ALB health checks + backups. Performance Efficiency: Fargate auto scaling + Graviton + right-sized RDS. Cost Optimization: Budgets + tags + Spot/Savings Plans. Sustainability: Graviton + scale-to-demand + Spot + efficient storage.

Q10. State an RTO/RPO for an AZ failure and a Region failure, and how each is met. AZ failure: RTO ~1–2 minutes, RPO ≈ 0, met by Multi-AZ RDS automatic failover and Fargate tasks already running in the surviving AZ behind the ALB. Region failure: RTO hours, RPO = age of the last copied snapshot, met by restoring a cross-Region RDS snapshot and redeploying the IaC into the second Region (the backup-and-restore strategy). Moving to pilot light, warm standby or active-active reduces both at higher cost.

Q11. How do you stop costs surprising Finance? Enforce a tagging standard (CostCenter/Owner/Environment/Application) and activate those keys as cost allocation tags so Cost Explorer can slice spend by team; set per-account AWS Budgets with alerts at 80% and 100% so overruns surface before month-end; run non-prod on Fargate Spot and shut it down off-hours; and, once the baseline is known, cover steady-state Fargate with a Compute Savings Plan and steady-state RDS with Reserved Instances.
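Two of those controls are simple enough to model directly. The sketch below checks a resource's tags against the four-key standard and reports which budget thresholds a given spend has crossed; the function names and sample figures are illustrative assumptions, not part of any AWS API.

```python
REQUIRED_TAGS = ("CostCenter", "Owner", "Environment", "Application")


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key, "").strip()]


def crossed_thresholds(limit, actual, thresholds=(80, 100)):
    """Return the alert thresholds (in percent of the limit) that actual spend has reached."""
    return [pct for pct in thresholds if actual >= limit * pct / 100]


# A resource tagged with only two of the four required keys.
print(missing_tags({"Owner": "platform-team", "Environment": "dev"}))

# A 500-unit monthly budget with 410 spent so far.
print(crossed_thresholds(500, 410))
```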

Q12. A new team account is not picking up your guardrails. What is wrong? Almost always the account is not under the governed OU, or the SCP is attached to the wrong OU. SCPs flow only to accounts inside the targeted OU subtree (and never to the management account). Move the account under the correct OU (e.g. Workloads/Non-Prod), confirm the SCP is attached at or above that OU, and verify Control Tower enrolled the account so its baseline stacks are deployed.
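The inheritance rule behind that diagnosis can be sketched as a walk from the account up to the root, collecting whatever is attached along the way. The OU names, account name and SCP names below are hypothetical, and the model covers only where policies attach, not their contents.

```python
def applicable_scps(target, parent_of, attached):
    """Collect SCPs attached to the target and to every ancestor up to the root."""
    found = []
    node = target
    while node is not None:
        found.extend(attached.get(node, []))
        node = parent_of.get(node)  # None once the root has been processed
    return found


# Hypothetical hierarchy: Root > Workloads > Non-Prod.
parent_of = {"Workloads": "Root", "Non-Prod": "Workloads", "team-account": "Root"}
attached = {"Root": ["FullAWSAccess"], "Non-Prod": ["nonprod-guardrails"]}

# The new account sits directly under Root, so the Non-Prod guardrail never reaches it.
print(applicable_scps("team-account", parent_of, attached))

# Moving the account under Workloads/Non-Prod brings it into the guardrail's subtree.
parent_of["team-account"] = "Non-Prod"
print(applicable_scps("team-account", parent_of, attached))
```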

Quick check

  1. Why is an AWS account a stronger isolation boundary than a VPC or an IAM role within one account?
  2. Can an IAM policy grant a permission that an SCP denies? Why or why not?
  3. In the 3-tier design, which tier is public, which is private, and where does the database live?
  4. What single RDS setting gives you automatic failover to another Availability Zone?
  5. Why run an organisation CloudTrail into a separate Log Archive account rather than a per-account trail?

Answers

  1. Because the account is AWS’s hardest boundary — separate accounts have independent permissions (SCPs), quotas, billing and trust, so a compromise or runaway cost in one cannot reach another. A VPC or role within one account still shares that account’s single failure and trust domain.
  2. No. An SCP sets the maximum permissions for an account; the effective permission is the intersection of IAM allows and SCP limits, and an explicit deny always wins. If an SCP denies an action, no IAM allow can restore it (except in the management account, where SCPs do not apply).
  3. The ALB sits in the public subnets; the ECS Fargate application tier sits in the private subnets (no public IP); the RDS database lives in isolated subnets with no internet route, reachable only from the app tier’s security group.
  4. Multi-AZ — it maintains a synchronous standby in a second AZ and fails over automatically by repointing the DB endpoint, with RPO ≈ 0.
  5. An organisation trail captures every account’s API activity in one immutable, locked-down Log Archive account that workload admins cannot reach or disable — a complete, tamper-resistant audit trail — whereas per-account trails can be disabled locally and scatter the evidence.

Exercise

Extend the capstone with a deliberate resilience test and a Well-Architected gap analysis. First, in a non-production environment, take the 3-tier app you designed and simulate an AZ failure at the data tier: reboot the RDS instance with failover (aws rds reboot-db-instance --db-instance-identifier meridian-orders --force-failover) and confirm the application resumes serving once the standby is promoted, noting the actual recovery time. Then run a one-page Well-Architected review: for each of the six pillars, write the single biggest remaining gap in your build and the next remediation (for example, Reliability: “no cross-Region restore tested → schedule a quarterly DR game-day”; Cost: “no Savings Plan yet → buy a 1-year Compute Savings Plan once baseline is stable”). Conclude with the one remediation you would do first and why. Clean up afterward.
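One way to record the actual recovery time is to probe the application on a fixed interval during the failover and then compute the longest run of failures. The sketch below assumes you have already collected the probe results as (seconds since start, success) pairs; the sample data is invented for illustration.

```python
def longest_outage(probes):
    """Return the longest gap, in seconds, between the first failed probe of an
    outage and the next successful probe. Probes must be sorted by time.
    An outage still in progress at the end of the list is not counted."""
    longest = 0
    outage_start = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp          # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # outage ends
    return longest


# Invented sample: probes every 5 seconds, failing from t=10 until recovery at t=75.
sample = [(0, True), (5, True), (10, False), (15, False), (70, False), (75, True), (80, True)]
print(longest_outage(sample))  # 65
```

The measured figure is bounded by the probe interval, so a 5-second interval gives a recovery time accurate to roughly 5 seconds either way.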

Certification mapping

This capstone maps most directly to the AWS Certified Solutions Architect – Associate (SAA-C03) and Professional (SAP-C02) exams, and exercises domains from several others:

Glossary

Next steps

Congratulations — that is the AWS Zero-to-Hero capstone, and the end of the course. You have designed and built a governed multi-account landing zone with a production 3-tier workload on it and reviewed the whole thing against the six Well-Architected pillars: you can now talk through an end-to-end AWS environment in an interview with real authority.

To take any single pillar of this capstone to full production depth, build on the deeper KloudVin lessons:

Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
