Secure SFTP Ingestion Gateway for B2B Partner File Exchange on AWS

A national health-insurance payer runs its claims operation on a daily tide of files. Every night, 200-odd provider groups, clearinghouses, and pharmacy-benefit managers drop in EDI 837 claim batches, 835 remittance files, eligibility rosters, and the occasional multi-gigabyte member-enrollment dump. For fifteen years this ran on a single hardened SFTP box in a co-lo: one Linux server, a /home/partner_* directory per trading partner, a cron job that swept files into the claims mainframe, and a runbook that lived in one engineer’s head. Then three things happened in the same quarter. The box’s disk filled at 2 a.m. during open-enrollment season and silently dropped a clearinghouse’s submissions; an audit flagged that any partner who guessed another partner’s username could cd .. into a sibling’s claims data; and a partner uploaded a macro-laden spreadsheet that an analyst opened on the corporate LAN. The CISO’s verdict was blunt: the front door to the most regulated data the company holds — PHI under HIPAA, plus state insurance regulators watching — cannot be a pet server. The ask is a managed, per-partner-isolated, encrypted, virus-scanned ingestion gateway that an auditor will sign and that does not page someone at 2 a.m. This article is that reference architecture on AWS.

The pressures are the familiar B2B-integration ones, sharpened by healthcare. Isolation is non-negotiable: partner A must be unable to see partner B’s files, both cryptographically and by authorization policy, and “we set the directory permissions carefully” is not a control an auditor accepts. Compliance means every byte encrypted at rest and in transit, every login and every file event logged immutably, and a defensible answer to “who touched this claim file and when.” Safety means no partner-supplied file reaches a downstream system or a human’s laptop before it has been scanned and validated. And operability means the thing scales for open-enrollment spikes, costs money only when files actually move, and survives an Availability Zone failure without a heroics-driven recovery.

Why not the obvious shortcuts

Three tempting fixes each fail predictably, and naming why matters because someone on the project will propose all three.

Keep the EC2 SFTP server, just make it bigger and put it behind a load balancer. This inherits every problem that started the project — you still patch the OS, you still rely on POSIX permissions for tenant isolation, you still own the disk that fills, and a load balancer in front of stateful per-partner home directories is its own distributed headache. You have made the pet server a herd of pets.

Have partners push to a public S3 bucket with presigned URLs or per-partner IAM keys. Most trading partners’ integration stacks speak SFTP and nothing else — their side is a scheduled sftp put from middleware a vendor wrote a decade ago, and “please rewrite your file delivery to use the AWS SDK” is a multi-quarter negotiation with 200 counterparties that will simply not happen. SFTP is the lingua franca of B2B file exchange precisely because it is everywhere.

Buy a heavyweight managed file-transfer (MFT) appliance and run it in the cloud. A virtual appliance from a traditional MFT vendor gives you a rich workflow engine, but you are back to capacity-planning instances, licensing per-connection, and patching a vendor’s image — and it does not natively land files in S3 where the rest of your event-driven AWS pipeline wants them.

The threading-the-needle answer is AWS Transfer Family: a fully managed SFTP (and FTPS/FTP) endpoint that speaks the protocol partners already use, authenticates each partner independently, and writes directly into S3 — turning every uploaded file into an S3 object event that the rest of the architecture can react to. The protocol stays boringly compatible for the partner; the back end becomes serverless, isolated, and event-driven for you.

Architecture overview

[Figure: architecture diagram of the Secure SFTP Ingestion Gateway for B2B Partner File Exchange on AWS]

The gateway runs three conceptually distinct flows that share infrastructure: the transfer flow (a partner authenticates and uploads), the validation-and-quarantine flow (event-driven scanning and schema checks before anything is trusted), and the promotion flow (clean files handed to downstream claims processing). Keeping them separate in your head is the first step to operating this well.

The defining property of the topology is the one the auditor cares about most: isolation is enforced by identity and IAM, not by filesystem permissions. Each trading partner is a distinct Transfer Family user whose session is locked to a per-partner S3 prefix by a scoped-down IAM session policy and a logical home directory that makes that prefix look like / to the partner. Partner A’s SFTP session is issued credentials that structurally cannot reach partner B’s prefix: the session policy grants access only to partner A’s own prefix, so any request outside it is denied by the policy the API evaluates, not by a chmod someone hopes is right.

Transfer flow, following the control flow:

A trading partner’s middleware opens an SFTP connection to the gateway’s stable endpoint. Partners reach a fixed set of Elastic IPs fronting the Transfer Family server (so partner firewalls can allowlist us), and Akamai sits at the edge providing DDoS protection and an allowlist/geo-fencing layer in front of the published IPs, so connection floods and traffic from regions no partner operates in are shed before they reach AWS.
Transfer Family invokes a custom-identity-provider Lambda to authenticate the session. The payer runs its internal operators and the partner-onboarding console through Okta (federated to Microsoft Entra ID where Azure-side tooling needs a token), but partners themselves authenticate with SSH public keys they registered during onboarding. For a key-based login the identity Lambda returns the partner’s registered public keys and Transfer Family checks the presented key against them; for partners still on password auth, the Lambda verifies the password itself. Keys and password material are held in HashiCorp Vault rather than in the Lambda or an environment variable, so the secret material is leased, rotated, and audited centrally.
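On a successful auth, the identity Lambda returns the partner’s scoped session policy, IAM role ARN, and a logical home-directory mapping (HomeDirectoryDetails in the Lambda response, the counterpart of HomeDirectoryMappings on a service-managed user) that pins the session to s3://payer-claims-landing/partners/<partner-id>/. The partner sees a clean home directory; the policy grants nothing outside that prefix, so every other path is implicitly denied.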
The partner uploads. Transfer Family streams the bytes straight into the S3 landing bucket, encrypted with a per-partner AWS KMS key (SSE-KMS). No file lands on any instance disk we own; there is no disk to fill.
The completed upload emits an S3 ObjectCreated event (and Transfer Family logs the transfer to CloudWatch / CloudTrail). The transfer flow is done; the file exists but is explicitly untrusted.
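As a minimal sketch of the authentication step above, this is roughly what the identity Lambda could return for a key-based login. The response field names (Role, Policy, HomeDirectoryType, HomeDirectoryDetails, PublicKeys) and the event fields (username, sourceIp) follow the Transfer Family custom identity provider contract; the PARTNERS dictionary, the account ID, the role name, and the placeholder key are hypothetical stand-ins for the Vault-backed partner record.

```python
import ipaddress
import json

BUCKET = "payer-claims-landing"

# Hypothetical partner records; in the design above these would come from Vault.
PARTNERS = {
    "acme-clearinghouse": {
        "role_arn": "arn:aws:iam::111122223333:role/partner-acme-clearinghouse",
        "public_keys": ["ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIexamplekeyonly"],
        "allowed_cidrs": ["203.0.113.0/24"],
    }
}


def session_policy(bucket: str) -> str:
    """Render the scoped session policy as the JSON string Transfer Family expects."""
    # Transfer Family resolves this variable per session, not Python.
    prefix = "partners/${transfer:UserName}/*"
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}",
             "Condition": {"StringLike": {"s3:prefix": [prefix]}}},
            {"Effect": "Allow",
             "Action": ["s3:PutObject", "s3:GetObject", "s3:GetObjectVersion"],
             "Resource": f"arn:aws:s3:::{bucket}/{prefix}"},
        ],
    })


def handler(event, _ctx=None):
    """Return the session settings for a known partner, or {} to refuse the login."""
    user = event.get("username", "")
    partner = PARTNERS.get(user)
    if partner is None:
        return {}  # an empty response means authentication is refused
    source = ipaddress.ip_address(event["sourceIp"])
    if not any(source in ipaddress.ip_network(c) for c in partner["allowed_cidrs"]):
        return {}  # per-partner source-IP allowlist
    return {
        "Role": partner["role_arn"],
        "Policy": session_policy(BUCKET),
        "HomeDirectoryType": "LOGICAL",
        # HomeDirectoryDetails is a JSON-encoded string, not a nested object.
        "HomeDirectoryDetails": json.dumps(
            [{"Entry": "/", "Target": f"/{BUCKET}/partners/{user}"}]),
        "PublicKeys": partner["public_keys"],
    }
```

Returning the public keys rather than comparing them in the Lambda leaves the actual key check to Transfer Family, so the Lambda only decides who the partner is and what the session may touch.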

Validation-and-quarantine flow, event-driven and the security heart of the design:

The S3 event lands on an SQS queue, which buffers bursts and provides retries, and the queue triggers a validation Lambda. The file is first scanned for malware: the Lambda invokes a ClamAV-based scanning function (the AV engine and signature database packaged in a container image on Lambda, or fronted by a small Fargate task for very large files). Until the scan returns clean, the object is treated as quarantined.
Infected or unscannable files are moved to a separate quarantine bucket (a different account-boundary-friendly bucket with its own restrictive policy), the partner’s upload is rejected, and a CrowdStrike Falcon-monitored alert plus a ServiceNow incident are raised so the SOC and the partner-operations team both have a ticket, not just a log line.
Clean files pass schema validation — an EDI 837/835 well-formedness and envelope check, file-naming and size-range checks, and a duplicate-control-number check against recent submissions. Malformed-but-clean files go to a rejected/ prefix with a machine-readable reason the partner can self-serve.
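The envelope part of the schema check can be sketched as follows. This is an illustration under stated assumptions, not the payer's validator: it relies on the X12 convention that the ISA segment is 106 characters wide, that ISA13 must match IEA02, and that IEA01 counts functional groups. The function names, the reason strings, and the synthetic sample interchange are all hypothetical, and a production check would also verify GS/GE and ST/SE counts.

```python
from dataclasses import dataclass

ISA_LENGTH = 106  # the X12 ISA segment is fixed-width, terminator included


@dataclass
class EnvelopeResult:
    ok: bool
    reason: str = ""


def check_envelope(raw: str, allowed=("837", "835")) -> EnvelopeResult:
    """Check the X12 interchange envelope without parsing claim content."""
    if len(raw) < ISA_LENGTH or not raw.startswith("ISA"):
        return EnvelopeResult(False, "missing-or-short-ISA")
    elem_sep = raw[3]                 # separator follows the "ISA" tag
    seg_term = raw[ISA_LENGTH - 1]    # terminator is the last ISA character
    segments = [s.strip().split(elem_sep)
                for s in raw.split(seg_term) if s.strip()]
    isa, last = segments[0], segments[-1]
    if len(isa) != 17:
        return EnvelopeResult(False, "malformed-ISA")
    if last[0] != "IEA" or len(last) < 3:
        return EnvelopeResult(False, "missing-IEA")
    if isa[13] != last[2]:            # ISA13 must equal IEA02
        return EnvelopeResult(False, "control-number-mismatch")
    groups = sum(1 for s in segments if s[0] == "GS")
    if last[1] != str(groups):        # IEA01 counts functional groups
        return EnvelopeResult(False, "group-count-mismatch")
    tx_types = [s[1] for s in segments if s[0] == "ST" and len(s) > 1]
    if not tx_types:
        return EnvelopeResult(False, "no-transaction-sets")
    for tx in tx_types:
        if tx not in allowed:
            return EnvelopeResult(False, f"unexpected-transaction-set:{tx}")
    return EnvelopeResult(True)


def sample_interchange(control: str = "000000001", tx: str = "837") -> str:
    """Build a tiny synthetic interchange for demonstration only."""
    isa_fields = ["00", " " * 10, "00", " " * 10, "ZZ", "SENDER".ljust(15),
                  "ZZ", "RECEIVER".ljust(15), "240101", "1200", "^", "00501",
                  control, "0", "P", ":"]
    isa = "ISA*" + "*".join(isa_fields) + "~"
    body = ("GS*HC*SENDER*RECEIVER*20240101*1200*1*X*005010X222A1~"
            f"ST*{tx}*0001~SE*2*0001~GE*1*1~IEA*1*{control}~")
    return isa + body
```

The reason string is what would be written alongside the file under rejected/, so a partner can see "control-number-mismatch" rather than a generic failure.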

Promotion flow: a validated file is copied to a trusted/curated prefix (or a separate processing bucket), which fires the downstream event that hands it to claims adjudication — historically the mainframe via an AWS-side adapter, increasingly a Step Functions workflow. Only files that are encrypted, scanned, and schema-valid ever reach this stage.

Component breakdown

Component	Service / tool	Role in the gateway	Key configuration choices
Edge protection	Akamai	DDoS mitigation, IP allowlist / geo-fence ahead of the SFTP IPs	Allow only known partner egress ranges; drop floods before AWS
SFTP endpoint	AWS Transfer Family	Managed multi-protocol server partners connect to	VPC-hosted endpoint; internet-facing via fixed EIPs; SFTP only
Partner auth	Custom IdP Lambda + Vault	Per-partner key/password validation and policy issuance	Logical home-directory mapping (HomeDirectoryDetails); scoped session policy per partner
Operator SSO	Okta + Microsoft Entra ID	Internal admin & onboarding-console SSO	OIDC; Entra federation for Azure-side tooling; MFA + conditional access
Secrets	HashiCorp Vault	Partner SSH keys, password material, signing keys	KV/SSH engine; short leases; audited retrieval from the IdP Lambda
Landing store	Amazon S3 (landing)	Per-partner-prefixed encrypted inbox	SSE-KMS per-partner CMK; Block Public Access; Object Lock optional
Encryption	AWS KMS	Per-partner customer-managed keys	One CMK per partner; key policy scoped to that partner’s role
Event buffer	Amazon SQS	Decouples S3 events from validation, gives retries/DLQ	Visibility timeout > scan time; dead-letter queue for poison files
Malware scan	Lambda + ClamAV (container)	Virus/malware scan before trust	Signature DB refreshed on schedule; large files to Fargate
Validation	AWS Lambda	EDI well-formedness, naming, dedupe, size checks	Idempotent on object key; structured reject reasons
Quarantine	Amazon S3 (quarantine)	Isolation of infected/unscannable files	Separate bucket, restrictive policy, no downstream read
Posture / data security	Wiz	CSPM, S3 exposure & sensitive-data scanning, attack paths	Agentless scan; alert on any public-exposure or ACL drift
Runtime security	CrowdStrike Falcon	Runtime threat detection on scan/processing compute	Sensor on Fargate/EC2; detections to the SOC
Observability	Datadog	Metrics, logs, traces, SLOs for the pipeline	Log pipeline from CloudWatch; monitors on lag and reject rate
ITSM / approvals	ServiceNow	Partner onboarding approvals, incidents on quarantine	Change gate to enable a partner; auto-ticket on infected file
CI / IaC	Jenkins / GitHub Actions + Terraform	Build, test, deploy infra and Lambdas	OIDC to AWS (no static keys); per-partner config as code

A few of these choices deserve the why, because they are the ones teams get wrong.

Why a custom identity provider, not Transfer Family’s service-managed users. Service-managed users work for a handful of partners, but at 200 partners onboarded and offboarded continuously, you want each partner’s keys, policy, and home mapping to be data your onboarding pipeline writes — not console clicks. The custom-IdP Lambda lets onboarding be a Terraform/ServiceNow-driven workflow: a partner record (their SSH public key, their prefix, their scoped policy) is created through an approved change, and the Lambda reads it at connection time. It is also where you enforce source-IP allowlisting per partner and pull key material from Vault instead of baking it in.

Why per-partner KMS keys, not one bucket key. A single bucket CMK encrypts everyone’s data under one key — fine until you must prove that revoking one partner’s access cryptographically severs their data, or until one partner’s contract requires their own key. A per-partner CMK whose key policy grants decrypt only to that partner’s IAM role makes isolation provable at the cryptographic layer and makes offboarding a key-policy change. The cost is more keys to manage; Terraform makes that a loop, and KMS key cost is trivial against the audit benefit.
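A per-partner key policy of the kind described above might be generated like this. It is a sketch under assumptions: the statement Sids, the function names, and the choice of data-plane actions are illustrative, and the account-root statement means IAM policies in the account can still delegate key use, so the "only this role" claim also depends on those IAM policies being tight. The scanner and validation roles would need their own statements, omitted here.

```python
def partner_key_policy(account_id: str, partner_role_arn: str) -> dict:
    """Build a KMS key policy that lets one partner role use one key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Keeps the key manageable by the owning account.
                "Sid": "AccountAdministration",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                # Data-plane use is limited to this partner's role.
                "Sid": "PartnerRoleDataPlaneOnly",
                "Effect": "Allow",
                "Principal": {"AWS": partner_role_arn},
                "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
                "Resource": "*",
            },
        ],
    }


def offboard(policy: dict) -> dict:
    """Return a copy of the policy with the partner's statement removed."""
    kept = [s for s in policy["Statement"]
            if s["Sid"] != "PartnerRoleDataPlaneOnly"]
    return {**policy, "Statement": kept}
```

Offboarding then becomes the key-policy change the paragraph describes: apply the output of offboard() and the partner role can no longer decrypt anything written under that key.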

Why scan in quarantine before the file is ever “real.” The original incident — an analyst opening a malicious upload — happened because partner files were trusted on arrival. Here, an uploaded object is inert: nothing downstream can read the landing prefix, and only the validation Lambda (and the AV scanner) touch it until it is proven clean and promoted. This is the single most important inversion in the design: files are guilty until proven innocent, and the proof is automated.

Implementation guidance

Provision with Terraform, and treat IAM scoping as the first deliverable. The control that the whole architecture rests on is the per-partner session policy, so get it right before anything else. A scoped session policy locks a partner’s SFTP session to its own prefix:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListOwnPrefixOnly",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::payer-claims-landing",
      "Condition": { "StringLike": {
        "s3:prefix": ["partners/${transfer:UserName}/*"] } }
    },
    {
      "Sid": "ReadWriteOwnObjectsOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::payer-claims-landing/partners/${transfer:UserName}/*"
    }
  ]
}

The ${transfer:UserName} substitution is what makes one policy template safely serve every partner — the session can only ever name its own prefix. Pair it with a logical home directory so the partner never sees the bucket structure at all:

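# Note: aws_transfer_user applies to servers with service-managed users.
# With the custom identity provider, the IdP Lambda returns the equivalent
# logical mapping at login instead of this resource.
resource "aws_transfer_user" "partner" {
  server_id           = aws_transfer_server.sftp.id
  user_name           = var.partner_id
  role                = aws_iam_role.partner[var.partner_id].arn
  home_directory_type = "LOGICAL"
  home_directory_mappings {
    entry  = "/"
    target = "/payer-claims-landing/partners/${var.partner_id}"
  }
}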
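To make the scoping argument concrete, here is a deliberately simplified model of how the rendered Resource pattern behaves. It is not IAM's evaluation engine: it only substitutes the user name and applies wildcard matching, and the function names are hypothetical. It does show why the trailing slash after the user name matters.

```python
from fnmatch import fnmatchcase

BUCKET_ARN = "arn:aws:s3:::payer-claims-landing"
RESOURCE_TEMPLATE = BUCKET_ARN + "/partners/${transfer:UserName}/*"


def render(template: str, user_name: str) -> str:
    """Mimic the per-session substitution Transfer Family performs."""
    return template.replace("${transfer:UserName}", user_name)


def may_access(user_name: str, object_key: str) -> bool:
    """True if the object key falls under the session's own prefix."""
    pattern = render(RESOURCE_TEMPLATE, user_name)
    return fnmatchcase(f"{BUCKET_ARN}/{object_key}", pattern)
```

For example, may_access("partner-a", "partners/partner-b/in/claims.837") is False, and so is may_access("partner-a", "partners/partner-ab/in/claims.837"), because the pattern requires a slash directly after the session's own name. A check of this shape is a cheap unit test to run in CI alongside the end-to-end cross-prefix smoke test.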

Stand up the endpoint as VPC-hosted with fixed EIPs. Run the Transfer Family server with a VPC endpoint type, internet-facing, with allocated Elastic IPs so partner firewalls can allowlist a stable set of addresses for years. Put Akamai in front for DDoS and geo/IP allowlisting. Enable SFTP only (drop plain FTP entirely), pin the server to a modern security policy (strong KEX/cipher suites, no legacy algorithms), and publish your host key fingerprint to partners so they can verify they are connecting to you.
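Partners usually compare the published fingerprint with what their SFTP client shows on first connect. A small sketch of computing the OpenSSH-style SHA256 fingerprint from a public key line follows; the helper names are hypothetical, and the dummy key is structurally valid but meaningless, standing in for the server's real host public key.

```python
import base64
import hashlib
import struct


def sha256_fingerprint(public_key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of a public key line."""
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<key-type> <base64-blob> [comment]'")
    blob = base64.b64decode(parts[1], validate=True)
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the base64 digest without trailing padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def dummy_ed25519_line() -> str:
    """Build a placeholder ed25519 key line for the demo (all-zero key)."""
    key_type = b"ssh-ed25519"
    blob = (struct.pack(">I", len(key_type)) + key_type
            + struct.pack(">I", 32) + bytes(32))
    return "ssh-ed25519 " + base64.b64encode(blob).decode("ascii") + " demo@example"


if __name__ == "__main__":
    print(sha256_fingerprint(dummy_ed25519_line()))
```

Publishing the fingerprint through a channel other than the SFTP connection itself (the onboarding packet, for instance) is what makes the comparison meaningful.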

Kill static credentials in the pipeline. The CI/CD pipeline that applies this — Jenkins for the regulated, on-prem-adjacent build stages and GitHub Actions for the cloud-native Lambda/Terraform stages — authenticates to AWS via OIDC federation, so there is no long-lived access key sitting in a credentials store to leak. The same pipeline runs unit tests on the validation Lambda and a smoke test that uploads a known-good and a known-bad (EICAR test) file end to end before promoting a change.

Wire the validation chain idempotently. S3 events can deliver more than once, so the validation Lambda must be idempotent on the object key — re-processing the same upload must not double-submit a claim batch. Use the S3 object version (or an item in a DynamoDB processing-ledger table) as the idempotency token, and let SQS plus a dead-letter queue absorb retries so a single poison file never blocks the queue. A trimmed validation handler shows the quarantine inversion:

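def handler(event, _ctx):
    # parse_s3_event, ledger_*, av_scan, edi_valid, move_to, promote and the
    # ok/blocked/rejected result helpers are defined elsewhere in the module.
    rec = parse_s3_event(event)            # bucket, key, version, partner_id
    token = f"{rec.key}#{rec.version}"     # key + version identifies one upload
    if ledger_seen(token):                 # idempotency: skip redelivered events
        return ok("already-processed")
    verdict = av_scan(rec)                 # ClamAV container / Fargate
    if verdict != "clean":
        move_to(rec, QUARANTINE_BUCKET)    # isolate, do not delete (evidence)
        servicenow_incident(rec, verdict)  # ticket for SOC + partner ops
        ledger_commit(token)               # a retry must not raise a second ticket
        return blocked(verdict)
    if not edi_valid(rec):                 # 837/835 envelope, dedupe, naming
        move_to(rec, f"rejected/{rec.key}", reason="schema")
        ledger_commit(token)               # rejected outcomes are final too
        return rejected("schema")
    promote(rec, TRUSTED_PREFIX)           # only now is the file trusted
    ledger_commit(token)
    return ok("promoted")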
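A check-then-commit ledger leaves a window in which two concurrent deliveries of the same event both pass the check. One way to close it is to claim the token atomically before doing any work. The sketch below uses an in-memory stand-in so it runs anywhere; with a DynamoDB ledger the claim step would typically be a PutItem guarded by an attribute_not_exists condition. The class and function names are hypothetical.

```python
import threading


class InMemoryLedger:
    """Stand-in for a ledger table keyed on the idempotency token."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def claim(self, token: str) -> bool:
        """Atomically record the token; False if it was already present."""
        with self._lock:
            if token in self._items:
                return False
            self._items[token] = "in-progress"
            return True

    def complete(self, token: str, outcome: str) -> None:
        with self._lock:
            self._items[token] = outcome

    def release(self, token: str) -> None:
        """Drop a claim so a later retry can process the file."""
        with self._lock:
            self._items.pop(token, None)


def process_once(ledger: InMemoryLedger, token: str, work) -> str:
    """Run work() at most once per token across redeliveries."""
    if not ledger.claim(token):
        return "already-processed"
    try:
        outcome = work()
    except Exception:
        ledger.release(token)  # let the queue's retry pick it up again
        raise
    ledger.complete(token, outcome)
    return outcome
```

Releasing the claim on failure keeps a transient scanner error from permanently marking a file as handled; the dead-letter queue still bounds how many times that retry can happen.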

Enterprise considerations

Security & Zero Trust. The gateway is Zero Trust by construction: every partner session is independently authenticated and authorized to exactly one prefix, every object is encrypted with a partner-scoped key, and no public data-plane surface exists (S3 Block Public Access on, no anonymous access, endpoint reachable only over SFTP through allowlisted IPs). Layer on top: (a) Wiz running continuous CSPM and sensitive-data scanning across the landing, quarantine, and trusted buckets, alerting the instant any bucket drifts toward public exposure, a KMS key policy widens, or PHI is detected in an unexpected prefix (the posture backstop behind the IAM controls); (b) CrowdStrike Falcon sensors on the AV-scanning and processing compute for runtime threat detection, feeding the payer’s SOC; (c) an AWS Config rule set that flags any S3 bucket created without encryption or with public access, with Wiz as the independent verifier that the rules are actually holding; (d) GuardDuty for anomalous-access detection on the buckets. A quarantine event auto-raises a ServiceNow incident so security has a tracked record, and the infected object is moved, not deleted, so it remains as forensic evidence.

Cost optimization. The serverless shape is the cost story — you pay for files that actually move, not for idle capacity. Engineer the few levers that matter:

Lever	Mechanism	Typical effect
No idle compute	Transfer Family bills per protocol-hour enabled + per GB transferred; Lambda bills per invocation	No 24/7 EC2 SFTP fleet to pay for
S3 lifecycle	Transition landed/processed files to IA, then Glacier; expire rejects	Cuts storage cost on retained claim files
Right-size scanning	Small files on Lambda, only large dumps to Fargate	Avoids paying Fargate for the common case
SQS batching	Batch S3 events into fewer Lambda invokes	Fewer invocations during enrollment spikes
Endpoint consolidation	One Transfer server, many users, not one server per partner	Avoids per-server hourly charges multiplying

The one cost to watch is the Transfer Family per-protocol-hour charge, which accrues whether or not files flow; consolidating all partners onto one server (isolated by IAM, not by server) is what keeps that flat. Pipe cost and throughput metrics to Datadog for the chargeback view that partner-operations and finance share.

Scalability. Each tier scales independently and mostly on its own. Transfer Family absorbs concurrent partner sessions as a managed service; S3 has effectively unlimited landing capacity (the disk-full failure is designed out); Lambda scales validation concurrency automatically, with SQS as the shock absorber for an open-enrollment burst when a roster file lands from every partner the same night; Fargate scales out for the occasional multi-gigabyte enrollment dump that is too large for a Lambda’s runtime. The natural ceilings to plan for are Lambda concurrency limits in the account (request an increase before enrollment season) and any per-partner SFTP session concurrency a partner negotiates.
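The Lambda-versus-Fargate split can be expressed as a simple routing rule. In this sketch the only hard fact is Lambda's 900-second maximum timeout; the scan throughput and the headroom factor are hypothetical numbers that would need to be measured for the actual scanner image.

```python
LAMBDA_TIMEOUT_S = 900          # Lambda's maximum function timeout
MIB = 1024 * 1024


def choose_scanner(size_bytes: int,
                   scan_mib_per_s: float = 5.0,   # assumed scanner throughput
                   headroom: float = 0.5) -> str:  # use at most half the timeout
    """Route a file to the Lambda scanner or to a Fargate task by estimated scan time."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    estimated_s = size_bytes / (scan_mib_per_s * MIB)
    budget_s = LAMBDA_TIMEOUT_S * headroom
    return "lambda" if estimated_s <= budget_s else "fargate"
```

With these assumed numbers a 50 MiB claim batch stays on Lambda and a 4 GiB enrollment dump goes to Fargate; the cut-over moves as soon as real throughput is measured.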

Failure modes, and what each one looks like. Name them before they page you.

A partner disconnects partway through uploading a 4 GB file, leaving a partial object that must not be processed as if complete. Mitigation: process only on the ObjectCreated:Put/CompleteMultipartUpload event for the finished object, and let size-range/checksum validation reject truncated files.
The AV signature database is stale, so a known-bad file scans clean. Mitigation: a scheduled signature-refresh job, plus a freshness check that fails the scanner closed (treat as unscannable → quarantine) if signatures are older than a threshold.
A poison file loops in the queue: a malformed event repeatedly crashes the validation Lambda, burning retries and concurrency that healthy files need. Mitigation: SQS dead-letter queue after N attempts, with a Datadog monitor on DLQ depth.
Misconfigured session policy — a template change accidentally widens a partner’s prefix scope. Mitigation: policy generated from one reviewed Terraform template, Wiz/Config asserting no cross-prefix access, and the EICAR-plus-cross-prefix smoke test in CI.
Duplicate submission — a partner re-sends the same claim batch after a timeout and it is adjudicated twice. Mitigation: the idempotency ledger and duplicate-control-number check in validation.
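The stale-signature mitigation in the list above is worth showing, because "fail closed" is easy to get backwards. A minimal sketch, assuming a hypothetical 24-hour freshness threshold and a scan callable that returns True for a clean file:

```python
from datetime import datetime, timedelta, timezone

MAX_SIGNATURE_AGE = timedelta(hours=24)  # hypothetical threshold


def gated_verdict(signatures_updated_at: datetime, scan,
                  now: datetime = None,
                  max_age: timedelta = MAX_SIGNATURE_AGE) -> str:
    """Return 'clean', 'infected', or 'unscannable', failing closed."""
    now = now or datetime.now(timezone.utc)
    if now - signatures_updated_at > max_age:
        return "unscannable"        # stale signatures: do not trust a clean result
    try:
        return "clean" if scan() else "infected"
    except Exception:
        return "unscannable"        # scanner errors also fail closed
```

Anything other than "clean" follows the quarantine path, so a scanner that cannot vouch for a file never lets it through by default.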

Reliability & DR (RTO/RPO). S3 is multi-AZ and eleven-nines durable by default, so a landed file survives an AZ loss with zero data loss; the original single-server, single-disk failure is engineered away. The Transfer Family endpoint is a managed, multi-AZ service within a Region. For Regional DR, replicate the landing and trusted buckets to a paired Region with S3 Replication (with replica KMS keys), keep the partner-config and Terraform state portable, and stand up a warm Transfer Family endpoint in the second Region behind the same Akamai layer so failover is a DNS/IP cutover the partners never see. A pragmatic target for this gateway: RTO 30 minutes, RPO near-zero for landed data (replication is continuous but asynchronous, so objects landed moments before a Regional event may not yet be replicated), with the understanding that in-flight transfers at the moment of a Regional event are simply re-sent by the partner. SFTP itself has no built-in retry, but partner middleware that re-sends after a failed transfer gives at-least-once delivery, which is a feature here.

Observability. Instrument the pipeline end to end in Datadog: a log pipeline from CloudWatch carrying every Transfer Family auth and transfer event, plus the validation Lambda’s structured verdicts. Emit the metrics the business actually cares about — files received per partner, ingestion lag (upload-to-promotion latency), quarantine and reject rates per partner (a partner whose reject rate spikes has broken their export), scan throughput, and DLQ depth. Set SLO monitors so a stuck pipeline or a partner suddenly sending garbage surfaces on its own, and alert into the same ServiceNow queue. CloudTrail and the immutable transfer logs give the audit answer to “who uploaded which claim file and when,” which is the question the HIPAA auditor will ask first.
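The per-partner metrics named above can be derived from the validation Lambda's structured verdicts. The sketch below computes them in plain Python from a list of records; the FileEvent shape and field names are hypothetical, and in practice the same numbers would be emitted as metrics rather than computed in a batch.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileEvent:
    partner_id: str
    uploaded_at: datetime
    finished_at: datetime
    outcome: str  # "promoted", "rejected", or "quarantined"


def partner_metrics(events) -> dict:
    """Aggregate received counts, reject/quarantine rates, and worst ingestion lag."""
    by_partner = defaultdict(list)
    for event in events:
        by_partner[event.partner_id].append(event)
    metrics = {}
    for partner_id, evs in by_partner.items():
        # Ingestion lag is upload-to-promotion, so only promoted files count.
        lags = [(e.finished_at - e.uploaded_at).total_seconds()
                for e in evs if e.outcome == "promoted"]
        metrics[partner_id] = {
            "files_received": len(evs),
            "reject_rate": sum(e.outcome == "rejected" for e in evs) / len(evs),
            "quarantine_rate": sum(e.outcome == "quarantined" for e in evs) / len(evs),
            "max_ingestion_lag_s": max(lags) if lags else None,
        }
    return metrics
```

A monitor on reject_rate per partner is what surfaces the "partner has broken their export" case without anyone reading logs.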

Governance. Partner onboarding and offboarding run through ServiceNow change approval — enabling a partner is a documented, approved action that triggers the Terraform that writes their record, never an ad-hoc console click. Keep every partner’s prefix, scoped policy, KMS key, and allowlisted source IPs as version-controlled configuration so the entire partner population is reviewable and reproducible. Retain landed claim files under an S3 lifecycle and, where regulators require tamper-proofing, S3 Object Lock in compliance mode so a claim file cannot be altered or deleted within its retention window. Log every transfer for audit and incident review, with the immutability that PHI handling demands.
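Keeping partner configuration in version control makes it possible to lint the whole partner population before a change is approved. A sketch of such a check, assuming a hypothetical record shape with a prefix and an allowed_cidrs list per partner:

```python
import ipaddress


def validate_partners(partners: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config passes."""
    errors = []
    for pid, cfg in sorted(partners.items()):
        prefix = cfg.get("prefix", "")
        if not (prefix.startswith("partners/") and prefix.endswith("/")):
            errors.append(f"{pid}: prefix must look like partners/<id>/")
        cidrs = cfg.get("allowed_cidrs", [])
        if not cidrs:
            errors.append(f"{pid}: no source-IP allowlist")
        for cidr in cidrs:
            try:
                net = ipaddress.ip_network(cidr)
            except ValueError:
                errors.append(f"{pid}: invalid CIDR {cidr}")
                continue
            if net.prefixlen == 0:
                errors.append(f"{pid}: {cidr} allows any source")
    # No partner's prefix may sit inside another partner's prefix.
    ids = sorted(partners)
    for a in ids:
        for b in ids:
            pa = partners[a].get("prefix", "")
            pb = partners[b].get("prefix", "")
            if a != b and pa and pb.startswith(pa):
                errors.append(f"{b}: prefix overlaps {a}")
    return errors
```

Run as a CI step, a non-empty result blocks the change before the ServiceNow approval is even requested.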

Explicit tradeoffs

Accept these or do not build it. The managed, event-driven shape trades a familiar single server for a set of moving parts — an identity Lambda, a validation Lambda, a scanner, queues, and per-partner IAM and KMS objects — that you must wire, test, and monitor as a system. The custom identity provider is real code on the critical path: if the IdP Lambda is down, no partner can authenticate, so it needs the same rigor (tests, alarms, concurrency headroom) as the data path. Per-partner KMS keys and per-partner policies multiply object count; Terraform tames it, but it is more state to manage than one bucket and one key. And the quarantine-first inversion adds latency between upload and downstream availability — a clean file is not instantly “in the system,” it is scanned and validated first — which is exactly the safety you wanted but is a behavior change partners and downstream teams must understand.

The alternatives, and when they win. If you have a handful of long-lived partners and rarely onboard, Transfer Family’s service-managed users skip the custom IdP entirely and are simpler — graduate to the custom IdP when partner churn makes console management painful. If your partners can genuinely speak the AWS API, direct authenticated S3 uploads (per-partner roles, presigned URLs) remove the SFTP layer and its per-hour cost — but in B2B healthcare that is a fantasy for most counterparties. If you need rich, human-in-the-loop transformation workflows (complex routing, format translation, manual approvals per file), a dedicated MFT virtual appliance earns its license — at the cost of capacity-planning and patching instances. And if you need full AS2 (signed, receipted EDI over HTTP) because a partner contract mandates it, Transfer Family supports AS2 too and slots into this same landing-and-validation back end.

The shape of the win

For the payer, the payoff is not “SFTP in the cloud.” It is that on the worst night of open-enrollment season, 200 partners drop their files, the disk does not fill because there is no disk, partner A provably cannot see partner B’s claims because the IAM session policy makes it impossible, every file is encrypted under that partner’s own key the moment it lands, a malicious spreadsheet is moved to quarantine and ticketed before any analyst could open it, and the on-call engineer sleeps — because a stuck partner shows up as a Datadog SLO breach and a ServiceNow ticket, not a 2 a.m. phone call. And when the HIPAA auditor asks “show me who touched this claim file and prove partner isolation,” the answer is a CloudTrail query and an IAM policy, not a promise about directory permissions. Everything upstream — the per-partner KMS keys, the scoped session policies, the Vault-held SSH keys, the Wiz posture scanning, the CrowdStrike-watched scanner, the quarantine-first validation — exists to make that auditor, that CISO, and that partner-operations lead each say yes. Start with a few partners on a single server and one validation Lambda if you must; this is where a regulated, at-scale B2B file front door has to land.


Written by Vinod
