
Computer Vision: Edge + Cloud Inference with Triton

A logistics operator runs 140 cross-dock facilities, and every one of them has the same problem at the loading bay: a truck pulls in, a forklift moves twelve pallets across a camera’s field of view in ninety seconds, and the system has to read each shipping label, verify the pallet count against the manifest, and flag damage — before the forklift drives away. There is no time to round-trip a 4K frame to a cloud region 60 ms away and wait for a response; at 30 fps that backlog is unrecoverable, and a single dropped frame is a mis-scanned pallet that becomes a chargeback three weeks later. But there is also no way to train and govern 140 independently-drifting models by hand. The answer is not “edge” or “cloud” — it is a deliberate split: inference at the edge for latency and resilience, training and governance in the cloud, and a hardened pipeline to push new model versions to the fleet without a truck roll. This article is a reference architecture for building that split properly, with NVIDIA Triton Inference Server as the serving substrate on both ends.

The business scenario

The driver is almost always physical-world latency plus fleet scale plus uptime that cannot depend on a WAN link. A cross-dock label-reader has a ~100 ms budget from frame to decision and 30 fps to keep up with; a retailer’s shrink-detection cameras watch self-checkout lanes where a 300 ms stall means the customer has already bagged the item; a manufacturer’s defect-inspection rig on a line moving 600 parts a minute simply cannot wait on anything off-site. In all three, the camera is in a building whose internet uplink is a single business-grade circuit that goes down for an afternoon a few times a year. If inference lives in the cloud, an uplink outage stops the line. That is unacceptable, so inference moves to a box on-site.

The naive fixes fail predictably. Pure-cloud inference — stream every frame to a GPU endpoint — burns egress bandwidth (a single 4K/30 camera is ~15 Mbit/s of H.264, and a facility has dozens), adds unrecoverable WAN latency, and dies entirely when the circuit drops. Pure-edge, hand-managed — flash a model onto each device and forget it — works for a quarter, then the models drift, you have 140 slightly different versions in the field, no one can tell you which device runs which weights, and re-training means someone physically visits the site. One giant model per device ignores that a cross-dock needs OCR + pallet-counting + damage-detection — three models — and that an edge GPU has finite memory you must share across them.

The split architecture threads this. Training is centralized: frames and hard-negative samples flow back from the fleet to a cloud data lake, models are trained on cloud GPUs where you have elastic capacity and a governance plane, and every model version is an immutable, signed artifact in a registry. Inference is local: each site runs Triton on an edge GPU, serving the current approved model versions with single-digit-millisecond latency, fully functional even when the uplink is down. Updates are over-the-air: a new model version, once it passes evaluation, is rolled out to the fleet as a signed bundle through a staged, cancellable pipeline — never a manual flash. The same architecture serves a 5-site pilot and a 140-site rollout; the difference is fleet-management tooling and rollout batch sizes, not the shape of the diagram.

Architecture overview

The design has three planes that share a model registry but run on completely different clocks: an inference plane at the edge (synchronous, hard-real-time, must survive WAN loss), a training plane in the cloud (batch, GPU-heavy, where governance lives), and a fleet-management / OTA plane that connects them (event-driven, the control channel). Keeping these three mentally separate is the first step to operating this well — they fail independently and scale independently.

[Figure: architecture diagram for Computer Vision: Edge + Cloud Inference with Triton]

Inference plane (edge), numbered as in the diagram: (1) a camera streams RTSP into a frame-ingest service on the edge node, which decodes on the GPU (NVDEC) and batches frames. (2) The service calls NVIDIA Triton Inference Server running locally — over shared-memory or gRPC on localhost, never the network — which hosts the active model ensemble: a detector locates labels/pallets, the crops feed an OCR model, and a classifier scores damage. Triton runs these as a single ensemble (model pipeline) so the orchestration stays inside the server and the GPU stays saturated via dynamic batching. (3) Decisions (label text, pallet count, damage flag) go to the local application, which reconciles against the manifest and acts in real time. (4) Every Nth frame and every low-confidence or disagreement case is captured as a hard-negative sample and queued locally for upload — this is the data flywheel that improves the model. The entire inference path runs with zero cloud dependency; if the uplink is down, only steps that export data pause.

Training plane (cloud): (5) buffered samples upload to a cloud object store (S3 / Azure Blob / GCS) landing zone, where they are de-identified, auto-labeled by a larger teacher model, and queued for human review. (6) A scheduled or triggered training pipeline on cloud GPUs (managed Kubernetes with GPU node pools, or a managed training service) fine-tunes the models, runs the evaluation harness against a golden test set, and — only on a pass — (7) exports each model to its serving format, optimizes it (TensorRT / ONNX), and publishes the versioned artifact to a model registry with its metrics, lineage, and a signature.

OTA plane (control channel): (8) a fleet-management controller watches the registry for newly-approved versions and orchestrates a staged rollout — canary devices first, then expanding rings — pushing a signed model bundle to each edge node’s agent, which verifies the signature, hot-loads the new version into Triton (which supports live model loads without a restart), runs a local smoke test, and reports health. A bad rollout is halted and rolled back automatically on health-signal regression. The whole loop — edge → samples → cloud training → registry → OTA → edge — is the system; no single box is the architecture.

Component breakdown

| Component | Technology | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge inference server | NVIDIA Triton (edge node) | Serve detector + OCR + classifier as an ensemble, locally, real-time | Dynamic batching; TensorRT backend; model-control-mode=explicit for hot reload; shared-memory I/O |
| Frame ingest | Custom service / DeepStream | RTSP decode (NVDEC), pre-process, batch, hard-negative capture | GPU decode; ROI crop on GPU; confidence-gated sampling |
| Edge runtime | K3s or Docker on Jetson/industrial GPU | Host Triton + agent + app on the edge box | Single-node K3s; GPU operator; local PVC for sample buffer |
| Fleet / OTA agent | Lightweight agent per device | Pull signed bundles, verify, hot-load, smoke-test, report | Mutual-TLS to controller; signature verify before load; canary self-test |
| Sample landing | Cloud object store (S3/Blob/GCS) | Receive hard negatives, de-identify, stage for labeling | Lifecycle to cheap tier; SSE-KMS; PII blur at ingest |
| Training | Cloud GPU (K8s GPU pool / managed) | Fine-tune, evaluate, optimize to TensorRT/ONNX | Spot/low-priority GPU; reproducible pipeline; eval gate |
| Model registry | MLflow / cloud model registry | Versioned, signed artifacts with lineage + metrics | Immutable versions; signature + metrics as required metadata |
| Cloud serving (optional) | NVIDIA Triton (cloud, autoscaled) | Heavy/batch/fallback inference; teacher auto-labeling | KServe/Triton on GPU pool; HPA on queue depth |
| CI/CD | GitHub Actions or Jenkins | Build/test model containers, run eval, gate promotion | Eval-as-required-check; sign artifact on pass |
| IaC | Terraform | Cloud infra, GPU pools, registry, networking, edge bootstrap config | Modules per plane; remote state; policy-as-code |
| Identity / SSO | Okta or Microsoft Entra ID | Operator SSO to consoles; service identity federation | SAML/OIDC; per-role RBAC; device certs via SCEP |
| Secrets | HashiCorp Vault | Signing keys, registry creds, device enrollment secrets | Transit engine for signing; short-TTL device tokens |
| Edge security | CrowdStrike Falcon | Runtime threat detection on edge nodes + cloud workers | Lightweight sensor; tamper protection on the fleet |
| Cloud posture | Wiz | CSPM + data-security posture over the cloud account | Scan object store for exposed PII; misconfig alerts |
| Observability | Datadog or Dynatrace | Fleet + cloud metrics, traces, model-quality telemetry | Per-device agent; custom CV metrics; rollout dashboards |
| Edge delivery | Akamai | CDN for large signed bundles to remote sites | Tiered distribution; signed URLs; resumable fetch |
| ITSM / approvals | ServiceNow | Change approvals for production rollouts, incident tickets | Rollout = change request; auto-ticket on rollback |

A few choices deserve the why, because they are the ones teams get wrong.

Why Triton on both ends, not two different serving stacks. It is tempting to write a bespoke inference loop on the edge (a Python script calling the model directly) and use a managed endpoint in the cloud. That gives you two codebases, two sets of pre/post-processing, and two ways for edge and cloud to silently diverge — so the model you evaluated in the cloud behaves differently in the field. Standardizing on Triton everywhere means the same model artifact, the same ensemble definition, the same input/output tensors run in both places. You evaluate in the cloud on the exact server that runs at the edge. Triton also gives you dynamic batching (critical for GPU utilization when frames arrive jittery), multi-framework support (TensorRT, ONNX, PyTorch in one server), concurrent model execution to share one GPU across three models, and — the feature that makes OTA clean — explicit model-control mode, where you load and unload model versions at runtime via an API without restarting the server or dropping in-flight requests.

Why centralize training but never centralize inference. Training wants what the cloud has: elastic, expensive GPUs you rent for hours not months, a governance plane, a place to pool data from the whole fleet so the model learns from all 140 sites’ edge cases at once. Inference wants what the edge has: physical proximity to the camera (latency), independence from the WAN (resilience), and no per-frame egress (cost). Putting them in the wrong place is the classic mistake — central inference re-introduces the latency and outage exposure you built the edge to avoid, and edge training means every device drifts on its own local data with no shared learning and no governance.

Why models are signed, immutable artifacts — not files on a share. The instant a model can stop a production line or approve a shipment, which exact weights are running where becomes an audit and security question. An attacker who can swap a model on an edge box can blind your damage-detector. So every model version is immutable in the registry, signed (the signing key lives in HashiCorp Vault’s transit engine, never on disk), and the edge agent refuses to load a bundle whose signature does not verify. Lineage — which dataset, which code commit, which eval run — is attached to every version, so you can answer “what is running on the Memphis cross-dock and why” instantly.

Why TensorRT optimization is non-negotiable at the edge. A model that runs fine on a cloud A100 may not hit 30 fps on a Jetson-class edge GPU at FP32. Converting to TensorRT with FP16 (or INT8 with calibration) fuses layers, picks optimal kernels for the specific GPU, and routinely delivers a 2–5x latency improvement — often the difference between making and missing the frame budget. The catch: a TensorRT engine is built for a specific GPU architecture, so the optimization step is per-target, and your registry stores a TensorRT engine per device class plus the portable ONNX as the source of truth.

Implementation guidance

Provision with IaC, and treat the edge bootstrap as a first-class deliverable. Use Terraform for the cloud plane — GPU node pools, the object-store landing zone with KMS encryption, the model registry, the OTA controller, and networking. The edge side is half-IaC: Terraform renders the per-device enrollment config (device identity, controller endpoint, Vault role), and a thin bootstrap (cloud-init or a golden image) brings up K3s with the NVIDIA GPU operator, Triton, and the agent on each box. Order matters: registry and signing (Vault transit) before the OTA controller, controller before you enroll any device, and the eval gate wired into CI before any model can be promoted — so there is never a path to the fleet that skips evaluation.
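
The ordering constraint above can be written down as a dependency graph and checked mechanically. The following is a minimal sketch using Python's standard `graphlib`; the component names are illustrative labels chosen for this example, not identifiers from Terraform or any of the tools mentioned.

```python
from graphlib import TopologicalSorter

# Each key lists the components that must exist before it (illustrative names).
DEPENDS_ON = {
    "model_registry": set(),
    "vault_transit_signing": set(),
    "eval_gate_in_ci": set(),
    "ota_controller": {"model_registry", "vault_transit_signing"},
    "device_enrollment": {"ota_controller"},
    "model_promotion": {"eval_gate_in_ci"},
}


def provisioning_order(depends_on):
    """Return one valid bring-up order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, component in enumerate(provisioning_order(DEPENDS_ON), start=1):
        print(step, component)
```

Several orders satisfy the graph, so the useful property is not the exact sequence but that a prerequisite never appears after the component that needs it.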

A minimal Triton model-repository layout for the edge ensemble communicates the intent:

model_repository/
  detector/            # TensorRT engine, dynamic batching on
    config.pbtxt
    1/model.plan
  ocr/                 # reads detector crops
    config.pbtxt
    1/model.plan
  damage_classifier/
    config.pbtxt
    1/model.plan
  pallet_pipeline/     # the ensemble that wires the three together
    config.pbtxt       # ensemble_scheduling: detector -> ocr + classifier
    1/

And the batching/serving knobs that keep the GPU fed and the latency bounded, in detector/config.pbtxt:

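platform: "tensorrt_plan"              # serialized TensorRT engine (1/model.plan)
max_batch_size: 16                     # largest batch Triton may form for this model
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]      # batch sizes the scheduler tries to form first
  max_queue_delay_microseconds: 3000   # wait <=3ms to form a batch
}
instance_group [
  { count: 1  kind: KIND_GPU }         # one execution instance, placed on the GPU
]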

Run Triton in explicit model-control mode so OTA is a clean API call. Start the server with --model-control-mode=explicit; the agent then loads/unloads versions via the model API (POST /v2/repository/models/detector/load) without restarting Triton or dropping in-flight requests. The rollout is: stage the new version’s files (the agent fetched the signed bundle), verify the signature, call load, run a smoke test against a held-out frame set, and only then flip the application to the new version and unload the old. If the smoke test fails, you never flipped — the old version is still serving.
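
The agent-side sequence described above can be sketched as follows. This is an illustration under assumptions, not the agent's actual code: `hot_swap` and its four callables are hypothetical names, signature verification and the smoke test are injected rather than implemented, and the base URL assumes Triton's default HTTP port (8000) on localhost. Only the endpoint path comes from the text.

```python
import urllib.request
from typing import Callable


def triton_load_request(base_url: str, model: str) -> urllib.request.Request:
    """Build (but do not send) the POST for Triton's repository load endpoint."""
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model}/load"
    return urllib.request.Request(url, method="POST")


def hot_swap(
    verify_signature: Callable[[], bool],
    load_new: Callable[[], None],
    smoke_test: Callable[[], bool],
    flip_and_unload_old: Callable[[], None],
) -> str:
    """Run the rollout steps in order; the old version serves until the flip."""
    if not verify_signature():
        return "rejected: bad signature"      # load is never called on unverified files
    load_new()                                # e.g. send triton_load_request(...)
    if not smoke_test():
        return "kept old: smoke test failed"  # the application was never flipped
    flip_and_unload_old()
    return "promoted"


if __name__ == "__main__":
    request = triton_load_request("http://localhost:8000", "detector")
    print(request.get_method(), request.full_url)
```

Passing the steps in as callables keeps the ordering testable without a running server; a real agent would send the request with `urllib.request.urlopen` inside `load_new`.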

Identity and secrets: no long-lived keys on a device in a warehouse. Operators reach the cloud consoles (registry, controller, dashboards) through Okta or Entra ID SSO with per-role RBAC. Each edge device has a hardware-backed identity (a device certificate enrolled via SCEP at provisioning), authenticates to the OTA controller over mutual TLS, and pulls short-TTL credentials from HashiCorp Vault — so a stolen edge box does not yield a standing credential, and revoking a device is a single certificate revocation. The model-signing key never leaves Vault; signing happens through Vault’s transit engine, so even the CI pipeline that promotes a model never holds the raw key.

Wire the OTA bundle delivery through a CDN. Model bundles are large (hundreds of MB with TensorRT engines), and pushing them from a single cloud origin to 140 remote sites on thin uplinks is slow and fragile. Distribute the signed bundle through Akamai — tiered distribution caches it near each site, signed URLs scope access, and resumable fetch survives the flaky warehouse circuit. The controller hands the agent a signed URL; the agent pulls from the nearest edge, verifies the signature locally, and the origin serves each region once.

Build the data flywheel deliberately. The edge captures hard negatives — confidence-gated sampling (every frame the model was unsure about, every case where the OCR and the manifest disagreed) plus a small random baseline. These buffer locally (a bounded ring buffer on the device’s PVC so a long WAN outage cannot fill the disk) and drain to the cloud object store when the link is up. At ingest, faces and any incidental PII are blurred, a larger teacher model in cloud Triton auto-labels for human review, and the curated set feeds the next training run. This is what turns 140 deployed sites into 140 sources of training signal instead of 140 maintenance burdens.
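
A minimal sketch of the capture side of the flywheel, assuming a simple confidence threshold and an in-memory bounded queue standing in for the on-disk ring buffer. The class names, the 0.6 threshold, and the 1% baseline rate are hypothetical values for illustration; the article does not specify them.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    frame_id: int
    confidence: float      # model confidence in [0, 1]
    ocr_text: str          # what the OCR model read
    manifest_text: str     # what the manifest expected


class SampleBuffer:
    """Bounded buffer of hard negatives; the oldest entry is dropped when full."""

    def __init__(self, capacity: int, min_confidence: float = 0.6,
                 baseline_rate: float = 0.01,
                 rng: Optional[random.Random] = None) -> None:
        self.queue: deque = deque(maxlen=capacity)
        self.min_confidence = min_confidence
        self.baseline_rate = baseline_rate
        self.rng = rng or random.Random()

    def should_capture(self, d: Decision) -> bool:
        if d.confidence < self.min_confidence:
            return True                                   # the model was unsure
        if d.ocr_text != d.manifest_text:
            return True                                   # OCR and manifest disagree
        return self.rng.random() < self.baseline_rate     # small random baseline

    def offer(self, d: Decision) -> bool:
        if self.should_capture(d):
            self.queue.append(d)
            return True
        return False

    def drain(self, link_up: bool, upload: Callable[[Decision], None]) -> int:
        """Upload queued samples while the link is up; return how many were sent."""
        if not link_up:
            return 0
        sent = 0
        while self.queue:
            upload(self.queue[0])    # if upload raises, the sample stays queued
            self.queue.popleft()
            sent += 1
        return sent
```

Because the queue has a fixed `maxlen`, a long outage costs the oldest samples rather than disk space, which matches the intent of the bounded buffer described above.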

Enterprise considerations

Security & Zero Trust. The fleet is the attack surface: physical boxes in buildings you do not fully control. Defenses, in layers: (a) signed, verified model artifacts so no unsigned weights ever load (an attacker who reaches the disk still cannot get the agent to load a tampered bundle); (b) device identity + mutual TLS to the controller, with HashiCorp Vault issuing short-TTL credentials, so a stolen device cannot impersonate the fleet; (c) CrowdStrike Falcon sensors on both edge nodes and cloud training workers for runtime threat detection and tamper protection (the edge box should alert if someone opens a shell on it); (d) Wiz for cloud security and data posture, continuously scanning the object-store landing zone for accidentally-exposed frames containing PII and flagging IAM misconfigurations before they become incidents; (e) the inference path stays on localhost (shared memory / loopback gRPC) so camera frames never traverse a network the device does not own. Encrypt sample data at rest (SSE-KMS) and in transit, and de-identify at ingest, not later.

Cost optimization. Costs split across the three planes, and each has a lever. (1) Egress is the silent killer of pure-cloud CV: a single 4K/30 camera is ~15 Mbit/s, and a 140-site fleet streaming everything is a bandwidth bill that dwarfs compute. Edge inference plus confidence-gated sampling keeps well over 99% of frames from ever leaving the building; each site uploads the few hundred hard cases a day, not the millions of frames its cameras see. (2) Cloud GPU on spot/low-priority for training, which is interruptible batch work, can cut training compute by roughly 60–80% versus on-demand. (3) Right-size the edge GPU to the model after TensorRT/INT8: quantization often lets a cheaper edge SKU hit the frame budget, and the per-device saving multiplies across the fleet. (4) Cache bundles at the CDN edge so the cloud origin serves each new model once per region, not once per device. (5) Meter GPU-hours per training job and tag them to the model line for chargeback. The headline: the architecture exists in large part because the edge-vs-cloud split is the cost-optimal one; latency and resilience come along for free.
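
To make the egress lever concrete, here is a small back-of-the-envelope sketch. It uses only the ~15 Mbit/s per-camera figure from the text and the ~1,800-camera fleet from the reference example later in the article; it assumes a constant bitrate and decimal units, and deliberately applies no per-GB price, since none is given.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def stream_gb_per_day(mbit_per_s: float) -> float:
    """Convert a constant bitrate in Mbit/s to decimal gigabytes per day."""
    megabits = mbit_per_s * SECONDS_PER_DAY
    return megabits / 8 / 1000          # Mbit -> MB -> GB


per_camera_gb = stream_gb_per_day(15)            # 162.0 GB/day for one camera
fleet_tb = per_camera_gb * 1_800 / 1000          # about 291.6 TB/day for 1,800 cameras

if __name__ == "__main__":
    print(f"per camera: {per_camera_gb:.1f} GB/day")
    print(f"fleet:      {fleet_tb:.1f} TB/day")
```

Multiplying the fleet figure by whatever egress or transit rate applies to your uplinks gives the bill that local inference avoids.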

Scalability. Each plane scales independently. The edge scales by adding nodes per site (more cameras → more edge GPUs, or more model instances per GPU via Triton’s concurrent execution) — fleet size has no effect on per-device latency because inference is local. The training plane scales with cloud GPU pool size and parallel pipelines; pooling all sites’ data means the model improves with fleet size rather than fragmenting. The OTA plane is the one with real fan-out concern: rolling to 140 sites is a staged, ringed deployment (canary → 10% → 50% → 100%), rate-limited so you never hammer the origin or the controller, with the CDN absorbing the bundle bandwidth. The natural ceiling is operational, not technical — how fast you are willing to expand a rollout while watching health signals.

Failure modes (and they are different at the edge). The defining property is graceful degradation under WAN loss: if the uplink drops, inference keeps running on the last approved model, the line does not stop, and the bounded sample buffer fills locally and drains once the link returns. Other modes and mitigations: a camera failure is detected by the ingest service (no frames) and alerted, with the lane flagged for manual handling. A bad model rollout is the scariest: it is caught by the per-device smoke test before the flip, and if a regression slips past into health metrics (confidence collapse, decision-rate anomaly), the controller auto-halts the rollout and rolls the affected ring back to the prior version, opening a ServiceNow incident automatically. An edge GPU OOM (three models are too much for the device memory) is prevented by sizing model instances to VRAM and should be caught in pre-rollout testing, not production. A poisoned data flywheel (bad auto-labels degrading the next model) is gated by human review on the labeling step and by the eval harness, which will fail a model that regressed on the golden set, blocking promotion entirely.

Reliability & rollout safety (the OTA contract). Treat every production rollout as a change with a contract: it is a ServiceNow change request requiring approval, it deploys to a canary ring first and bakes for a defined window, it expands only while health signals hold, and it is cancellable and auto-reverting at any ring. Because Triton hot-loads versions without a restart and the old version stays loaded until the new one passes its smoke test, the per-device blast radius of a bad model is one smoke-test failure, not an outage. A pragmatic enterprise target: a new model reaches the full fleet over 24–48 hours of staged rollout, any single device’s update is a sub-minute hot-swap, and a regression is detected and rolled back within one ring’s bake window — never a fleet-wide bad deploy.
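
The contract above can be sketched as a ring plan plus a loop that stops at the first unhealthy ring. This is a simplified illustration: `plan_rings` and `run_rollout` are hypothetical names, the `healthy` callback stands in for whatever health signals are checked after the bake window, and approvals, rate limiting, and the bake timer itself are left out.

```python
from typing import Callable, List, Sequence


def plan_rings(devices: Sequence[str], canary_count: int = 1,
               cumulative_percent: Sequence[int] = (10, 50, 100)) -> List[List[str]]:
    """Split devices into rings: canary first, then up to each cumulative percent."""
    rings = [list(devices[:canary_count])]
    done = len(rings[0])
    for percent in cumulative_percent:
        target = -(-len(devices) * percent // 100)   # integer ceiling division
        if target > done:
            rings.append(list(devices[done:target]))
            done = target
    return rings


def run_rollout(rings: List[List[str]],
                deploy: Callable[[str], None],
                healthy: Callable[[List[str]], bool],
                rollback: Callable[[str], None]) -> str:
    """Deploy ring by ring; on a health regression, roll that ring back and halt."""
    for index, ring in enumerate(rings):
        for device in ring:
            deploy(device)
        if not healthy(ring):            # evaluated after the ring's bake window
            for device in ring:
                rollback(device)
            return f"halted at ring {index}"
    return "complete"


if __name__ == "__main__":
    sites = [f"site-{i:03d}" for i in range(140)]
    print([len(ring) for ring in plan_rings(sites)])
```

For 140 sites the plan yields rings of 1, 13, 56, and 70 devices, so a regression found in the canary ring touches a single site.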

Observability. Instrument all three planes into Datadog or Dynatrace with one pane that spans the fleet. Per-device: inference latency (p95/p99), frames-per-second sustained, GPU utilization and VRAM, dropped-frame rate, and the model version currently loaded (so you can answer “what is running where” from the dashboard). Model-quality telemetry the business cares about: per-class confidence distributions, OCR-vs-manifest disagreement rate, damage-flag rate — drift in these is your early warning that the model needs retraining before accuracy visibly degrades. Rollout telemetry: a live dashboard of which ring each device is in, smoke-test pass rate, and health deltas during a rollout, so an operator can hit cancel the moment a canary looks wrong. Trace the cloud training pipeline end to end so a failed promotion is debuggable.
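
One way to turn the OCR-vs-manifest disagreement rate into an early-warning signal is a rolling window compared against a baseline. The sketch below is an assumption-laden illustration: the class name, window size, and the "twice the baseline" tolerance are hypothetical, and a production monitor would live in the observability platform rather than in application code.

```python
from collections import deque


class DisagreementMonitor:
    """Rolling OCR-vs-manifest disagreement rate compared against a baseline."""

    def __init__(self, baseline_rate: float, window: int = 1000,
                 tolerance: float = 2.0) -> None:
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance           # alert when rate exceeds baseline * tolerance
        self.events: deque = deque(maxlen=window)

    def record(self, ocr_text: str, manifest_text: str) -> None:
        self.events.append(ocr_text != manifest_text)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifting(self) -> bool:
        # Only judge a full window, so a few early mismatches do not trigger an alert.
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.rate > self.baseline_rate * self.tolerance
```

The same pattern applies to the damage-flag rate or to per-class confidence, with the baseline taken from the period right after the last approved rollout.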

Governance. Pin model versions explicitly per device class (never a floating “latest”) so behavior cannot drift under you. Every version carries lineage — dataset snapshot, code commit, eval metrics, signer — so an auditor can reconstruct exactly how the model running on any device was produced. Promotion to production goes through the eval gate in CI (GitHub Actions / Jenkins) as a required check and a ServiceNow approval; there is no manual path to the fleet. Keep a retention-bounded record of frames captured for training (with the PII-blur and a right-to-be-forgotten path, since warehouse and store footage is personal data). Apply policy-as-code in Terraform to deny an edge enrollment without a device certificate and an object store without encryption.
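
The "never a floating latest" rule is easy to enforce as a check on the deployment manifest before a rollout is created. A minimal sketch, assuming a hypothetical manifest shape of device class to model to version string; the device-class names in the usage example are made up. It relies on Triton's convention that model versions are numbered directories.

```python
import re
from typing import Dict, List

NUMERIC_VERSION = re.compile(r"[1-9]\d*")   # an explicit version such as "7"


def unpinned_models(manifest: Dict[str, Dict[str, str]]) -> List[str]:
    """Return 'device_class/model=version' for every entry that is not pinned."""
    problems = []
    for device_class, models in sorted(manifest.items()):
        for model, version in sorted(models.items()):
            if not NUMERIC_VERSION.fullmatch(version):
                problems.append(f"{device_class}/{model}={version}")
    return problems


if __name__ == "__main__":
    manifest = {
        "edge-class-a": {"detector": "7", "ocr": "latest"},
        "edge-class-b": {"damage_classifier": "3"},
    }
    print(unpinned_models(manifest))
```

Run as a required CI check, an empty result means every device class names an exact version; anything else blocks the change request.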

Reference enterprise example

Cordon Freight, a fictional North American less-than-truckload carrier (~11,000 employees, 140 cross-dock facilities), built this platform to eliminate manual pallet scanning and the chargebacks that came from mis-scans. Their workload: ~1,800 cameras fleet-wide (12–16 per cross-dock at the loading bays), each 4K/30, running a three-model ensemble — a label/pallet detector, an OCR model for the shipping labels, and a damage classifier.

Decisions they made. They standardized on NVIDIA Triton at the edge and in the cloud so the evaluated artifact was the deployed artifact. Each cross-dock got two industrial edge GPU nodes running single-node K3s, Triton in model-control-mode=explicit, and the OTA agent; the three models ran as a Triton ensemble with dynamic batching, optimized to TensorRT INT8 to sustain 30 fps per camera within a ~40 ms p95 inference budget. Frames stayed on the box (loopback gRPC); only confidence-gated hard negatives — about 700 a day per site — drained to an S3 landing zone, blurred for PII at ingest, auto-labeled by a teacher model in cloud Triton, and human-reviewed. Training ran on a spot GPU pool on managed Kubernetes, gated by an eval harness wired into GitHub Actions as a required check; passing models were signed via HashiCorp Vault transit and published to an MLflow registry. Rollouts were staged through a fleet controller (canary cross-dock → 10% → 50% → 100% over ~36 hours), each a ServiceNow change request, with signed bundles delivered via Akamai to survive thin site uplinks. Operators used Entra ID SSO; devices enrolled with SCEP certs and mutual TLS. CrowdStrike Falcon ran on every edge node and training worker; Wiz watched the S3 landing zone for exposed PII; Datadog unified fleet and cloud telemetry.

The numbers. ~1,800 cameras × 30 fps is ~54,000 inferences/second across the fleet, all served locally; zero per-frame egress. Sample upload was ~100,000 frames/day fleet-wide (vs. the ~4.6 billion frames/day the cameras actually saw — a ~99.998% deflection). Monthly run cost landed near ₹46 lakh (~$55,000): cloud training on spot GPUs ~$14,000, S3 + labeling + teacher inference ~$9,000, OTA/controller/Akamai bundle delivery ~$6,000, observability + security tooling (Datadog, Falcon, Wiz) ~$11,000, registry/CI/networking ~$5,000, and a reserve. The edge hardware was capex (amortized), and the avoided egress alone — had they streamed every frame to the cloud — would have exceeded the entire run cost several times over.
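
The headline figures can be re-derived from the inputs given in this example. The sketch below uses only numbers stated in the text (1,800 cameras, 30 fps, 140 sites, about 700 samples per site per day, and the itemized monthly costs); the reserve is computed as whatever remains of the ~$55,000 total, which is an inference rather than a stated figure.

```python
CAMERAS = 1_800
FPS = 30
SITES = 140
SAMPLES_PER_SITE_PER_DAY = 700
SECONDS_PER_DAY = 86_400

inferences_per_second = CAMERAS * FPS                       # 54,000
frames_per_day = inferences_per_second * SECONDS_PER_DAY    # 4,665,600,000
uploads_per_day = SITES * SAMPLES_PER_SITE_PER_DAY          # 98,000 (the "~100,000")
deflection = 1 - uploads_per_day / frames_per_day           # fraction never uploaded

MONTHLY_TOTAL_USD = 55_000
ITEMIZED_USD = {
    "training_spot_gpu": 14_000,
    "s3_labeling_teacher": 9_000,
    "ota_controller_cdn": 6_000,
    "observability_security": 11_000,
    "registry_ci_networking": 5_000,
}
itemized_sum = sum(ITEMIZED_USD.values())                   # 45,000
implied_reserve = MONTHLY_TOTAL_USD - itemized_sum          # 10,000

if __name__ == "__main__":
    print(f"inferences/s: {inferences_per_second:,}")
    print(f"frames/day:   {frames_per_day:,}")
    print(f"uploads/day:  {uploads_per_day:,}")
    print(f"deflection:   {deflection:.4%}")
    print(f"reserve:      ${implied_reserve:,}")
```

The computed deflection agrees with the ~99.998% quoted above, and the itemized lines leave roughly $10,000 a month as the reserve.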

The outcome. Pallet mis-scans fell ~70%, and because every decision was logged with the model version and a confidence score, the chargeback-dispute team could prove what the system saw at the bay — which closed disputes in their favor that previously defaulted to the customer. A WAN outage at the Memphis cross-dock during a peak Monday was a non-event: inference ran on the last approved model through the four-hour circuit failure, the sample buffer drained when it returned, and not a single pallet went unscanned. When a retrained damage classifier regressed on dark-pallet edge cases, the canary cross-dock’s smoke test plus a confidence-drop alert tripped the controller, which auto-rolled-back that ring and filed a ServiceNow incident before the model ever reached a second site — the failure the architecture exists to contain, contained.

When to use it

Use this architecture when you have computer vision in the physical world with a hard latency budget, a fleet of sites that cannot depend on a WAN link for uptime, per-frame data volumes that make cloud streaming uneconomical, and a governance requirement that you know exactly which model runs where. That covers most enterprise edge-CV demand — logistics scanning, manufacturing inspection, retail loss-prevention, smart-building safety, and field equipment monitoring.

Trade-offs to accept. The split adds real operational surface: you now run a fleet (provisioning, enrollment, health, OTA) and a training plane and the control channel between them. Edge hardware is capex and a refresh cycle, and TensorRT optimization is per-GPU-architecture work you redo when the hardware changes. The data flywheel needs human-in-the-loop labeling to stay clean. And edge debugging is harder — the box is in a warehouse, so your observability has to be good enough to diagnose remotely.

Anti-patterns. (1) Centralizing inference — re-introduces the latency and outage exposure the edge exists to remove. (2) Edge-managed training — every device drifts on its own data with no shared learning or governance. (3) Unsigned models on a file share — an attacker swaps your detector and you cannot prove what ran. (4) Skipping TensorRT/quantization — you miss the frame budget on affordable edge hardware. (5) Manual flashing for updates — does not scale past a pilot and means a truck roll per model change. (6) Streaming every frame to the cloud — an egress bill that dwarfs the compute and dies on every uplink blip. (7) No eval gate before promotion — a regressed model reaches the fleet, and you find out from chargebacks.

Alternatives, and when they win. If your cameras live in one facility with a fat, reliable link and a soft latency budget, cloud-only inference on autoscaled Triton is simpler — skip the fleet entirely. If you have a handful of devices and models that genuinely never change, a hand-managed edge deployment is fine until it isn’t; graduate to OTA when the fleet or the change cadence grows. If your latency budget is generous and frames are sparse, a managed edge ML service (the cloud providers’ device-management offerings) trades some control for less plumbing. And if the vision task is simple enough to run on a CPU or an NPU-class accelerator, you may not need a GPU edge box or Triton at all — match the serving substrate to the model’s real compute demand. The architecture here is the destination for fleet-scale, real-time, governed edge CV — not always the starting line.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
