End-User and Synthetic Monitoring on AWS: CloudWatch RUM and Synthetics Canaries

Backend dashboards lie to you about the frontend. Your ALB target group is healthy, p99 server latency is flat, X-Ray shows no faults - and a third of your users are staring at a blank page because a CDN-served bundle 404’d or a third-party tag blew up the main thread. The two signals that close that blind spot are passive and active. CloudWatch RUM (Real User Monitoring) is passive: a JavaScript web client reports what real browsers actually experienced - Core Web Vitals, JS errors, fetch failures, navigation timing. Synthetics canaries are active: scheduled headless-browser scripts that exercise your site from AWS regions on a fixed cadence, so you find out it’s down at 3am instead of from a customer at 9am. Run both, feed their metrics into the same SLO math you already use for backend services, and you get a frontend you can actually put on call.

1. Instrumenting the browser with the CloudWatch RUM web client

RUM has two halves: an app monitor (the AWS-side resource that ingests events, emits metrics, and stores raw events in CloudWatch Logs) and the web client (aws-rum-web, the npm package or CDN snippet that runs in the browser and posts events to the data plane). The web client authenticates to the RUM PutRumEvents API using temporary, unauthenticated credentials from a Cognito identity pool - that is the supported, low-friction path, and it is why you never ship long-lived keys to the browser.

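The --cw-log-enabled flag only turns on raw-event storage in CloudWatch Logs; it plays no part in authentication. When you create the app monitor in the console, RUM can provision the Cognito identity pool, the unauthenticated IAM role, and the role policy for you. If you create it from the CLI or IaC instead, expect to supply an existing pool and role through IdentityPoolId and GuestRoleArn in --app-monitor-configuration. Create the monitor first: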

aws rum create-app-monitor \
  --name kloudvin-web \
  --domain app.kloudvin.com \
  --cw-log-enabled \
  --app-monitor-configuration '{
    "AllowCookies": true,
    "EnableXRay": true,
    "SessionSampleRate": 0.2,
    "Telemetries": ["errors", "performance", "http"],
    "FavoritePages": ["/", "/checkout"]
  }' \
  --region eu-west-1

SessionSampleRate: 0.2 instruments 20% of sessions - the single biggest cost and volume lever (section 7). Telemetries selects which plugins the client loads: errors (JS errors), performance (navigation, resource, and Web Vitals), and http (XHR/fetch instrumentation). EnableXRay: true makes the client record X-Ray trace segments for instrumented HTTP calls. For RUM sessions to stitch to backend traces the client must also send the X-Amzn-Trace-Id header, which the http telemetry only does when its addXRayTraceIdHeader option is set (section 5).
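
To make the sampling lever concrete, here is a minimal sketch of a per-session sampling decision. It illustrates the idea only and is not the web client's actual implementation; the function names and the injectable random source are assumptions made for the example.

```javascript
// Illustrative only: decide once per session whether to record it.
// `random` is injectable so the decision can be tested deterministically.
function shouldRecordSession(sessionSampleRate, random = Math.random) {
  if (sessionSampleRate < 0 || sessionSampleRate > 1) {
    throw new RangeError('sessionSampleRate must be between 0 and 1');
  }
  return random() < sessionSampleRate;
}

// Rough expected number of recorded sessions for a given traffic level.
function expectedSampledSessions(totalSessions, sessionSampleRate) {
  return Math.round(totalSessions * sessionSampleRate);
}
```

The point of deciding per session rather than per event is that a sampled session is recorded in full, so per-session ratios stay meaningful at any rate.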

Grab the generated identity pool ID and app monitor ID from the response - the snippet needs both:

aws rum get-app-monitor --name kloudvin-web --region eu-west-1 \
  --query 'AppMonitor.{Id:Id,Pool:AppMonitorConfiguration.IdentityPoolId}'

Now install the client. Prefer the npm module over the CDN loader in a real app - you get version pinning, tree-shaking, and the ability to gate it behind consent. The config object passed to AwsRum mirrors the app monitor:

import { AwsRum } from 'aws-rum-web';

const config = {
  sessionSampleRate: 0.2,
  identityPoolId: 'eu-west-1:11111111-2222-3333-4444-555555555555',
  endpoint: 'https://dataplane.rum.eu-west-1.amazonaws.com',
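  // Resource types use the web client's own names; addXRayTraceIdHeader can trip CORS on cross-origin calls.
  telemetries: [
    'errors',
    ['http', { addXRayTraceIdHeader: true }],
    ['performance', { recordAllTypes: ['document', 'script', 'stylesheet', 'image'] }]
  ],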
  allowCookies: true,
  enableXRay: true
};

const APPLICATION_ID = 'a1b2c3d4-...';
const APPLICATION_VERSION = '1.0.0';
const APPLICATION_REGION = 'eu-west-1';

// Hold the instance so you can record custom errors/events later.
export const awsRum = new AwsRum(APPLICATION_ID, APPLICATION_VERSION, APPLICATION_REGION, config);

Load the client as early as possible in the document head, before your app bundle. Web Vitals like Largest Contentful Paint and the navigation-timing entries are emitted by the browser early in page load; if the RUM client mounts after first paint it will miss them. Early load is the difference between a populated Web Vitals dashboard and an empty one.

A note on allowCookies: when true, the client stores a session ID and user ID in cookies so a session survives across page navigations and the per-session metrics (below) are accurate. With false, every page load looks like a new session, inflating SessionCount and breaking per-session ratios. Set it according to your consent posture, but understand what you lose.

2. Capturing Core Web Vitals, JS errors, and session/page performance

Once the performance, errors, and http telemetries are live, the app monitor emits a fixed set of CloudWatch metrics into the AWS/RUM namespace, every one carrying an application_name dimension equal to the monitor name. These are the metrics your SLOs and alarms key off. The ones that matter:

Metric	Unit	What it measures
`WebVitalsLargestContentfulPaint`	ms	LCP - loading; the headline “is it fast” Web Vital
`WebVitalsCumulativeLayoutShift`	None	CLS - visual stability
`WebVitalsInteractionToNextPaint`	ms	INP - responsiveness (replaced FID as a Core Web Vital)
`WebVitalsFirstInputDelay`	ms	FID - legacy responsiveness signal
`PerformanceNavigationDuration`	ms	full navigation timing for a page load
`NavigationSatisfiedTransaction`	Count	navigations under the 2000ms Apdex objective
`NavigationToleratedTransaction`	Count	navigations between 2000ms and 8000ms
`NavigationFrustratedTransaction`	Count	navigations slower than the 8000ms frustrating threshold
`JsErrorCount`	Count	uncaught JS error events ingested
`Http5xxCount` / `Http4xxCount`	Count	fetch/XHR responses by status class
`PageViewCount`	Count	page-view events
`SessionCount`	Count	new sessions started

Two things follow from this table. First, the Apdex split (Satisfied/Tolerated/Frustrated) is pre-bucketed against the standard 2000ms/8000ms thresholds, which is exactly the shape you want for a navigation-latency SLI - a count-based good/total ratio, not a quantile you have to estimate. Second, WebVitals* metrics are emitted with the value distribution, so you query them with the p75 statistic - the percentile Google’s Core Web Vitals thresholds are defined against (LCP good <= 2.5s, INP good <= 200ms, CLS good <= 0.1 at p75).
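
As a rough illustration of the p75 rule, the sketch below computes a nearest-rank percentile over raw samples and rates it against the published thresholds. The "good" limits are the ones quoted above; the "poor" limits (4000ms, 500ms, 0.25) come from Google's Web Vitals guidance rather than from this article, and the helper names are assumptions. CloudWatch computes its own percentiles server-side, so treat this as a local sanity check, not a reimplementation.

```javascript
// "Good" and "poor" boundaries for Core Web Vitals (LCP and INP in ms, CLS unitless).
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 },
  INP: { good: 200, poor: 500 },
  CLS: { good: 0.1, poor: 0.25 },
};

// Nearest-rank percentile over raw samples.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Rate a p75 value as good, needs-improvement, or poor.
function rateVital(name, p75) {
  const t = THRESHOLDS[name];
  if (!t) throw new Error(`unknown vital: ${name}`);
  if (p75 <= t.good) return 'good';
  if (p75 <= t.poor) return 'needs-improvement';
  return 'poor';
}
```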

Errors you catch yourself in a try/catch never reach the client's global error handler, so they are invisible to RUM unless you record them explicitly through the held client:

try {
  await loadCheckout();
} catch (err) {
  awsRum.recordError(err);   // becomes a com.amazon.rum.js_error_event -> JsErrorCount
  showFallbackUI();
}

The raw events behind these aggregates land in a CloudWatch Logs log group (because you set --cw-log-enabled). RUM names the group itself, so read the exact name from DataStorage.CwLog.CwLogGroup in the get-app-monitor output rather than guessing it. That is where you go from “JsErrorCount spiked” to “which file, which line, which browser” - query it with Logs Insights:

fields @timestamp, event_details.message, event_details.filename, event_details.lineno, metadata.browserName
| filter event_type = "com.amazon.rum.js_error_event"
| sort @timestamp desc
| limit 50

3. Authoring Synthetics canaries: heartbeat, API, and broken-link blueprints

RUM tells you what users hit. Canaries tell you what would happen if a user hit you right now, including paths with no live traffic. A canary is a Lambda function built from your script plus a Synthetics runtime layer, invoked on a schedule, recording screenshots/HAR/logs to S3 and publishing pass/fail metrics. AWS ships blueprints for the three most common shapes; you can author them in the console, but encode them as IaC so they’re reviewable and reproducible.

Pin the runtime version explicitly. At the time of writing the Node/Puppeteer line is syn-nodejs-puppeteer-12.0 and the Playwright line is syn-nodejs-playwright-4.0; check the Synthetics runtime release notes for the current versions. AWS deprecates old runtimes on a published schedule, so a pinned version is something you must keep current, not set and forget.

Heartbeat - load a URL, assert it rendered, capture a screenshot. The smallest useful canary:

const synthetics = require('Synthetics');

const pageLoadBlueprint = async function () {
  const url = 'https://app.kloudvin.com/';
  const page = await synthetics.getPage();

  // executeStep names the step -> SuccessPercent/Duration get a StepName dimension
  await synthetics.executeStep('loadHomepage', async function () {
    const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
    if (!response || response.status() < 200 || response.status() > 299) {
      throw new Error(`Failed to load ${url}: status ${response && response.status()}`);
    }
  });

  await synthetics.executeStep('verifyContent', async function () {
    await page.waitForSelector('#app', { timeout: 15000 });
  });
};

exports.handler = async () => {
  return await pageLoadBlueprint();
};

API canary - exercise an endpoint and validate status, headers, and body without a browser. Use executeHttpStep, which is purpose-built for request/response assertions and emits per-step 2xx/4xx/5xx metrics:

const synthetics = require('Synthetics');

const apiCanary = async function () {
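  const validate = async function (res) {
    return new Promise((resolve, reject) => {
      if (res.statusCode !== 200) {
        res.resume();   // drain the body so the socket is released
        reject(new Error(`${res.statusCode} ${res.statusMessage}`));
        return;
      }
      let body = '';
      res.on('data', (d) => { body += d; });
      res.on('error', reject);
      res.on('end', () => {
        try {
          // JSON.parse throws on a non-JSON body; reject instead of crashing the run.
          const json = JSON.parse(body);
          if (json.status !== 'ok') reject(new Error(`bad payload: ${body}`));
          else resolve();
        } catch (err) {
          reject(new Error(`invalid JSON from /healthz: ${err.message}`));
        }
      });
    });
  };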

  await synthetics.executeHttpStep('GET /healthz', {
    hostname: 'api.kloudvin.com',
    method: 'GET',
    path: '/healthz',
    port: 443,
    protocol: 'https:',
    headers: { 'User-Agent': synthetics.getCanaryUserAgentString() }
  }, validate);
};

exports.handler = async () => {
  return await apiCanary();
};

Broken-link checker - crawl a page, follow its links, and fail when any returns an error. The runtime provides SyntheticsLink and BrokenLinkCheckerReport for recording each link's result in the run report:

const synthetics = require('Synthetics');
const SyntheticsLink = require('SyntheticsLink');
const BrokenLinkCheckerReport = require('BrokenLinkCheckerReport');

const brokenLinkChecker = async function () {
  const baseUrl = 'https://app.kloudvin.com/';
  const syntheticsConfiguration = synthetics.getConfiguration();
  // Keep checking the remaining links after one fails; the run is still marked failed.
  syntheticsConfiguration.setConfig({ continueOnStepFailure: true });

  const report = new BrokenLinkCheckerReport();
  const page = await synthetics.getPage();
  await synthetics.executeStep('openBase', async () => {
    await page.goto(baseUrl, { waitUntil: 'domcontentloaded', timeout: 30000 });
  });

  // Collect the first 20 links before navigating away from the base page.
  const hrefs = await page.$$eval('a[href]', (as) => as.map((a) => a.href).slice(0, 20));
  for (const href of hrefs) {
    await synthetics.executeStep(`link:${href}`, async () => {
      const res = await page.goto(href, { waitUntil: 'domcontentloaded', timeout: 20000 });
      const status = res ? res.status() : 0;   // goto can resolve to null
      const isBroken = status < 200 || status > 399;
      const link = new SyntheticsLink(href).withStatusCode(status).withText(href);
      report.addLink(link, isBroken);
      if (isBroken) {
        throw new Error(`Broken link ${href}: ${status}`);
      }
    });
  }

  // Attach the collected link results to this run's artifacts.
  await synthetics.addReport(report);
};

exports.handler = async () => {
  return await brokenLinkChecker();
};

Provision the canary itself with Terraform so the schedule, runtime, artifact bucket, and alarm live together:

resource "aws_synthetics_canary" "homepage_heartbeat" {
  name                 = "kv-homepage"            # <= 21 chars, lowercase
  artifact_s3_location = "s3://${aws_s3_bucket.canary_artifacts.id}/homepage/"
  execution_role_arn   = aws_iam_role.canary.arn
  runtime_version      = "syn-nodejs-puppeteer-12.0"
  handler              = "pageLoadBlueprint.handler"
  zip_file             = data.archive_file.heartbeat.output_path

  schedule {
    expression = "rate(1 minute)"   # cadence == your detection latency floor
  }

  run_config {
    timeout_in_seconds = 60
    memory_in_mb       = 1024
    active_tracing     = true       # write canary spans to X-Ray
  }

  success_retention_period = 7      # days of artifacts kept for passing runs
  failure_retention_period = 31     # keep failures longer for forensics

  start_canary = true
}

4. Scheduling canaries across regions and alerting on SuccessPercent

Every canary publishes to the CloudWatchSynthetics namespace. Two metrics are always emitted, both with and without a CanaryName dimension: SuccessPercent (percentage of runs in the period that passed) and Duration (run time in ms). Canaries using executeStep/executeHttpStep additionally emit SuccessPercent and Duration per step (CanaryName + StepName), and HTTP-step canaries emit 2xx/4xx/5xx/Failed. SuccessPercent is what you page on - it’s already a clean availability percentage.

Two scheduling decisions drive coverage:

Cadence is your detection-latency floor. A rate(1 minute) canary cannot detect an outage faster than ~1 minute. Tighter cadence buys faster detection at linear cost (section 7). Run business-critical journeys at 1 minute, secondary paths at 5.
Geography is real. A canary in eu-west-1 will not catch a CloudFront edge or Route 53 issue affecting us-west-2 users. Deploy the same canary to several regions and alarm per region. Single-region synthetic monitoring gives false confidence the moment the problem is regional or DNS-level.

The alarm is the product. Alarm on SuccessPercent below a threshold so any run-level failure trips it, and use breaching as the missing-data treatment so a canary that stops running entirely (a broken deploy, an IAM regression) is itself an alert:

resource "aws_cloudwatch_metric_alarm" "homepage_down" {
  alarm_name          = "kv-homepage-down-eu-west-1"
  namespace           = "CloudWatchSynthetics"
  metric_name         = "SuccessPercent"
  dimensions          = { CanaryName = aws_synthetics_canary.homepage_heartbeat.name }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  datapoints_to_alarm = 2          # 2 of last 3 runs failing -> page (debounces a flaky run)
  comparison_operator = "LessThanThreshold"
  threshold           = 90         # < 90% of runs in the window succeeded
  treat_missing_data  = "breaching"
  alarm_actions       = [aws_sns_topic.pager.arn]
  ok_actions          = [aws_sns_topic.pager.arn]
}

The datapoints_to_alarm = 2 of evaluation_periods = 3 (an “M of N” alarm) is what keeps a single transient failure - one slow third party, one TCP reset - from waking someone, while a genuine outage (two failed runs out of the last three) still pages within roughly 2 to 3 minutes.
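
The interaction between M-of-N and the missing-data treatment is easier to see in code. The sketch below is a deliberately simplified model, not CloudWatch's real evaluation logic (which also has an INSUFFICIENT_DATA state and a wider evaluation range); the function name and the use of null for a missing datapoint are assumptions for the example.

```javascript
// Simplified model of an "M out of N" alarm over the most recent N datapoints.
// A datapoint of null means the canary published nothing for that period.
function evaluateAlarm(datapoints, options) {
  const { threshold, evaluationPeriods, datapointsToAlarm, treatMissingData } = options;
  const recent = datapoints.slice(-evaluationPeriods);
  let breaching = 0;
  for (const value of recent) {
    if (value === null) {
      // Missing data only counts against you when configured as breaching.
      if (treatMissingData === 'breaching') breaching += 1;
    } else if (value < threshold) {
      breaching += 1;
    }
  }
  return breaching >= datapointsToAlarm ? 'ALARM' : 'OK';
}
```

With threshold 90 and 2 of 3, the series [100, 0, 100] stays OK while [100, 0, 0] alarms. A canary that goes silent, [100, null, null], alarms only under the breaching treatment, which is the whole argument for that setting.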

5. Correlating RUM and canary data with X-Ray traces for root cause

Detection without correlation just tells you that something broke. The win is jumping from a degraded frontend signal straight to the offending backend span. AWS wires three planes together if you opt in.

On the RUM side, enableXRay: true makes the web client emit a trace segment for each instrumented fetch/XHR. To link that segment to the backend, the client also has to send the X-Amzn-Trace-Id header, which the http telemetry does only when its addXRayTraceIdHeader option is enabled; for cross-origin calls the API must allow that header in its CORS configuration. When the request then reaches an X-Ray-instrumented backend (ALB, API Gateway, Lambda, your ADOT-instrumented EKS service), the IDs line up and the X-Ray service map shows a node for your client application feeding into the backend graph - so an Http5xxCount spike is one click from the failing downstream segment.

On the canary side, active_tracing = true (the run_config flag above) makes the canary’s outbound calls participate in X-Ray. A failing API canary then carries a trace ID, and the trace shows exactly which hop errored - DNS, TLS, the ALB, or the service - turning “the API canary is red” into “the orders service returned a 503 on the database call.”

Correlation only works if the trace context propagates end to end. If an intermediary (a misconfigured CloudFront behavior, an ALB rule, a service that doesn’t forward X-Amzn-Trace-Id) drops the header, the IDs break and you get orphaned segments instead of one connected map. Verify propagation deliberately - it is the single most common reason RUM-to-X-Ray correlation silently does nothing.
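
One way to verify propagation is to log the trace header at two hops and compare the Root IDs. The helper below is a minimal sketch for that comparison; the function names are assumptions, and it only relies on the documented header shape (semicolon-separated key=value fields such as Root, Parent, and Sampled).

```javascript
// Parse an X-Amzn-Trace-Id header value such as
// "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1".
function parseTraceHeader(headerValue) {
  const fields = {};
  for (const part of String(headerValue || '').split(';')) {
    const idx = part.indexOf('=');
    if (idx > 0) fields[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return fields;
}

// Root IDs look like 1-<8 hex chars of epoch time>-<24 hex chars>.
const ROOT_PATTERN = /^1-[0-9a-f]{8}-[0-9a-f]{24}$/;

// True when both hops carry a well-formed Root and it is the same trace.
function sameTrace(incomingHeader, outgoingHeader) {
  const a = parseTraceHeader(incomingHeader).Root;
  const b = parseTraceHeader(outgoingHeader).Root;
  return Boolean(a) && ROOT_PATTERN.test(a) && a === b;
}
```

If sameTrace returns false between the edge and the service, the hop in between is where the header is being dropped or regenerated.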

The practical triage loop becomes: alarm fires on canary SuccessPercent or RUM Http5xxCount -> open the failed run / RUM session -> follow its trace ID into the X-Ray service map -> read the faulting segment. Minutes, not a war-room.

6. Defining frontend availability and latency SLOs from RUM metrics

You already do SLOs for backend services as good/valid event ratios. RUM lets you extend the same discipline to the browser, with two user-facing SLIs.

Availability SLI - page loads not broken by errors. The cleanest proxy from the default metrics is the error rate experienced in the browser: bad events = Http5xxCount (+ JsErrorCount for hard failures), valid events = PageViewCount, good = valid - bad. Errors are counted per event rather than per page view, so this is a proxy, not an exact per-page ratio. Compute a success ratio with a CloudWatch metric-math expression over the AWS/RUM namespace:

# Metric math on AWS/RUM (dimension application_name = kloudvin-web)
m1 = JsErrorCount        (Sum)
m2 = Http5xxCount        (Sum)
m3 = PageViewCount       (Sum)
e1 = 100 * (1 - (m1 + m2) / m3)     # frontend success %, the SLI you SLO against
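
The same expression can be mirrored in application code, for example in a report job. This is a sketch under the assumption that you have already fetched the three Sum values for one window; the function and field names are hypothetical. It adds two guards the raw expression lacks: no page views yields null instead of a division by zero, and the ratio is clamped because one broken page can log several errors.

```javascript
// Frontend success % from summed RUM counters over one window.
// Returns null when there were no page views, so "no traffic" is not reported as 100% or 0%.
function frontendSuccessPercent({ jsErrorCount, http5xxCount, pageViewCount }) {
  if (pageViewCount <= 0) return null;
  const errorRatio = (jsErrorCount + http5xxCount) / pageViewCount;
  // Errors are counted per event, so clamp to keep the SLI within [0, 100].
  const clamped = Math.min(Math.max(errorRatio, 0), 1);
  return 100 * (1 - clamped);
}
```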

Latency SLI - navigations that felt fast. The Apdex buckets are tailor-made: good = NavigationSatisfiedTransaction, valid = Satisfied + Tolerated + Frustrated. That is a count-based ratio against the 2000ms objective with no quantile estimation:

s = NavigationSatisfiedTransaction (Sum)
t = NavigationToleratedTransaction (Sum)
f = NavigationFrustratedTransaction (Sum)
fast_pct = 100 * s / (s + t + f)    # % of navigations under the 2000ms Apdex objective
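
For completeness, the bucket counts also yield the classic Apdex score, which gives tolerated navigations half credit. The article's SLI is fast_pct; the Apdex score is an extra shown here for comparison, and the function name is an assumption.

```javascript
// Latency SLI and classic Apdex score from the three RUM navigation buckets.
function navigationLatency({ satisfied, tolerated, frustrated }) {
  const total = satisfied + tolerated + frustrated;
  if (total === 0) return { fastPercent: null, apdex: null };
  return {
    // Share of navigations under the 2000ms objective.
    fastPercent: (100 * satisfied) / total,
    // Classic Apdex: satisfied count fully, tolerated count half.
    apdex: (satisfied + tolerated / 2) / total,
  };
}
```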

Separately, track the Core Web Vitals at p75 against Google’s thresholds, because that is how Search and field-data tools judge you: LCP good <= 2500ms, INP good <= 200ms, CLS good <= 0.1. Alarm on the p75 statistic of WebVitalsLargestContentfulPaint and WebVitalsInteractionToNextPaint.

Then put a multi-window burn-rate discipline on the availability SLI exactly as you would for a backend SLO: a fast-burn alarm (e.g. 14.4x burn over 1h) for “we’re torching the budget now,” and a slow-burn alarm (e.g. 6x over 6h) for “a low-grade regression is quietly eating the month.” Synthetic SuccessPercent is the leading indicator (catches outages with zero live traffic); RUM ratios are the ground truth (what users actually got). Page on the canary, report the SLO on RUM.
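
Burn rate is just the observed error ratio divided by the error budget, so the thresholds above translate directly into code. The sketch below assumes the common practice of pairing each long window with a shorter confirmation window so the alert clears quickly once the problem stops; that pairing and the function names are assumptions, not something the RUM metrics prescribe.

```javascript
// Burn rate = observed error ratio / error budget ratio (1 - SLO target).
function burnRate(badEvents, totalEvents, sloTarget) {
  if (totalEvents <= 0) return 0;
  const errorBudget = 1 - sloTarget;
  return (badEvents / totalEvents) / errorBudget;
}

// Alert only when both the long window and the short confirmation window
// are burning at or above the threshold.
function shouldAlert(longWindow, shortWindow, sloTarget, threshold) {
  return (
    burnRate(longWindow.bad, longWindow.total, sloTarget) >= threshold &&
    burnRate(shortWindow.bad, shortWindow.total, sloTarget) >= threshold
  );
}
```

For a 99.9% SLO, a 14.4x burn corresponds to a 1.44% error ratio: 144 bad events out of 10,000.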

7. Sampling, data retention, and managing RUM/Synthetics cost

Both services bill on volume, and both have a single dominant lever.

RUM is priced per RUM event ingested (every page view, error, navigation, resource, and Web Vital is an event), so cost scales with traffic x events-per-session x SessionSampleRate. Controls, highest leverage first:

SessionSampleRate is the master dial. Dropping from 1.0 to 0.2 cuts ingested events ~5x while still giving statistically sound Web Vitals and error rates for a high-traffic app. Sample lower on a busy consumer site, higher (up to 1.0) on a low-traffic internal app where you need every session.
Trim telemetries and resource recording. The performance plugin’s recordAllTypes/eventLimit control how many resource-timing events each page emits; an unbounded list on an asset-heavy page is a quiet event-multiplier. Record the resource types you’ll actually query.
Raw-event logs cost separately. --cw-log-enabled writes every sampled event to CloudWatch Logs (storage + ingestion) on top of RUM ingestion. Keep it on for debuggability, but set an explicit retention policy on the app monitor's log group rather than relying on whatever default it was created with.
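
A back-of-the-envelope estimator makes the multiplication explicit. This is a sketch with hypothetical inputs: the traffic figures are made up, and the price is passed in as a parameter because it varies by region and over time, so take the real number from the CloudWatch pricing page.

```javascript
// Estimated ingested RUM events per month: traffic x events-per-session x sample rate.
function estimateRumEvents({ sessionsPerMonth, eventsPerSession, sessionSampleRate }) {
  return Math.round(sessionsPerMonth * eventsPerSession * sessionSampleRate);
}

// Cost for a given event volume; pricePer100kEvents is supplied by the caller.
function estimateRumCost(events, pricePer100kEvents) {
  return (events / 100000) * pricePer100kEvents;
}
```

With 2,000,000 sessions a month and 40 events per session, a rate of 0.2 ingests 16,000,000 events against 80,000,000 at 1.0, which is the 5x difference described above.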

Synthetics is priced per canary run. Cost scales linearly with (number of canaries) x (runs per hour) x (regions) - which is precisely the cross-product you expand for coverage in section 4. So treat cadence as a budget: 1-minute on a handful of revenue-critical journeys, 5- to 15-minute on the long tail. The two artifact retention knobs (success_retention_period, failure_retention_period) govern S3 storage; the asymmetric setting in the Terraform above - 7 days for passes, 31 for failures - keeps forensics cheap without hoarding green screenshots.
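
The same cross-product can be sketched for canaries. The helper below counts runs per month for a fleet description; the shape of the input objects and the 730-hour month are assumptions for illustration, and the per-run price is left out for you to apply from the pricing page.

```javascript
// Monthly canary runs: sum over canaries of (runs per hour) x (regions) x (hours per month).
function canaryRunsPerMonth(canaries, hoursPerMonth = 730) {
  return canaries.reduce(
    (sum, c) => sum + (60 / c.intervalMinutes) * c.regions * hoursPerMonth,
    0
  );
}
```

A single 1-minute canary in three regions comes to 131,400 runs a month, while a 5-minute canary in one region is 8,760, so the cadence and region choices dominate the bill far more than the number of scripts.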

Cost on both services is a deliberate trade against statistical power and detection latency, not an afterthought. Set SessionSampleRate and canary cadence per application based on traffic and how much downtime that journey can tolerate - one global default is always wrong somewhere.

Verify

Confirm the whole pipeline is actually emitting before you trust the dashboards.

Check the app monitor exists, is ingesting, and is logging raw events:

aws rum get-app-monitor --name kloudvin-web --region eu-west-1 \
  --query 'AppMonitor.{State:State,Log:DataStorage.CwLog.CwLogEnabled,Sample:AppMonitorConfiguration.SessionSampleRate}'

Confirm RUM metrics are flowing into AWS/RUM (load the site in a browser first, then give it a minute):

aws cloudwatch list-metrics --namespace AWS/RUM \
  --dimensions Name=application_name,Value=kloudvin-web \
  --region eu-west-1 --query 'Metrics[].MetricName'

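Check p75 LCP over the last hour - this is your headline Web Vital. The date -v-1H form below is BSD/macOS syntax; on GNU/Linux use date -u -d '1 hour ago' with the same format string: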

aws cloudwatch get-metric-statistics --namespace AWS/RUM \
  --metric-name WebVitalsLargestContentfulPaint \
  --dimensions Name=application_name,Value=kloudvin-web \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 --extended-statistics p75 --region eu-west-1

Confirm the canary ran and check its last result, then verify SuccessPercent exists:

aws synthetics get-canary-runs --name kv-homepage --max-results 3 --region eu-west-1 \
  --query 'CanaryRuns[].{Status:Status.State,Started:Timeline.Started}'

aws cloudwatch list-metrics --namespace CloudWatchSynthetics \
  --dimensions Name=CanaryName,Value=kv-homepage --region eu-west-1 \
  --query 'Metrics[].MetricName'

Prove the alarm actually pages: temporarily point the heartbeat at a path that returns 500 (or stop the canary) and confirm the alarm transitions to ALARM and SNS delivers. An alarm you have never seen fire is a hypothesis, not a control.

aws cloudwatch describe-alarms --alarm-names kv-homepage-down-eu-west-1 \
  --region eu-west-1 --query 'MetricAlarms[].{State:StateValue,Reason:StateReason}'

Enterprise scenario

A retail platform team ran a single-page checkout behind CloudFront, with an eu-west-1 heartbeat canary on rate(5 minutes) and RUM at 100% sampling. During a Black Friday dry run they hit two problems at once. First, RUM ingestion costs were projected at roughly 5x their normal monthly bill under peak traffic, because at 100% sampling every event from every session is billed and peak traffic multiplies the session count. Second - and worse - a canary deploy in week one had silently failed to roll out the new runtime, the canary stopped running, and because their alarm treated missing data as notBreaching, the monitor going dark looked identical to “all healthy.” They found it only because a manual smoke test caught a real checkout 500 the dead canary had missed.

The constraint: cut RUM cost without going blind on Web Vitals, and make canary silence itself an incident, across the three regions their customers actually came from.

The fix had four parts. They dropped SessionSampleRate to 0.15 (still tens of thousands of sampled sessions/day - more than enough for stable p75 Web Vitals and error rates), put a 14-day retention policy on the raw-event log group, deployed the checkout heartbeat to eu-west-1, eu-central-1, and us-east-1 at rate(1 minute), and changed every alarm to treat_missing_data = "breaching" so a canary that stops emitting pages on its own. The alarm that turned “we’ll find out next dry run” into “we find out in two minutes”:

resource "aws_cloudwatch_metric_alarm" "checkout_down" {
  alarm_name          = "checkout-down-${var.region}"
  namespace           = "CloudWatchSynthetics"
  metric_name         = "SuccessPercent"
  dimensions          = { CanaryName = "checkout-${var.region}" }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  datapoints_to_alarm = 2
  comparison_operator = "LessThanThreshold"
  threshold           = 90
  treat_missing_data  = "breaching"   # the line that made silence an incident
  alarm_actions       = [aws_sns_topic.checkout_pager.arn]
}

Net result: projected RUM spend dropped ~85% against the 100%-sampling baseline with no loss of confidence in the p75 Web Vitals SLO, and on the actual Black Friday a CloudFront edge issue affecting only us-east-1 users tripped the us-east-1 canary nine minutes before the first customer complaint - exactly the regional failure the single-region setup would have missed.


Written by Vinod
