API Telemetry Runbook

Relay emits first-class telemetry for every API execution. Use this runbook to understand metric surfaces, dashboards, and smoke checks.

Metric Surface

| Metric | Tags | Source | Notes |
| --- | --- | --- | --- |
| relay.api.requests_total | method, endpoint, status, tenant_id | Telemetry middleware | Route patterns + tenant short codes keep cardinality low. |
| relay.api.request_duration_seconds | method, endpoint, tenant_id | Telemetry middleware | Prometheus histograms → Datadog distributions. |
| relay.execution.duration | tenant, rufid, runtime, environment | Execution router | Millisecond timing for dashboards + alerts. |
| relay.execution.queue_delay | tenant, rufid, environment | Execution router | Non-zero when an async backlog exists. |
| relay.execution.cost.estimated | tenant, rufid | Execution router | Gauge showing the last known cost estimate. |
| relay.execution.memory | tenant, rufid, status | Execution router | Peak memory in MB per run. |

Prometheus metrics are exposed under /monitoring/metrics. Datadog receives the relay.* namespace, which is surfaced in the Relay / API Telemetry Baseline dashboard (deployment/datadog/DASHBOARD_API_TELEMETRY.json).
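
A quick way to eyeball the Prometheus surface locally is to scrape the endpoint and filter for the relay counters. The base URL and the underscored exposition name below are assumptions for a local instance, since the host and any metric renaming depend on the deployment.

```python
# Minimal sketch: scrape the Prometheus endpoint and print the relay request counters.
# The base URL and the underscored metric name are assumptions
# (Prometheus exposition replaces dots with underscores).
import requests

resp = requests.get("http://localhost:8000/monitoring/metrics", timeout=5)
resp.raise_for_status()
for line in resp.text.splitlines():
    if line.startswith("relay_api_requests_total"):
        print(line)
```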

Middleware & Instrumentation

  • TelemetryMiddleware collects HTTP timings and success/error counts, and forwards DogStatsD samples (see the sketch after this list).
  • Execution routes enrich responses with X-Relay-* headers so tenant/RUFID tags survive streaming.
  • _record_execution_metrics emits the canonical execution snapshot (duration, queue delay, cost, memory) and relays the payload to Datadog.
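
A minimal sketch of that request-level instrumentation, assuming a Starlette/FastAPI stack and the datadog DogStatsD client; the class name and tag handling here are illustrative, not the production TelemetryMiddleware.

```python
# Minimal sketch of request telemetry, assuming a Starlette/FastAPI stack and the
# `datadog` DogStatsD client. Names are illustrative; the production
# TelemetryMiddleware differs in detail.
import time

from datadog import statsd
from starlette.middleware.base import BaseHTTPMiddleware


class TelemetrySketchMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start

        # Prefer the route pattern over the raw path to keep tag cardinality low;
        # fall back to the raw path if the route is not present in the scope.
        route = request.scope.get("route")
        endpoint = getattr(route, "path", request.url.path)
        tags = [
            f"method:{request.method}",
            f"endpoint:{endpoint}",
            f"status:{response.status_code}",
            # The tenant tag survives streaming because it rides on X-Relay-Tenant.
            f"tenant_id:{response.headers.get('X-Relay-Tenant', 'unknown')}",
        ]
        statsd.increment("relay.api.requests_total", tags=tags)
        statsd.histogram("relay.api.request_duration_seconds", duration, tags=tags)
        return response
```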

Dashboards & Alerts

Run scripts/monitoring/setup_production_monitoring.py --component datadog to provision:

  • Latency percentile heatmap
  • Error-rate by tenant
  • Queue-delay sparkline & backlog gauge
  • Cost estimate widgets

Alerts cover latency breaches, queue delay spikes, error rates >2%, and missing data for >10 minutes.
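
For orientation, monitor queries along the following lines would express the error-rate and queue-delay alerts. The real definitions are provisioned by setup_production_monitoring.py, so the query strings, tag values, and thresholds below are illustrative assumptions only.

```python
# Illustrative Datadog monitor queries; the provisioned definitions come from
# scripts/monitoring/setup_production_monitoring.py and may differ.
EXAMPLE_MONITOR_QUERIES = {
    # Error rate above 2% of all requests over the last 5 minutes
    # (assumes a 5xx-style status tag; adjust to the real tag values).
    "error_rate": (
        "sum(last_5m):sum:relay.api.requests_total{status:5xx}.as_count() "
        "/ sum:relay.api.requests_total{*}.as_count() > 0.02"
    ),
    # Queue delay spiking per tenant (threshold in ms is an assumption).
    "queue_delay_spike": (
        "avg(last_5m):avg:relay.execution.queue_delay{*} by {tenant} > 5000"
    ),
}
```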

Smoke & CI

scripts/monitoring/api_telemetry_smoke.sh executes after staging deploys and nightly in production:

  1. Trigger a sample execution for the target tenant.
  2. Query Datadog for recent relay.execution.duration{tenant:<id>, rufid:<prefix>} series (sketched below).
  3. Confirm the dashboard exists and contains data.

Failures exit non-zero, blocking promotion.
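
Step 2 corresponds roughly to a Datadog metrics query like the following, sketched in Python with the legacy datadog client; the tenant id and RUFID prefix are placeholders.

```python
# Rough Python equivalent of the smoke script's Datadog check (step 2).
# Tenant and RUFID values are placeholders.
import os
import time

from datadog import api, initialize

initialize(
    api_key=os.getenv("DATADOG_API_KEY"),
    app_key=os.getenv("DATADOG_APP_KEY"),
)

now = int(time.time())
result = api.Metric.query(
    start=now - 15 * 60,
    end=now,
    query="avg:relay.execution.duration{tenant:acme,rufid:ruf-123*}",
)
if not result.get("series"):
    raise SystemExit(1)  # non-zero exit blocks promotion, mirroring the shell script
```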

Warehouse Exports

Nightly Airflow jobs populate analytics.api_function_daily with latency, queue-delay, cost, and cache-hit metrics. Great Expectations (data_quality/api_telemetry.expectations.json) validates the dataset; check validation runs with scripts/monitoring/validate_monitoring.py.
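
A simple spot check of the export, sketched with SQLAlchemy and assuming a Postgres warehouse; the `day` column name is an assumption, and the connection string resolution mirrors the fallback described under Configuration.

```python
# Hypothetical spot check: count yesterday's rows in the export table.
# Assumes a Postgres warehouse; the `day` column name is an assumption.
import os

from sqlalchemy import create_engine, text

database_url = os.getenv("ANALYTICS_DATABASE_URL") or os.environ["DATABASE_URL"]
engine = create_engine(database_url)

with engine.connect() as conn:
    row_count = conn.execute(
        text(
            "SELECT COUNT(*) FROM analytics.api_function_daily "
            "WHERE day = CURRENT_DATE - INTERVAL '1 day'"
        )
    ).scalar()

print(f"rows exported for yesterday: {row_count}")
```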

Configuration

| Variable | Purpose |
| --- | --- |
| DATADOG_API_KEY, DATADOG_APP_KEY | Datadog credentials for the exporter + monitoring script. |
| ANALYTICS_DATABASE_URL | Warehouse connection string (falls back to DATABASE_URL). |
| ANALYTICS_TABLE_NAME | Optional table override; defaults to analytics.api_function_daily. |
| TELEMETRY_ENV_PATH | Explicit env file path for local DAG runs. |
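
The fallback and default behaviour resolves roughly as follows; this is a sketch of the expected order, not the exact code the exporter uses.

```python
# Sketch of how the analytics settings are expected to resolve; the actual
# exporter code may differ. DATADOG_* keys are read as-is with no fallback.
import os

analytics_db_url = os.getenv("ANALYTICS_DATABASE_URL") or os.getenv("DATABASE_URL")
analytics_table = os.getenv("ANALYTICS_TABLE_NAME", "analytics.api_function_daily")
telemetry_env_path = os.getenv("TELEMETRY_ENV_PATH")  # only needed for local DAG runs
```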

Troubleshooting

| Symptom | Steps |
| --- | --- |
| Dashboard flatlines | Run the smoke script; verify the DogStatsD agent is reachable (DD_AGENT_HOST). |
| Missing tenant tags | Ensure the auth middleware sets request.state.user and execution responses include X-Relay-Tenant. |
| High cardinality warning | Confirm new metrics truncate RUFIDs; avoid raw UUID tags. |
| Queue metrics missing | The async path must stamp request_time; missing metadata yields queue_delay_ms=None. |
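
For the flatline case, a quick hypothetical probe is to push a test metric through the local agent and confirm it appears in Datadog's metric explorer; DogStatsD is UDP, so the send itself never fails loudly.

```python
# Hypothetical probe for the "dashboard flatlines" case. DogStatsD is UDP,
# so success must be confirmed by finding relay.smoke.manual_probe in Datadog.
import os

from datadog import initialize, statsd

initialize(
    statsd_host=os.getenv("DD_AGENT_HOST", "localhost"),
    statsd_port=int(os.getenv("DD_DOGSTATSD_PORT", "8125")),
)
statsd.increment("relay.smoke.manual_probe", tags=["source:runbook"])
```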

Ownership

  • Primary: Platform Observability (#relay-telemetry)
  • Secondary: Core Platform Runtime (#relay-runtime)
  • Escalate security anomalies to security@deployrelay.com

Update this runbook whenever metrics, tags, or alert thresholds change.