API Telemetry Runbook
Relay emits first-class telemetry for every API execution. Use this runbook to understand metric surfaces, dashboards, and smoke checks.
Metric Surface
| Metric | Tags | Source | Notes |
|---|---|---|---|
relay.api.requests_total | method, endpoint, status, tenant_id | Telemetry middleware | Route patterns + tenant short codes keep cardinality low. |
relay.api.request_duration_seconds | method, endpoint, tenant_id | Telemetry middleware | Prometheus histograms → Datadog distributions. |
relay.execution.duration | tenant, rufid, runtime, environment | Execution router | Millisecond timing for dashboards + alerts. |
relay.execution.queue_delay | tenant, rufid, environment | Execution router | Non-zero when async backlog exists. |
relay.execution.cost.estimated | tenant, rufid | Execution router | Gauge showing last known cost estimate. |
relay.execution.memory | tenant, rufid, status | Execution router | Peak MB per run. |
Prometheus metrics live under /monitoring/metrics. Datadog receives the relay.* namespace surfaced in the Relay / API Telemetry Baseline dashboard (deployment/datadog/DASHBOARD_API_TELEMETRY.json).
Middleware & Instrumentation
TelemetryMiddlewarecollects HTTP timings, success/error counts, and forwards DogStatsD samples.- Execution routes enrich responses with
X-Relay-*headers so tenant/RUFID tags survive streaming. _record_execution_metricsemits the canonical execution snapshot (duration, queue delay, cost, memory) and relays the payload to Datadog.
Dashboards & Alerts
Run scripts/monitoring/setup_production_monitoring.py --component datadog to provision:
- Latency percentile heatmap
- Error-rate by tenant
- Queue-delay sparkline & backlog gauge
- Cost estimate widgets
Alerts cover latency breaches, queue delay spikes, error rates >2%, and missing data for >10 minutes.
Smoke & CI
scripts/monitoring/api_telemetry_smoke.sh executes after staging deploys and nightly in production:
- Trigger a sample execution for the target tenant.
- Query Datadog for recent
relay.execution.duration{tenant:<id>, rufid:<prefix>}series. - Confirm the dashboard exists and contains data.
Failures exit non-zero, blocking promotion.
Warehouse Exports
Nightly Airflow jobs populate analytics.api_function_daily with latency, queue delay, cost, and cache-hits. Great Expectations (data_quality/api_telemetry.expectations.json) validates the dataset; monitor runs via scripts/monitoring/validate_monitoring.py.
Configuration
| Variable | Purpose |
|---|---|
DATADOG_API_KEY, DATADOG_APP_KEY | Datadog credentials for exporter + monitoring script. |
ANALYTICS_DATABASE_URL | Warehouse connection string (falls back to DATABASE_URL). |
ANALYTICS_TABLE_NAME | Optional override default analytics.api_function_daily. |
TELEMETRY_ENV_PATH | Explicit env path for local DAG runs. |
Troubleshooting
| Symptom | Steps |
|---|---|
| Dashboard flatlines | Run smoke script, verify DogStatsD agent reachable (DD_AGENT_HOST). |
| Missing tenant tags | Ensure auth middleware sets request.state.user and execution responses include X-Relay-Tenant. |
| High cardinality warning | Confirm new metrics truncate RUFIDs; avoid raw UUID tags. |
| Queue metrics missing | Async path must stamp request_time; missing metadata yields queue_delay_ms=None. |
Ownership
- Primary: Platform Observability (
#relay-telemetry) - Secondary: Core Platform Runtime (
#relay-runtime) - Escalate security anomalies to
security@deployrelay.com
Update this runbook whenever metrics, tags, or alert thresholds change.