API Budgets & Guardrails
Relay enforces per-function budgets for latency, cost, and concurrency. Use this runbook to understand policy evaluation and mitigation steps.
Budget Model
- Dimensions: tenant_id, function_rufid, optional environment.
- Metrics: latency (p50/p95/p99), concurrency, error budget %, estimated cost per invocation.
- Policies are stored in function_budget_policy; hourly snapshots in function_budget_snapshot feed dashboards.
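
To make the budget model concrete, here is a minimal sketch of how a policy keyed by these dimensions might be represented and looked up. The class names, field names, units (milliseconds), and the fallback from an environment-specific policy to an environment-agnostic one are all assumptions for illustration; the actual function_budget_policy schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetKey:
    """Dimensions a policy is scoped to (hypothetical representation)."""
    tenant_id: str
    function_rufid: str
    environment: Optional[str] = None  # optional dimension


@dataclass(frozen=True)
class BudgetPolicy:
    """Per-metric thresholds; field names and units are assumptions."""
    key: BudgetKey
    latency_p50_ms: Optional[float] = None
    latency_p95_ms: Optional[float] = None
    latency_p99_ms: Optional[float] = None
    max_concurrency: Optional[int] = None
    error_budget_pct: Optional[float] = None
    max_cost_per_invocation: Optional[float] = None


def resolve_policy(policies, tenant_id, function_rufid, environment=None):
    """Prefer an environment-specific policy, else the environment-agnostic one.

    This fallback order is an assumption, not documented Relay behavior.
    """
    by_key = {p.key: p for p in policies}
    specific = by_key.get(BudgetKey(tenant_id, function_rufid, environment))
    if specific is not None:
        return specific
    return by_key.get(BudgetKey(tenant_id, function_rufid, None))
```

With this shape, a function that has both a generic and a production policy resolves to the production one only when environment="prod" is requested.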
Enforcement
- Execution router loads active policy from cache (60s TTL).
- For each invocation, router compares live metrics (from FN-056 telemetry) to thresholds.
- Actions:
  - Allow – within budget.
  - Warn – attach x-relay-budget-warning header; emit budget.alert event.
  - Degrade – route to fallback branch or cached response.
  - Throttle / Deny – return HTTP 429 or queue job when thresholds are exceeded.
- All decisions are logged and surfaced in Datadog (budget.alert) + console dashboard.
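
The enforcement flow above can be sketched roughly as follows: a policy cache with a 60-second TTL, and a decision function that maps a live p95 reading onto an action. The escalation ratios (0.8, 1.0, 1.5), the class and function names, and the header value are hypothetical; this runbook does not specify how the router grades a breach.

```python
import time
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    DEGRADE = "degrade"
    THROTTLE = "throttle"


class PolicyCache:
    """Caches loaded policies for ttl_seconds (60s per this runbook)."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, policy)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        # Missing or expired: reload and restart the TTL.
        policy = self._loader(key)
        self._entries[key] = (now + self._ttl, policy)
        return policy

    def invalidate(self, key):
        # Lets a policy edit take effect before the TTL expires.
        self._entries.pop(key, None)


def decide(live_p95_ms, threshold_p95_ms,
           warn_ratio=0.8, degrade_ratio=1.0, throttle_ratio=1.5):
    """Map live p95 / threshold onto an action. The ratios are hypothetical."""
    ratio = live_p95_ms / threshold_p95_ms
    if ratio >= throttle_ratio:
        return Action.THROTTLE
    if ratio >= degrade_ratio:
        return Action.DEGRADE
    if ratio >= warn_ratio:
        return Action.WARN
    return Action.ALLOW


def response_headers(action):
    """Warn attaches x-relay-budget-warning; the header value is an assumption."""
    if action is Action.WARN:
        return {"x-relay-budget-warning": "latency-p95"}
    return {}
```

Injecting the clock keeps the TTL behavior testable without sleeping, and the invalidate method corresponds to the cache invalidation that a policy edit triggers.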
Console View
- Executions tab shows budget health meters (status badges, trend sparkline).
- Detail drawer lists recent breaches, recommended mitigation (scale, enable cache, adjust budgets).
- Policy editor requires admin scope; changes trigger cache invalidation + audit entry.
Runbooks
- Latency breach
  - Confirm breach via Datadog (Relay / API Telemetry dashboard).
  - Check concurrency; scale worker replicas or enable edge caching (FN-059).
  - Consider temporary override via console (documented with reason + expiry).
- Cost spike
  - Validate inputs (payload size, retries).
  - Coordinate with Finance before raising limits; monitor usage ledger.
- Error budget exhaustion
  - Inspect recent deployments; roll back if regressions were introduced.
  - Activate canary/experiment overrides using console’s policy editor.
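
A temporary override is only safe if it carries a reason and lapses on its own. The sketch below shows one way to model that; the BudgetOverride type, its fields, and effective_p95_threshold are hypothetical and do not describe how the console actually stores overrides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BudgetOverride:
    """A temporary threshold override; all field names are hypothetical."""
    function_rufid: str
    latency_p95_ms: float
    reason: str
    expires_at: datetime

    def __post_init__(self):
        # Reject overrides that skip the documentation requirement.
        if not self.reason.strip():
            raise ValueError("an override must document its reason")
        if self.expires_at.tzinfo is None:
            raise ValueError("expires_at must be timezone-aware")

    def is_active(self, now=None):
        now = now or datetime.now(timezone.utc)
        return now < self.expires_at


def effective_p95_threshold(policy_p95_ms, override=None, now=None):
    """Use the override while it is active; revert to the policy once expired."""
    if override is not None and override.is_active(now):
        return override.latency_p95_ms
    return policy_p95_ms


# Usage: relax the p95 budget for two hours during an incident.
override = BudgetOverride(
    function_rufid="fn-example",
    latency_p95_ms=800.0,
    reason="incident mitigation while replicas scale",
    expires_at=datetime.now(timezone.utc) + timedelta(hours=2),
)
```

Because expiry is checked at read time, a forgotten override reverts to the policy threshold without manual cleanup.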
Alerts & Notifications
- Datadog monitors cover sustained p95/p99 breaches, the daily cost budget, and error rate thresholds.
- Slack notifications (#relay-runtime) include the recommended action from the policy engine.
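
"Sustained" here means a breach that persists across several consecutive evaluations instead of a single spike. The following is an illustrative sketch of that idea only; it is not how Datadog monitors are implemented, and the class name and window size are assumptions.

```python
from collections import deque


class SustainedBreachDetector:
    """Fires once a metric exceeds its threshold for `window` consecutive samples."""

    def __init__(self, threshold, window=5):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        # Keeps only the most recent `window` breach flags.
        self._recent = deque(maxlen=window)

    def observe(self, value):
        """Record one sample; return True only when the whole window is in breach."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self._recent.maxlen and all(self._recent)
```

A single in-budget sample resets the streak, which is what keeps one-off spikes from paging the on-call.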
CLI / API
- API endpoint: POST /api/v1/budgets (admin scope) updates policies.
- CLI command (planned): relay budgets set --tenant <id> --rufid <id> --latency-p95 500.
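
As a rough illustration of a policy update call, the sketch below builds (but does not send) a POST /api/v1/budgets request using only the standard library. The request body field names, the millisecond unit, the bearer-token auth scheme, and the base URL are all assumptions; check the design doc for the real payload schema before using anything like this.

```python
import json
import urllib.request


def build_budget_update(base_url, token, tenant_id, function_rufid, latency_p95_ms):
    """Build (but do not send) a POST /api/v1/budgets request."""
    body = {
        # Field names and units below are assumed, not taken from the API spec.
        "tenant_id": tenant_id,
        "function_rufid": function_rufid,
        "latency_p95_ms": latency_p95_ms,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/budgets",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


# Usage (hypothetical host and token); send with urllib.request.urlopen(req).
req = build_budget_update("https://relay.example.com", "ADMIN_TOKEN", "t1", "fn-a", 500)
```

Separating request construction from sending makes the payload easy to inspect or log before an admin-scoped change is applied.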
Related Assets
- Design doc: dev_process/planning/FN-057_BUDGETS_DESIGN.md
- Integration tests: tests/integration/execution/test_budget_enforcement.py
- Telemetry source: FN-056 runbook (operations/api-telemetry)
Contacts
- Owner: Platform Performance Team (#relay-performance)
- Escalation: Ops on-call (#relay-runtime)
Keep this guide updated when policy schema or console flows change.