API Budgets & Guardrails

Relay enforces per-function budgets for latency, cost, and concurrency. Use this runbook to understand policy evaluation and mitigation steps.

Budget Model

  • Dimensions: tenant_id, function_rufid, optional environment.
  • Metrics: latency (p50/p95/p99), concurrency, error budget %, estimated cost per invocation.
  • Policies are stored in function_budget_policy; hourly snapshots in function_budget_snapshot feed the dashboards.
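
As an illustration of the model above, a policy can be pictured as a set of optional thresholds keyed by the three dimensions. This is only a sketch: the class names, the threshold field names, and the fallback from an environment-specific policy to an environment-agnostic one are assumptions, not the actual function_budget_policy schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetKey:
    """Dimensions a policy is keyed on (names mirror the list above)."""
    tenant_id: str
    function_rufid: str
    environment: Optional[str] = None  # optional dimension


@dataclass(frozen=True)
class BudgetPolicy:
    """Hypothetical policy row; the real table columns may differ."""
    key: BudgetKey
    latency_p95_ms: Optional[float] = None
    latency_p99_ms: Optional[float] = None
    max_concurrency: Optional[int] = None
    error_budget_pct: Optional[float] = None
    max_cost_per_invocation: Optional[float] = None


def resolve_policy(policies, tenant_id, function_rufid, environment=None):
    """Prefer an environment-specific policy, else the environment-agnostic one.

    The fallback order is an assumption made for this sketch.
    """
    by_key = {p.key: p for p in policies}
    specific = by_key.get(BudgetKey(tenant_id, function_rufid, environment))
    if specific is not None:
        return specific
    return by_key.get(BudgetKey(tenant_id, function_rufid, None))
```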

Enforcement

  1. Execution router loads active policy from cache (60s TTL).
  2. For each invocation, router compares live metrics (from FN-056 telemetry) to thresholds.
  3. Actions:
    • Allow – within budget.
    • Warn – attach x-relay-budget-warning header; emit budget.alert event.
    • Degrade – route to fallback branch or cached response.
    • Throttle / Deny – return HTTP 429 or queue the job when thresholds are exceeded.
  4. All decisions are logged and surfaced in Datadog (budget.alert) and the console dashboard.
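
The comparison in steps 2 and 3 can be sketched roughly as follows. The runbook does not say how severity maps to an action, so the ratio cut-offs below (80%, 100%, 120% of budget) are hypothetical, as are the metric names and the header value; only the x-relay-budget-warning header name comes from the steps above.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    DEGRADE = "degrade"
    THROTTLE = "throttle"


# Hypothetical ratios of observed value to budget; the real cut-offs
# are defined by the policy and are not specified in this runbook.
WARN_RATIO = 0.8
DEGRADE_RATIO = 1.0
THROTTLE_RATIO = 1.2


def decide(live: dict, limits: dict) -> Action:
    """Return the most severe action across all budgeted metrics."""
    worst = 0.0
    for metric, limit in limits.items():
        observed = live.get(metric)
        if observed is None or limit <= 0:
            continue  # no telemetry or no usable budget for this metric
        worst = max(worst, observed / limit)
    if worst >= THROTTLE_RATIO:
        return Action.THROTTLE
    if worst >= DEGRADE_RATIO:
        return Action.DEGRADE
    if worst >= WARN_RATIO:
        return Action.WARN
    return Action.ALLOW


def response_headers(action: Action) -> dict:
    """Attach the warning header when the action is Warn."""
    if action is Action.WARN:
        return {"x-relay-budget-warning": "true"}  # header value is an assumption
    return {}
```

Taking the worst ratio across metrics means a single exhausted dimension (for example concurrency) is enough to escalate, even when latency is healthy.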

Console View

  • Executions tab shows budget health meters (status badges, trend sparkline).
  • Detail drawer lists recent breaches, recommended mitigation (scale, enable cache, adjust budgets).
  • Policy editor requires admin scope; changes trigger cache invalidation + audit entry.
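
To show why an edit needs cache invalidation, here is a minimal sketch of a policy cache with the 60s TTL from the Enforcement section. The PolicyCache class and its loader callback are hypothetical; the router's real cache is not described in this runbook.

```python
import time


class PolicyCache:
    """Tiny TTL cache: entries expire after ttl_seconds or on explicit invalidation."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader   # callable(key) -> policy; hypothetical store lookup
        self._ttl = ttl_seconds
        self._clock = clock     # injectable so tests can control time
        self._entries = {}      # key -> (expires_at, policy)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        # Missing or expired: reload from the backing store.
        policy = self._loader(key)
        self._entries[key] = (now + self._ttl, policy)
        return policy

    def invalidate(self, key):
        """Drop the entry so the next get() reloads without waiting for the TTL."""
        self._entries.pop(key, None)
```

Without the invalidate call, an edited policy could keep being enforced in its old form for up to one TTL window.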

Runbooks

  1. Latency breach
    • Confirm breach via Datadog (Relay / API Telemetry dashboard).
    • Check concurrency; scale worker replicas or enable edge caching (FN-059).
    • Consider a temporary override via the console (document the reason and expiry).
  2. Cost spike
    • Validate inputs (payload size, retries).
    • Coordinate with Finance before raising limits; monitor usage ledger.
  3. Error budget exhaustion
    • Inspect recent deployments; roll back if one introduced a regression.
    • Activate canary/experiment overrides using the console's policy editor.
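
The "reason + expiry" requirement for temporary overrides could be modelled along these lines. The BudgetOverride record and effective_limit helper are illustrative names only; the console's actual override format is not documented here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BudgetOverride:
    """Hypothetical override record; field names are illustrative."""
    metric: str
    limit: float
    reason: str
    expires_at: datetime

    def __post_init__(self):
        if not self.reason.strip():
            raise ValueError("an override must document its reason")
        if self.expires_at.tzinfo is None:
            raise ValueError("expires_at must be timezone-aware")


def effective_limit(base_limit, override, now=None):
    """Use the override only while it is unexpired; otherwise the base budget applies."""
    now = now or datetime.now(timezone.utc)
    if override is not None and now < override.expires_at:
        return override.limit
    return base_limit
```

Because expiry is checked at read time, a forgotten override lapses on its own and the original budget resumes without a manual cleanup step.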

Alerts & Notifications

  • Datadog monitors cover sustained p95/p99 breaches, the daily cost budget, and error-rate thresholds.
  • Slack notifications (#relay-runtime) include recommended action from policy engine.

CLI / API

  • API endpoint POST /api/v1/budgets (admin scope) updates policies.
  • CLI command (planned): relay budgets set --tenant <id> --rufid <id> --latency-p95 500.
  • Design doc: dev_process/planning/FN-057_BUDGETS_DESIGN.md
  • Integration tests: tests/integration/execution/test_budget_enforcement.py
  • Telemetry source: FN-056 runbook (operations/api-telemetry)
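
For orientation, a policy update against POST /api/v1/budgets might be assembled like this. Only the method and path come from this guide; the host, the JSON field names (modelled on the planned CLI flags), and the bearer-token auth scheme are assumptions, so check the design doc before relying on them. The sketch builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://relay.example.internal"  # hypothetical host


def build_budget_request(token, tenant_id, function_rufid, latency_p95_ms):
    """Build (but do not send) a POST /api/v1/budgets request."""
    body = {
        # Field names are assumptions modelled on the planned CLI flags.
        "tenant_id": tenant_id,
        "function_rufid": function_rufid,
        "latency_p95": latency_p95_ms,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/budgets",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # auth scheme is an assumption
        },
    )
```

Passing the result to urllib.request.urlopen would send it; the token must carry admin scope, as noted above.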

Contacts

  • Owner: Platform Performance Team (#relay-performance)
  • Escalation: Ops on-call (#relay-runtime)

Keep this guide updated when policy schema or console flows change.