Execution Engine Deep Dive

The execution engine is responsible for preparing, running, and finalizing every function invocation. This guide explains its architecture, lifecycle, and extension points.

Architecture Overview

  1. Request Intake — API receives a function execution request (sync or async) and resolves the target RUFID plus tenant context.
  2. Scheduling — Work is dispatched to execution workers via the queue adapter (Redis or Postgres backends) with sharding by tenant and priority.
  3. Sandbox Prep — Workers hydrate the sandbox, install declared dependencies, and enforce resource limits (CPU, memory, wall-clock timeout).
  4. Execution — Code runs inside the managed runtime with pluggable adapters (Python today, Node.js and WASM on the roadmap).
  5. Finalization — Results, metrics, and logs are persisted. Cleanup hooks tear down the sandbox, release slots, and emit observability signals.
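
The flow above can be sketched end to end in a few lines. The class and method names below (ExecutionRequest, dispatch, create, and so on) are illustrative assumptions, not the actual interfaces under src/services/execution_engine/:

# Hypothetical sketch of the five-stage pipeline; names are illustrative.
from dataclasses import dataclass

@dataclass
class ExecutionRequest:
    rufid: str
    tenant_id: str
    payload: dict
    priority: int = 0

def handle_request(req: ExecutionRequest, resolver, queue, sandboxes, results):
    target = resolver.resolve(req.rufid, tenant=req.tenant_id)          # 1. intake
    job = queue.dispatch(target, shard=(req.tenant_id, req.priority))   # 2. scheduling
    sandbox = sandboxes.create(target, limits=target.limits)            # 3. sandbox prep
    try:
        result = sandbox.run(req.payload)                               # 4. execution
        results.save(job.id, result)                                    # 5. finalization
        return result
    finally:
        sandbox.teardown()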

Key Components

  • ExecutionEngine — High-level orchestrator that validates requests, picks runtimes, and triggers execution.
  • SandboxManager — Creates isolated runtimes (Docker, Firecracker, or native process) with per-tenant limits.
  • ResourceLimitsEnforcer — Applies CPU/memory/time budgets and interrupts long-running jobs.
  • MetricsCollector — Emits latency, cost, and queue metrics consumed by the telemetry baseline.
  • ResultStore — Persists outputs, logs, and artefacts for later retrieval.

Find implementation details under src/services/execution_engine/ and related tests in tests/unit/execution/.
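
One way to picture the ResourceLimitsEnforcer for a native-process sandbox is the standard-library resource module, which can cap CPU time and memory before the child process starts. This is a simplified sketch under that assumption, not the production enforcer:

# Illustrative only (POSIX): budget enforcement for a native-process sandbox.
import resource
import subprocess

def run_with_limits(cmd, cpu_seconds=30, memory_mb=512, wall_clock_seconds=60):
    def apply_limits():
        # CPU-time and address-space caps are inherited by the child process.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        limit_bytes = memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    # timeout= enforces the wall-clock budget from the parent side.
    return subprocess.run(cmd, preexec_fn=apply_limits,
                          timeout=wall_clock_seconds, capture_output=True)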

Lifecycle Hooks

  1. Pre-execution — resolve manifests, hydrate secrets, load dependencies.
  2. Execution — stream logs, monitor resource usage, collect traces.
  3. Post-execution — persist results, emit events, run cleanup.

The helper _finalize_execution ensures sandboxes shut down even when errors occur; see tests/unit/execution/test_execution_engine_adapter.py.
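
Conceptually, the guarantee looks like the following simplified sketch (not the actual body of _finalize_execution):

# Simplified sketch of the finalization guarantee provided by the engine.
def execute_with_cleanup(sandbox, payload, result_store):
    try:
        result = sandbox.run(payload)
        result_store.save(result)
        return result
    finally:
        # Runs on success, failure, or timeout: tear down the sandbox,
        # release the worker slot, and emit observability signals.
        sandbox.teardown()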

Configurable Settings

  • EXECUTION_TIMEOUT_SECONDS
  • EXECUTION_MEMORY_LIMIT_MB
  • SANDBOX_RUNTIME (python, python-firecracker, etc.)
  • EXECUTION_CONCURRENCY (per worker)
  • ENABLE_TELEMETRY_TRACES

Use the Configuration Hardening guide to keep production profiles secure.
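
A minimal sketch of how a worker might read these settings from the environment (the defaults shown are placeholders, not the shipped values):

# Illustrative settings loader; defaults are placeholders only.
import os

EXECUTION_TIMEOUT_SECONDS = int(os.getenv("EXECUTION_TIMEOUT_SECONDS", "30"))
EXECUTION_MEMORY_LIMIT_MB = int(os.getenv("EXECUTION_MEMORY_LIMIT_MB", "512"))
SANDBOX_RUNTIME = os.getenv("SANDBOX_RUNTIME", "python")
EXECUTION_CONCURRENCY = int(os.getenv("EXECUTION_CONCURRENCY", "4"))
ENABLE_TELEMETRY_TRACES = os.getenv("ENABLE_TELEMETRY_TRACES", "false").lower() == "true"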

Observability

  • Metrics: relay.execution.duration, relay.execution.queue_delay, relay.execution.cost
  • Logs: structured JSON via the worker logger
  • Tracing: OpenTelemetry spans exported when configured

Verify instrumentation with scripts/monitoring/api_telemetry_smoke.sh and the Datadog dashboards in deployment/datadog/.
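
When traces are enabled, instrumentation along these lines records a span and a duration metric. This is a sketch against the OpenTelemetry Python API; the attribute names and exporter configuration are assumptions:

# Sketch of execution instrumentation; exporter setup is deployment-specific.
import time
from opentelemetry import metrics, trace

tracer = trace.get_tracer("relay.execution")
meter = metrics.get_meter("relay.execution")
duration_ms = meter.create_histogram("relay.execution.duration", unit="ms")

def traced_execution(sandbox, payload, tenant_id):
    start = time.monotonic()
    with tracer.start_as_current_span("relay.execution") as span:
        span.set_attribute("relay.tenant_id", tenant_id)
        result = sandbox.run(payload)
    duration_ms.record((time.monotonic() - start) * 1000, {"tenant_id": tenant_id})
    return result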

Branch-aware Execution (FN-102)

Execution traffic can be scoped to a specific branch to validate function changes without touching production callers. The API exposes two entry points:

  • Query parameter: POST /api/v1/functions/{rufid}/resolve?branch=dev
  • Header: X-Relay-Branch: dev

For non-main branches, the caller must also supply X-Relay-Branch-Key with the branch access token (currently the dev API key) or the request is rejected with 403 Branch access denied. The resolver and execution paths propagate the branch through caching, orchestration, and logging, ensuring cache keys are isolated per branch (rufid:{short}|branch:{name}) and preventing cross-branch data leakage.

Example cURL invocation:

curl -X POST "https://api.deployrelay.com/api/v1/functions/abcdef123456/resolve?branch=dev" \
-H 'X-Relay-Branch-Key: ak_dev_...'

Use the same branch headers when calling the execution endpoint:

curl -X POST "https://api.deployrelay.com/api/v1/functions/abcdef123456/execute" \
-H 'Content-Type: application/json' \
-H 'X-Relay-Branch: dev' \
-H 'X-Relay-Branch-Key: ak_dev_...' \
-d '{"input": {"ping": "pong"}}'

Both the resolve and execute handlers fall back to the main branch when no branch is provided. Branch payloads draw from the version ledger (function_version_ledger) and branch head mapping (function_branch_heads), allowing nightly snapshots and cache warmers to stay branch-aware as well.
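
A simplified sketch of that resolution order, using the documented cache-key format and table names (the helper itself is hypothetical):

# Hypothetical helper showing branch fallback and per-branch cache keys.
def resolve_branch_version(rufid_short, branch, branch_heads, version_ledger):
    # Both the resolve and execute handlers default to main when no branch
    # is supplied.
    branch = branch or "main"

    # Cache keys are isolated per branch: rufid:{short}|branch:{name}.
    cache_key = f"rufid:{rufid_short}|branch:{branch}"

    # function_branch_heads maps a branch to its head version; the payload
    # itself comes from the function_version_ledger entry.
    head_version = branch_heads[(rufid_short, branch)]
    payload = version_ledger[head_version]
    return cache_key, payload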

Extending the Engine

  • Runtime adapters — implement the runtime interface under src/services/execution_engine/runtimes/.
  • Sandbox providers — add new isolation strategies via SandboxManager plugins.
  • Result sinks — customize persistence by implementing the ResultStorePort.
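
As an example of the first extension point, a new runtime adapter might look like the sketch below; the interface is hypothetical, so check src/services/execution_engine/runtimes/ for the actual base class:

# Hypothetical runtime adapter sketch; the real interface may differ.
from typing import Protocol

class RuntimeAdapter(Protocol):
    name: str

    def execute(self, code: str, payload: dict, limits: dict) -> dict: ...

class NodeJsRuntime:
    """Illustrative adapter for the roadmap Node.js runtime."""
    name = "nodejs"

    def execute(self, code: str, payload: dict, limits: dict) -> dict:
        # Spawn the Node.js process inside the sandbox, pass the payload on
        # stdin, and return the parsed result (omitted for brevity).
        raise NotImplementedError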

Failure Modes & Runbooks

  • Dependency install failure — Symptom: logs show missing packages. Mitigation: validate manifests, pre-bake images, retry with cache disabled.
  • Timeout — Symptom: execution exceeds EXECUTION_TIMEOUT_SECONDS. Mitigation: profile code, increase the limit, or offload to an async workflow.
  • Sandbox exhaustion — Symptom: queue backlog rises. Mitigation: scale worker replicas, tune concurrency, enforce per-tenant limits.
  • Cost spike — Symptom: relay.execution.cost breaches budget. Mitigation: use budget guardrails (FN-057) to throttle or degrade.
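
As a rough illustration of the cost-spike scenario (not the FN-057 implementation), a guardrail compares accumulated cost against a tenant budget before admitting new work:

# Rough illustration of a cost guardrail; FN-057's actual behaviour may differ.
def admit_execution(tenant_id, cost_tracker, budgets):
    spent = cost_tracker.current_cost(tenant_id)  # e.g. derived from relay.execution.cost
    if spent >= budgets[tenant_id]:
        # Over budget: throttle or degrade instead of running at full cost.
        return False
    return True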