Execution Engine Deep Dive

The execution engine is responsible for preparing, running, and finalizing every function invocation. This guide explains its architecture, lifecycle, and extension points.

Architecture Overview

  1. Request Intake — API receives a function execution request (sync or async) and resolves the target RUFID plus tenant context.
  2. Scheduling — Work is dispatched to execution workers via the queue adapter (Redis or Postgres backends) with sharding by tenant and priority.
  3. Sandbox Prep — Workers hydrate the sandbox, install declared dependencies, and enforce resource limits (CPU, memory, wall-clock timeout).
  4. Execution — Code runs inside the managed runtime with pluggable adapters (Python today, Node.js and WASM on the roadmap).
  5. Finalization — Results, metrics, and logs are persisted. Cleanup hooks tear down the sandbox, release slots, and emit observability signals.
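
The flow above can be sketched end to end in a few lines. The class and method names below (ExecutionRequest, dispatch, create, and so on) are illustrative assumptions, not the actual interfaces under src/services/execution_engine/:

# Hypothetical sketch of the five-stage pipeline; names are illustrative.
from dataclasses import dataclass

@dataclass
class ExecutionRequest:
    rufid: str
    tenant_id: str
    payload: dict
    priority: int = 0

def handle_request(req: ExecutionRequest, resolver, queue, sandboxes, results):
    target = resolver.resolve(req.rufid, tenant=req.tenant_id)          # 1. intake
    job = queue.dispatch(target, shard=(req.tenant_id, req.priority))   # 2. scheduling
    sandbox = sandboxes.create(target, limits=target.limits)            # 3. sandbox prep
    try:
        result = sandbox.run(req.payload)                               # 4. execution
        results.save(job.id, result)                                    # 5. finalization
        return result
    finally:
        sandbox.teardown()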

Key Components

  • ExecutionEngine — High-level orchestrator that validates requests, picks runtimes, and triggers execution.
  • SandboxManager — Creates isolated runtimes (Docker, Firecracker, or native process) with per-tenant limits.
  • ResourceLimitsEnforcer — Applies CPU/memory/time budgets and interrupts long-running jobs.
  • MetricsCollector — Emits latency, cost, and queue metrics consumed by the telemetry baseline.
  • ResultStore — Persists outputs, logs, and artefacts for later retrieval.

Find implementation details under src/services/execution_engine/ and related tests in tests/unit/execution/.
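
One way to picture the ResourceLimitsEnforcer for a native-process sandbox is the standard-library resource module, which can cap CPU time and memory before the child process starts. This is a simplified sketch under that assumption, not the production enforcer:

# Illustrative only (POSIX): budget enforcement for a native-process sandbox.
import resource
import subprocess

def run_with_limits(cmd, cpu_seconds=30, memory_mb=512, wall_clock_seconds=60):
    def apply_limits():
        # CPU-time and address-space caps are inherited by the child process.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        limit_bytes = memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    # timeout= enforces the wall-clock budget from the parent side.
    return subprocess.run(cmd, preexec_fn=apply_limits,
                          timeout=wall_clock_seconds, capture_output=True)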

Lifecycle Hooks

  1. Pre-execution — resolve manifests, hydrate secrets, load dependencies.
  2. Execution — stream logs, monitor resource usage, collect traces.
  3. Post-execution — persist results, emit events, run cleanup.

The helper _finalize_execution ensures sandboxes shut down even when errors occur; see tests/unit/execution/test_execution_engine_adapter.py.
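
Conceptually, the guarantee looks like the following simplified sketch (not the actual body of _finalize_execution):

# Simplified sketch of the finalization guarantee provided by the engine.
def execute_with_cleanup(sandbox, payload, result_store):
    try:
        result = sandbox.run(payload)
        result_store.save(result)
        return result
    finally:
        # Runs on success, failure, or timeout: tear down the sandbox,
        # release the worker slot, and emit observability signals.
        sandbox.teardown()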

Configurable Settings

  • EXECUTION_TIMEOUT_SECONDS
  • EXECUTION_MEMORY_LIMIT_MB
  • SANDBOX_RUNTIME (python, python-firecracker, etc.)
  • EXECUTION_CONCURRENCY (per worker)
  • ENABLE_TELEMETRY_TRACES

Use the Configuration Hardening guide to keep production profiles secure.
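
A minimal sketch of how a worker might read these settings from the environment (the defaults shown are placeholders, not the shipped values):

# Illustrative settings loader; defaults are placeholders only.
import os

EXECUTION_TIMEOUT_SECONDS = int(os.getenv("EXECUTION_TIMEOUT_SECONDS", "30"))
EXECUTION_MEMORY_LIMIT_MB = int(os.getenv("EXECUTION_MEMORY_LIMIT_MB", "512"))
SANDBOX_RUNTIME = os.getenv("SANDBOX_RUNTIME", "python")
EXECUTION_CONCURRENCY = int(os.getenv("EXECUTION_CONCURRENCY", "4"))
ENABLE_TELEMETRY_TRACES = os.getenv("ENABLE_TELEMETRY_TRACES", "false").lower() == "true"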

Observability

  • Metrics: relay.execution.duration, relay.execution.queue_delay, relay.execution.cost
  • Logs: structured JSON via the worker logger
  • Tracing: OpenTelemetry spans exported when configured

Verify instrumentation with scripts/monitoring/api_telemetry_smoke.sh and the Datadog dashboards in deployment/datadog/.
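
When traces are enabled, instrumentation along these lines records a span and a duration metric. This is a sketch against the OpenTelemetry Python API; the attribute names and exporter configuration are assumptions:

# Sketch of execution instrumentation; exporter setup is deployment-specific.
import time
from opentelemetry import metrics, trace

tracer = trace.get_tracer("relay.execution")
meter = metrics.get_meter("relay.execution")
duration_ms = meter.create_histogram("relay.execution.duration", unit="ms")

def traced_execution(sandbox, payload, tenant_id):
    start = time.monotonic()
    with tracer.start_as_current_span("relay.execution") as span:
        span.set_attribute("relay.tenant_id", tenant_id)
        result = sandbox.run(payload)
    duration_ms.record((time.monotonic() - start) * 1000, {"tenant_id": tenant_id})
    return result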

Branch-aware Execution (FN-102)

Execution traffic can be scoped to a specific branch to validate function changes without touching production callers. The API exposes two entry points:

  • Query parameter: POST /api/v1/functions/{rufid}/resolve?branch=dev
  • Header: X-Relay-Branch: dev

For non-main branches, the caller must also supply X-Relay-Branch-Key with the branch access token (currently the dev API key) or the request is rejected with 403 Branch access denied. The resolver and execution paths propagate the branch through caching, orchestration, and logging, ensuring cache keys are isolated per branch (rufid:{short}|branch:{name}) and preventing cross-branch data leakage.

Example cURL invocation:

curl -X POST "https://api.deployrelay.com/api/v1/functions/abcdef123456/resolve?branch=dev" \
-H 'X-Relay-Branch-Key: ak_dev_...'

Use the same branch headers when calling the execution endpoint:

curl -X POST "https://api.deployrelay.com/api/v1/functions/abcdef123456/execute" \
-H 'Content-Type: application/json' \
-H 'X-Relay-Branch: dev' \
-H 'X-Relay-Branch-Key: ak_dev_...' \
-d '{"input": {"ping": "pong"}}'

Both the resolve and execute handlers fall back to the main branch when no branch is provided. Branch payloads draw from the version ledger (function_version_ledger) and branch head mapping (function_branch_heads), allowing nightly snapshots and cache warmers to stay branch-aware as well.
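
A simplified sketch of that resolution order, using the documented cache-key format and table names (the helper itself is hypothetical):

# Hypothetical helper showing branch fallback and per-branch cache keys.
def resolve_branch_version(rufid_short, branch, branch_heads, version_ledger):
    # Both the resolve and execute handlers default to main when no branch
    # is supplied.
    branch = branch or "main"

    # Cache keys are isolated per branch: rufid:{short}|branch:{name}.
    cache_key = f"rufid:{rufid_short}|branch:{branch}"

    # function_branch_heads maps a branch to its head version; the payload
    # itself comes from the function_version_ledger entry.
    head_version = branch_heads[(rufid_short, branch)]
    payload = version_ledger[head_version]
    return cache_key, payload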

Extending the Engine

  • Runtime adapters — implement the runtime interface under src/services/execution_engine/runtimes/.
  • Sandbox providers — add new isolation strategies via SandboxManager plugins.
  • Result sinks — customize persistence by implementing the ResultStorePort.
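
As an example of the first extension point, a new runtime adapter might look like the sketch below; the interface is hypothetical, so check src/services/execution_engine/runtimes/ for the actual base class:

# Hypothetical runtime adapter sketch; the real interface may differ.
from typing import Protocol

class RuntimeAdapter(Protocol):
    name: str

    def execute(self, code: str, payload: dict, limits: dict) -> dict: ...

class NodeJsRuntime:
    """Illustrative adapter for the roadmap Node.js runtime."""
    name = "nodejs"

    def execute(self, code: str, payload: dict, limits: dict) -> dict:
        # Spawn the Node.js process inside the sandbox, pass the payload on
        # stdin, and return the parsed result (omitted for brevity).
        raise NotImplementedError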

Failure Modes & Runbooks

  • Dependency install failure — Symptom: logs show missing packages. Mitigation: validate manifests, pre-bake images, retry with cache disabled.
  • Timeout — Symptom: execution exceeds EXECUTION_TIMEOUT_SECONDS. Mitigation: profile code, increase the limit, or offload to an async workflow.
  • Sandbox exhaustion — Symptom: queue backlog rises. Mitigation: scale worker replicas, tune concurrency, enforce per-tenant limits.
  • Cost spike — Symptom: relay.execution.cost breaches budget. Mitigation: use budget guardrails (FN-057) to throttle or degrade.
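
As a rough illustration of the cost-spike scenario (not the FN-057 implementation), a guardrail compares accumulated cost against a tenant budget before admitting new work:

# Rough illustration of a cost guardrail; FN-057's actual behaviour may differ.
def admit_execution(tenant_id, cost_tracker, budgets):
    spent = cost_tracker.current_cost(tenant_id)  # e.g. derived from relay.execution.cost
    if spent >= budgets[tenant_id]:
        # Over budget: throttle or degrade instead of running at full cost.
        return False
    return True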