Execution Engine Deep Dive
The execution engine is responsible for preparing, running, and finalizing every function invocation. This guide explains its architecture, lifecycle, and extension points.
Architecture Overview
- Request Intake — API receives a function execution request (sync or async) and resolves the target RUFID plus tenant context.
- Scheduling — Work is dispatched to execution workers via the queue adapter (Redis or Postgres backends) with sharding by tenant and priority.
- Sandbox Prep — Workers hydrate the sandbox, install declared dependencies, and enforce resource limits (CPU, memory, wall-clock timeout).
- Execution — Code runs inside the managed runtime with pluggable adapters (Python today; Node.js and WASM are on the roadmap).
- Finalization — Results, metrics, and logs persist. Cleanup hooks tear down the sandbox, release slots, and emit observability signals.
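The scheduling step shards work by tenant and priority. One way this could look is sketched below; the `exec:{priority}:{shard}` queue naming, the shard count, and the `shard_for` helper are illustrative assumptions, not the engine's actual scheme.

```python
import hashlib


def shard_for(tenant_id: str, priority: str, shard_count: int = 8) -> str:
    """Return a hypothetical queue name for a tenant/priority pair."""
    # Use a stable hash rather than Python's per-process randomized hash(),
    # so every worker maps the same tenant to the same shard.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    # Priority is part of the queue name so urgent work can be drained first.
    return f"exec:{priority}:{shard}"
```

Keeping a tenant on a single shard per priority makes it easier to apply per-tenant limits in one place, at the cost of uneven load when one tenant dominates traffic.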
Key Components
| Component | Description |
|---|---|
| ExecutionEngine | High-level orchestrator that validates requests, picks runtimes, and triggers execution. |
| SandboxManager | Creates isolated runtimes (Docker, Firecracker, or native process) with per-tenant limits. |
| ResourceLimitsEnforcer | Applies CPU/memory/time budgets and interrupts long-running jobs. |
| MetricsCollector | Emits latency, cost, and queue metrics consumed by the telemetry baseline. |
| ResultStore | Persists outputs, logs, and artefacts for later retrieval. |
Find implementation details under src/services/execution_engine/ and related tests in tests/unit/execution/.
Lifecycle Hooks
- Pre-execution — resolve manifests, hydrate secrets, load dependencies.
- Execution — stream logs, monitor resource usage, collect traces.
- Post-execution — persist results, emit events, run cleanup.
The helper _finalize_execution ensures sandboxes shut down even when errors occur; see tests/unit/execution/test_execution_engine_adapter.py.
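The guarantee that sandboxes shut down even on errors is the classic try/finally pattern. The sketch below illustrates that pattern only; `FakeSandbox` and `execute_with_cleanup` are hypothetical stand-ins and do not reproduce the real `_finalize_execution` helper.

```python
from dataclasses import dataclass


@dataclass
class FakeSandbox:
    """Stand-in for a real sandbox; records whether it was shut down."""

    closed: bool = False

    def run(self, fn, payload):
        return fn(payload)

    def shutdown(self) -> None:
        self.closed = True


def execute_with_cleanup(sandbox, fn, payload) -> dict:
    """Run fn in the sandbox and always shut the sandbox down afterwards."""
    try:
        return {"status": "ok", "result": sandbox.run(fn, payload)}
    except Exception as exc:
        # Report the failure to the caller instead of leaking the exception.
        return {"status": "error", "error": str(exc)}
    finally:
        # Runs on success, on failure, and on early return.
        sandbox.shutdown()
```

Because the cleanup lives in `finally`, a function that raises still leaves `sandbox.closed` set to `True`.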
Configurable Settings
- EXECUTION_TIMEOUT_SECONDS
- EXECUTION_MEMORY_LIMIT_MB
- SANDBOX_RUNTIME (python, python-firecracker, etc.)
- EXECUTION_CONCURRENCY (per worker)
- ENABLE_TELEMETRY_TRACES
Use the Configuration Hardening guide to keep production profiles secure.
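As an illustration of how a worker might read these settings, the sketch below loads them from environment variables into a typed object. The `ExecutionSettings` class, the `load_settings` helper, and every default value are assumptions made for the example; only the variable names come from this guide.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionSettings:
    timeout_seconds: int
    memory_limit_mb: int
    sandbox_runtime: str
    concurrency: int
    telemetry_traces: bool


def load_settings(env=None) -> ExecutionSettings:
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    # All defaults below are placeholders for illustration, not real defaults.
    traces = env.get("ENABLE_TELEMETRY_TRACES", "false").lower()
    return ExecutionSettings(
        timeout_seconds=int(env.get("EXECUTION_TIMEOUT_SECONDS", "30")),
        memory_limit_mb=int(env.get("EXECUTION_MEMORY_LIMIT_MB", "256")),
        sandbox_runtime=env.get("SANDBOX_RUNTIME", "python"),
        concurrency=int(env.get("EXECUTION_CONCURRENCY", "4")),
        telemetry_traces=traces in {"1", "true", "yes"},
    )
```

Parsing everything once at startup means a malformed value (for example a non-numeric timeout) fails fast with a `ValueError` instead of surfacing mid-execution.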
Observability
- Metrics: relay.execution.duration, relay.execution.queue_delay, relay.execution.cost
- Logs: structured JSON via the worker logger
- Tracing: OpenTelemetry spans exported when configured
Verify instrumentation with scripts/monitoring/api_telemetry_smoke.sh and the Datadog dashboards in deployment/datadog/.
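For readers unfamiliar with structured JSON logging, here is a minimal sketch using only the standard library. The `JsonFormatter` class and the `rufid`, `tenant`, and `duration_ms` field names are assumptions for illustration; the actual worker logger may use different fields or a dedicated library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("rufid", "tenant", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and log with `logger.info("execution finished", extra={"rufid": "abcdef123456"})`.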
Branch-aware Execution (FN-102)
Execution traffic can be scoped to a specific branch to validate function changes without touching production callers. The API accepts the branch in two ways:
- Query parameter: POST /api/v1/functions/{rufid}/resolve?branch=dev
- Header: X-Relay-Branch: dev
For non-main branches, the caller must also supply X-Relay-Branch-Key with the branch access token (currently the dev API key) or the request is rejected with 403 Branch access denied. The resolver and execution paths propagate the branch through caching, orchestration, and logging, ensuring cache keys are isolated per branch (rufid:{short}|branch:{name}) and preventing cross-branch data leakage.
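The access rule and the cache key format described above can be sketched as follows. The function names are hypothetical, and the sketch assumes the main branch is literally named `main`; only the key format `rufid:{short}|branch:{name}` and the `403 Branch access denied` response come from this guide.

```python
import hmac

MAIN_BRANCH = "main"  # assumption: the main branch is named "main"


def branch_cache_key(short_rufid: str, branch: str = MAIN_BRANCH) -> str:
    """Build a cache key that is isolated per branch."""
    return f"rufid:{short_rufid}|branch:{branch}"


def check_branch_access(branch: str, supplied_key, expected_key: str):
    """Return (status_code, message) for a branch-scoped request."""
    if branch == MAIN_BRANCH:
        # The main branch needs no branch key.
        return 200, "ok"
    if supplied_key is None:
        return 403, "Branch access denied"
    # Constant-time comparison avoids leaking key contents through timing.
    if not hmac.compare_digest(supplied_key.encode(), expected_key.encode()):
        return 403, "Branch access denied"
    return 200, "ok"
```

Because the branch name is part of the key, a value cached for `dev` can never be served to a caller on `main`, even when both refer to the same RUFID.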
Example cURL invocation:
curl -X POST "https://api.deployrelay.com/api/v1/functions/abcdef123456/resolve?branch=dev" \
-H 'X-Relay-Branch-Key: ak_dev_...'
Use the same branch headers when calling the execution endpoint:
curl -X POST "https://api.deployrelay.com/api/v1/functions/abcdef123456/execute" \
-H 'Content-Type: application/json' \
-H 'X-Relay-Branch: dev' \
-H 'X-Relay-Branch-Key: ak_dev_...' \
-d '{"input": {"ping": "pong"}}'
Both the resolve and execute handlers fall back to the main branch when no branch is provided. Branch payloads draw from the version ledger (function_version_ledger) and branch head mapping (function_branch_heads), allowing nightly snapshots and cache warmers to stay branch-aware as well.
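A rough sketch of the fallback and lookup flow is shown below. The precedence of the query parameter over the header is an assumption (this guide does not state which wins), and the dictionaries merely stand in for the `function_branch_heads` and `function_version_ledger` tables, whose real schemas are not shown here.

```python
def resolve_branch(query_params: dict, headers: dict, default: str = "main") -> str:
    """Pick the branch from the request, falling back to the main branch."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Assumption: the query parameter takes precedence over the header.
    return query_params.get("branch") or normalized.get("x-relay-branch") or default


def lookup_payload(rufid: str, branch: str, branch_heads: dict, ledger: dict):
    """Follow the branch head to its version entry; None if the branch is unknown."""
    version_id = branch_heads.get((rufid, branch))
    if version_id is None:
        return None
    return ledger.get(version_id)
```

For example, with `branch_heads = {("abcdef123456", "dev"): "v7"}` and `ledger = {"v7": {"code": "..."}}`, a request carrying `X-Relay-Branch: dev` resolves to the `v7` entry, while a request with no branch information looks up the `main` head instead.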
Extending the Engine
- Runtime adapters — implement the runtime interface under src/services/execution_engine/runtimes/.
- Sandbox providers — add new isolation strategies via SandboxManager plugins.
- Result sinks — customize persistence by implementing the ResultStorePort.
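To give a feel for what a runtime adapter might look like, here is a minimal sketch. The `RuntimeAdapter` protocol, its `execute` signature, the `EchoRuntime` class, and the registry are all hypothetical; consult the interface under src/services/execution_engine/runtimes/ for the real contract.

```python
from typing import Any, Dict, Protocol


class RuntimeAdapter(Protocol):
    """Hypothetical shape of a runtime adapter."""

    name: str

    def execute(self, code: str, payload: Dict[str, Any]) -> Any:
        ...


class EchoRuntime:
    """Toy adapter that ignores the code and returns the payload unchanged."""

    name = "echo"

    def execute(self, code: str, payload: Dict[str, Any]) -> Any:
        return payload


REGISTRY: Dict[str, RuntimeAdapter] = {}


def register(adapter: RuntimeAdapter) -> None:
    """Add an adapter to the registry, refusing duplicate names."""
    if adapter.name in REGISTRY:
        raise ValueError(f"runtime {adapter.name!r} is already registered")
    REGISTRY[adapter.name] = adapter
```

Using a structural `Protocol` means an adapter only needs the right attributes and methods; it does not have to inherit from a base class.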
Failure Modes & Runbooks
| Scenario | Symptom | Mitigation |
|---|---|---|
| Dependency install failure | Logs show missing packages | Validate manifests, pre-bake images, retry with cache disabled |
| Timeout | Execution exceeds EXECUTION_TIMEOUT_SECONDS | Profile code, increase limit, or offload to async workflow |
| Sandbox exhaustion | Queue backlog rises | Scale worker replicas, tune concurrency, enforce per-tenant limits |
| Cost spike | relay.execution.cost breaches budget | Use budget guardrails (FN-057) to throttle or degrade |
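The timeout scenario in the table can be reproduced locally with a small sketch that enforces a wall-clock limit on a child process. This illustrates the general technique using the standard library; `run_with_timeout` is a hypothetical helper and is not how `ResourceLimitsEnforcer` is implemented.

```python
import subprocess
import sys


def run_with_timeout(source: str, timeout_seconds: float) -> dict:
    """Run Python source in a child process under a wall-clock limit."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"status": "timeout"}
    status = "ok" if completed.returncode == 0 else "error"
    return {"status": status, "stdout": completed.stdout}
```

Calling `run_with_timeout("import time; time.sleep(5)", 0.5)` yields a result whose status is `timeout`, which is a convenient way to exercise timeout handling and alerting before changing EXECUTION_TIMEOUT_SECONDS.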