Playground Support Runbook
Operational checklist for the sandbox environment at playground.rufid.run.
Contacts
- Primary: Platform Ops (`#relay-playground`, playground-support@deployrelay.com)
- Secondary: Security Engineering (escalate sandbox escapes)
- DevOps on-call: `#relay-runtime`
Daily Maintenance
- GitHub Actions workflow `Playground Reset` (`.github/workflows/playground-reset.yml`) truncates database tables, flushes Redis, and posts status to Slack.
- Verify reset completion in Slack; rerun manually via workflow dispatch if failures occur.
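
The manual rerun mentioned above can also be triggered through GitHub's REST API instead of the Actions UI. The following is a minimal sketch, not the team's established procedure: the owner and repository names are placeholders, the `main` ref is an assumption, and the token is assumed to have permission to run Actions in that repository. Only the workflow file name comes from this runbook.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, ref, token):
    """Build (but do not send) a workflow_dispatch request for one workflow."""
    url = (
        f"{API_ROOT}/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # The API requires the git ref (branch or tag) the workflow should run on.
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def dispatch(request):
    """Send the request; GitHub replies 204 No Content when it is accepted."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 204
```

For example, `dispatch(build_dispatch_request("example-org", "example-repo", "playground-reset.yml", "main", token))` would request a rerun, where `example-org` and `example-repo` are hypothetical names. A 204 response only means the run was queued, so the Slack status post is still the signal that the reset completed.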
Monitoring
- Datadog monitors tagged `environment:playground` (latency, errors, queue delay).
- Cloudflare analytics: watch for rate-limit triggers or unusual geographies.
- Synthetic: `/healthz` check every 5 minutes; alerts routed to `#relay-playground`.
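
When a synthetic alert fires, it can help to reproduce the probe by hand. The sketch below is an illustration rather than the monitor's actual configuration: it assumes the endpoint is served over HTTPS at `playground.rufid.run/healthz` and that a 200 response means healthy. Neither assumption is confirmed in this runbook.

```python
import urllib.error
import urllib.request

# Assumed URL: the runbook gives the host and the path, not the scheme.
HEALTH_URL = "https://playground.rufid.run/healthz"


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth reporting.
        return err.code
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return None


def check(url=HEALTH_URL, fetch=fetch_status):
    """Probe the endpoint once and summarize the result."""
    status = fetch(url)
    # Assumption: only HTTP 200 counts as healthy.
    return {"url": url, "status": status, "healthy": status == 200}
```

Calling `check()` performs one live probe. The `fetch` parameter exists so the logic can be exercised without network access.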
Incident Response
- Contain: Disable playground services via Railway (`railway pause relay-playground-api` & worker) if abuse is detected.
- Investigate: Pull logs from Datadog and Slack notifications; inspect `scripts/maintenance/reset_playground.log` from the latest run.
- Remediate: Rotate sandbox secrets, run the reset script manually, and re-enable services after validation.
- Communicate: Update `#relay-playground` with status; file a post-incident summary in `dev_process/incidents/`.
Troubleshooting Guide
| Symptom | Action |
|---|---|
| 403 errors on catalog | Ensure `ALLOWED_NAMESPACES` includes the requested namespace, then run the reset script. |
| Excess latency | Check worker CPU/memory allocations, inspect Datadog metrics, scale the Railway worker. |
| Reset workflow failed | Review GitHub Actions logs, rerun with `workflow_dispatch`, verify secrets. |
| Abuse spike | Activate Cloudflare firewall rule, notify security, temporarily rate-limit offending IPs. |
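
For the 403 row above, a quick way to confirm whether a namespace is covered by `ALLOWED_NAMESPACES` is sketched below. It assumes the variable holds a comma-separated list of exact, case-sensitive names. The runbook does not specify the format, so the service's own parsing may differ (for example, wildcards or another delimiter).

```python
import os


def parse_namespaces(raw):
    """Split a comma-separated value into a set of trimmed, non-empty names."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def namespace_allowed(namespace, raw=None):
    """Check a namespace against ALLOWED_NAMESPACES (or an explicit value)."""
    if raw is None:
        # Assumption: the value is exposed as an environment variable.
        raw = os.environ.get("ALLOWED_NAMESPACES", "")
    return namespace in parse_namespaces(raw)
```

With the value `demo,catalog` (hypothetical names), `namespace_allowed("catalog", "demo,catalog")` is true and `namespace_allowed("billing", "demo,catalog")` is false.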
Reference
- Deployment guide: `docs/deployment/PLAYGROUND_GUIDE.md`
- Reset script: `scripts/maintenance/reset_playground.sh`
- Railway env sample: `deployment/playground/railway.env.sample`
- Marketing announcement template: `docs-portal/docs/marketing/playground-announcement.md`