Playground Support Runbook

Operational checklist for the sandbox environment at playground.rufid.run.

Contacts

Daily Maintenance

  • The GitHub Actions workflow Playground Reset (.github/workflows/playground-reset.yml) truncates database tables, flushes Redis, and posts status to Slack.
  • Confirm in Slack that the reset completed; if it failed, rerun the workflow manually via workflow dispatch.
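
A manual rerun can also be triggered through GitHub's REST endpoint for workflow dispatch events. The following is a minimal sketch, not the project's tooling: the owner, repository, and token arguments are placeholders, the "main" ref is an assumption, and the workflow must already declare a workflow_dispatch trigger.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"
WORKFLOW_FILE = "playground-reset.yml"  # file name under .github/workflows/


def build_dispatch_request(owner, repo, token, ref="main"):
    """Build (but do not send) a workflow_dispatch request for the reset workflow.

    owner, repo, and token are placeholders supplied by the caller;
    ref="main" is an assumption about the default branch.
    """
    url = (
        f"{GITHUB_API}/repos/{owner}/{repo}"
        f"/actions/workflows/{WORKFLOW_FILE}/dispatches"
    )
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


def rerun_reset(owner, repo, token, ref="main"):
    """Send the dispatch; GitHub answers 204 No Content when it is accepted."""
    request = build_dispatch_request(owner, repo, token, ref)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 204
```

Splitting request construction from sending keeps the URL and payload easy to inspect before anything is dispatched. The Slack status post remains the signal that the rerun actually completed.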

Monitoring

  • Datadog monitors tagged environment:playground (latency, errors, queue delay).
  • Cloudflare analytics: watch for rate-limit triggers or unusual geographies.
  • Synthetic: /healthz check every 5 minutes; alerts routed to #relay-playground.
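
For an ad hoc check between synthetic runs, the /healthz probe can be approximated locally. This is a rough sketch rather than the configured synthetic monitor: it assumes the endpoint is served over HTTPS at playground.rufid.run and that any 2xx status means healthy. The function names are illustrative.

```python
import urllib.error
import urllib.request

# Assumption: the health endpoint is reachable over HTTPS on the playground host.
HEALTHZ_URL = "https://playground.rufid.run/healthz"


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused connection, or timeout


def is_healthy(url=HEALTHZ_URL, fetch=fetch_status):
    """Treat any 2xx answer as healthy; anything else warrants a closer look."""
    status = fetch(url)
    return status is not None and 200 <= status < 300
```

The fetch parameter exists so the decision logic can be exercised without network access, for example is_healthy(fetch=lambda url: 503) evaluates to False.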

Incident Response

  1. Contain: If abuse is detected, disable the playground services via Railway (railway pause relay-playground-api & worker).
  2. Investigate: Pull logs from Datadog and the Slack notifications, and inspect scripts/maintenance/reset_playground.log from the latest run.
  3. Remediate: Rotate sandbox secrets, run the reset script manually, and re-enable services after validation.
  4. Communicate: Update #relay-playground with status and file a post-incident summary in dev_process/incidents/.
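
During the Investigate step, a quick pass over the reset log can surface the lines worth reading first. The sketch below is only a starting point: the log format is not documented here, so the marker words are hypothetical and should be adjusted to whatever the reset script actually writes.

```python
from pathlib import Path

RESET_LOG = Path("scripts/maintenance/reset_playground.log")
# Hypothetical markers; the real log format may use different wording.
FAILURE_MARKERS = ("error", "failed", "denied")


def suspicious_lines(lines, markers=FAILURE_MARKERS):
    """Return (line_number, text) pairs for lines containing any marker.

    Matching is case-insensitive; line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_reset_log(path=RESET_LOG):
    """Read the reset log from disk and return its suspicious lines."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return suspicious_lines(handle)
```

Run from the repository root, scan_reset_log() returns an empty list when no marker appears, which is not proof of a clean run; the Datadog logs remain the primary source.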

Troubleshooting Guide

| Symptom | Action |
| --- | --- |
| 403 errors on catalog | Ensure ALLOWED_NAMESPACES includes requested namespace, run reset script. |
| Excess latency | Check worker CPU/memory allocations, inspect Datadog metrics, scale Railway worker. |
| Reset workflow failed | Review GitHub Action logs, rerun with workflow_dispatch, verify secrets. |
| Abuse spike | Activate Cloudflare firewall rule, notify security, temporarily rate-limit offending IPs. |
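
For the 403 case, it can help to confirm what ALLOWED_NAMESPACES actually contains before rerunning the reset. The sketch below assumes the variable holds a comma-separated list, which this runbook does not confirm; check deployment/playground/railway.env.sample for the real format. The helper names are illustrative.

```python
import os


def parse_allowed_namespaces(raw):
    """Split a comma-separated value into a set of trimmed, non-empty names.

    Assumption: ALLOWED_NAMESPACES is comma-separated.
    """
    return {part.strip() for part in raw.split(",") if part.strip()}


def namespace_allowed(namespace, raw=None):
    """Check namespace against ALLOWED_NAMESPACES.

    When raw is None, the value is read from the process environment.
    """
    if raw is None:
        raw = os.environ.get("ALLOWED_NAMESPACES", "")
    return namespace in parse_allowed_namespaces(raw)
```

For example, namespace_allowed("demo", "demo, sandbox") is True, while an unset or empty variable rejects every namespace, which would match a blanket 403 on the catalog.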

Reference

  • Deployment guide: docs/deployment/PLAYGROUND_GUIDE.md
  • Reset script: scripts/maintenance/reset_playground.sh
  • Railway env sample: deployment/playground/railway.env.sample
  • Marketing announcement template: docs-portal/docs/marketing/playground-announcement.md