Regularly checks your API and services for downtime, errors, and performance degradation. Catches outages at 2am so you find out in minutes, not hours.
Monitor the health of our production services. Check the following:
1. API availability — hit all critical endpoints and confirm they return 200 status codes within acceptable response times
2. Error rate — review recent logs for any spike in 5xx errors or timeout patterns
3. Database connections — check for connection pool exhaustion or slow query warnings
4. Third-party dependencies — verify external APIs and services we depend on are responding
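For reference, checks 1 and 2 above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names `check_endpoint`, `http_status`, and `error_rate_5xx`, the 2-second latency budget, and the example URL are all hypothetical, and how you obtain recent status codes from your logs depends on your logging setup.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CheckResult:
    url: str
    status: Optional[int]  # None when no HTTP response was received
    elapsed_s: float
    healthy: bool


def http_status(url: str, timeout_s: float) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return err.code


def check_endpoint(
    url: str,
    max_latency_s: float = 2.0,  # assumed budget; tune per endpoint
    fetch: Callable[[str, float], int] = http_status,
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Healthy means: status 200 and a response within max_latency_s."""
    start = clock()
    try:
        status: Optional[int] = fetch(url, max_latency_s)
    except OSError:
        # Covers connection failures, DNS errors, and timeouts.
        status = None
    elapsed = clock() - start
    healthy = status == 200 and elapsed <= max_latency_s
    return CheckResult(url, status, elapsed, healthy)


def error_rate_5xx(statuses: Sequence[int]) -> float:
    """Fraction of recent responses that were 5xx (0.0 for an empty window)."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if 500 <= s <= 599) / len(statuses)


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your own critical URLs.
    print(check_endpoint("https://api.example.com/health"))
```

The `fetch` and `clock` parameters are injectable so the logic can be tested without touching the network. What counts as a "spike" in the error rate is left to you, since the right threshold depends on your normal traffic.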
If any service is degraded or down:
- Post immediately to Slack #incidents with severity level (P1/P2/P3)
- Include: what's affected, when it started, and suggested next steps
- If possible, identify the root cause from recent commits or config changes
If everything is healthy, log a brief "all systems operational" confirmation.
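The reporting rules above can be sketched as a small formatter. This is an illustrative sketch only: the `Incident` fields, the `format_report` name, and the message layout are assumptions, and actually delivering the text to Slack #incidents is left out because it depends on how your workspace is integrated.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Sequence


@dataclass
class Incident:
    severity: str          # "P1", "P2", or "P3"
    affected: str          # what's affected
    started_at: datetime   # when it started
    next_steps: str        # suggested next steps
    suspected_cause: str = ""  # optional: recent commit or config change


def format_report(incidents: Sequence[Incident]) -> str:
    """Build the message body: one line per incident, most severe first."""
    if not incidents:
        return "All systems operational."
    lines = []
    # "P1" < "P2" < "P3" as strings, so a plain sort puts P1 first.
    for inc in sorted(incidents, key=lambda i: i.severity):
        line = (
            f"[{inc.severity}] {inc.affected} | "
            f"started {inc.started_at.isoformat()} | "
            f"next steps: {inc.next_steps}"
        )
        if inc.suspected_cause:
            line += f" | suspected cause: {inc.suspected_cause}"
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example = Incident(
        severity="P2",
        affected="search API",
        started_at=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
        next_steps="roll back the last deploy",
    )
    print(format_report([example]))
    print(format_report([]))
```

Keeping the root cause optional mirrors the "if possible" wording: the report still goes out immediately even when no cause has been identified.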
| Setting | Value |
|---|---|
| Frequency | Every 30 minutes or every hour |
| Active Window | 24/7 — this is the whole point |
This is the difference between a 20-minute outage and a 6-hour one. A closed laptop at 2am shouldn't mean your users discover the problem before you do.