Observability

Why observability

Visibility into system health: “Are my systems up? Do they have enough resources?”
Error reporting & alerting: proactive alerts/anomaly detection, red/green at-a-glance.
Efficient troubleshooting: correlate metrics + logs + traces in one place; show an actionable next step (e.g. raise a quota).
Performance improvement: trends, capacity planning, change impact review (retrospectives, regression checks).

SRE definition: collecting, processing, aggregating, and displaying real-time quantitative data (queries, errors, latency, lifetimes…).
Practical prerequisites: sufficient capacity + automated testing + CI/CD + monitoring + blameless postmortems (transparency).

Dashboards for ops + biz.
Automated alerts → auto-remediation where possible (humans handle only critical issues).
Typical incident loop: trigger (outage/data loss/monitoring failure/manual) → check signals → assess impact/escalate → initial response → customer communication.
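
The alert-handling rule above (auto-remediate where possible, page humans only for critical issues) could be sketched roughly as follows. This is a minimal illustration, not a real alerting API: the `Alert` type, the two-level severity scheme, the `REMEDIATIONS` table, the `free_disk_space` handler, and the fallback "ticket" action are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    name: str      # hypothetical alert identifier, e.g. "disk_full"
    severity: str  # assumed two-level scheme: "critical" or "warning"


def free_disk_space(alert: Alert) -> bool:
    # Placeholder handler (assumption): a real one might rotate logs or
    # clear caches, then report whether the fix succeeded.
    return True


# Hypothetical table mapping alert names to automated remediation handlers.
REMEDIATIONS: Dict[str, Callable[[Alert], bool]] = {
    "disk_full": free_disk_space,
}


def route(alert: Alert) -> str:
    """Return the action taken: 'page', 'auto-remediated', or 'ticket'."""
    if alert.severity == "critical":
        # Critical issues always go to a human.
        return "page"
    handler = REMEDIATIONS.get(alert.name)
    if handler is not None and handler(alert):
        # A known, non-critical problem was fixed without human involvement.
        return "auto-remediated"
    # No automation available or it failed: queue for later review (assumption).
    return "ticket"
```

With these assumptions, `route(Alert("disk_full", "warning"))` is handled automatically, while the same alert at `"critical"` severity pages a person.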

Latency: directly affects user experience, gives early warning of issues, measures the effect of improvements.
- Examples: page load, service response time, TTFB/TTR, query/transaction duration, queued requests.
Traffic (load/request volume): current demand, capacity planning, cost estimation.
- Examples: HTTP RPS, static vs dynamic ratio, network I/O, TPS, concurrent sessions, active connections, read/write ops.
Saturation: how full the system is; focus on the bottleneck resource; performance degrades as it saturates.
- Examples: %CPU/%Mem/%Disk/%Cache, thread-pool usage, quotas, available connections/users.
Errors: failure signals, SLO violations, alert triggers.
- Examples: wrong/invalid results, 4xx/5xx, failed requests, exceptions/stack traces, liveness fail, dropped connections.
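
As a rough illustration of how the four signals above might be derived from raw request data, here is a minimal sketch. It assumes a hypothetical `Request` record, a fixed observation window, a nearest-rank percentile, and a thread pool as the bottleneck resource; it also counts every 4xx/5xx status as an error, following the examples listed above. Real systems would normally get these numbers from a metrics backend rather than compute them by hand.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   busy_threads: int, pool_size: int) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 400)  # 4xx and 5xx
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        # Saturation of the assumed bottleneck resource (thread pool).
        "saturation": busy_threads / pool_size,
    }


sample = [Request(10, 200), Request(20, 200), Request(30, 500), Request(40, 404)]
signals = golden_signals(sample, window_s=2.0, busy_threads=3, pool_size=10)
```

For the sample window, this gives a median latency of 20 ms, 2 requests per second, an error rate of 0.5, and a saturation of 0.3.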