Design a system that collects metrics (like CPU usage, latency) from thousands of servers, processes them, and allows for real-time dashboarding and alerting.
Scale: 10,000 servers, each emitting 10 core metrics (like CPU, memory, disk I/O, network) every 10 seconds
Features: dashboarding (real-time visualization) and alerting
Non-functional: high availability, low query latency
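The scale above implies a concrete ingest rate, which the following back-of-the-envelope sketch works out. The server count, metric count, and interval come from the requirements; the 16 bytes per data point and the one-request-per-server batching are assumptions made only for illustration, not figures from the problem statement.

```python
# Back-of-the-envelope ingest estimate from the stated scale.
SERVERS = 10_000
METRICS_PER_SERVER = 10
INTERVAL_SECONDS = 10

# Assumption: 8-byte timestamp + 8-byte float value, no tags, no compression.
BYTES_PER_POINT = 16

points_per_second = SERVERS * METRICS_PER_SERVER // INTERVAL_SECONDS
# Assumption: each server batches its metrics into one request per interval.
requests_per_second = SERVERS // INTERVAL_SECONDS
points_per_day = points_per_second * 86_400
raw_bytes_per_day = points_per_day * BYTES_PER_POINT

print(f"{points_per_second:,} points/s")
print(f"{requests_per_second:,} requests/s")
print(f"{points_per_day:,} points/day")
print(f"{raw_bytes_per_day / 1e9:.1f} GB/day raw")
```

Under these assumptions the system ingests about 10,000 data points per second, or roughly 864 million per day. Real storage needs depend on tags, indexing, compression, and retention, so the byte figure is only a rough starting point.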
We will install an agent on each server, which can collect data and send it back to a central server.
POST /api/data:
{
  "server_id": String,
  "timestamp": Timestamp,
  "data": {
    // data
  }
}
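As a minimal sketch of how an agent might build and send this request, the snippet below uses only the Python standard library. The collector host, the metric names (`cpu_percent`, `memory_percent`), the server ID, and the choice of Unix epoch seconds for `timestamp` are all assumptions; the original only specifies the three top-level fields.

```python
import json
import time
import urllib.request

# Hypothetical collector address; only the /api/data path comes from the API above.
COLLECTOR_URL = "http://collector.example.internal/api/data"


def build_payload(server_id, metrics, timestamp=None):
    """Build the request body for POST /api/data."""
    if timestamp is None:
        timestamp = int(time.time())  # assumption: Unix epoch seconds
    return {"server_id": server_id, "timestamp": timestamp, "data": dict(metrics)}


def send(payload, url=COLLECTOR_URL, timeout=5.0):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build (but do not send) one sample payload with illustrative values.
payload = build_payload(
    "web-001",
    {"cpu_percent": 42.5, "memory_percent": 63.0},
    timestamp=1_700_000_000,
)
print(json.dumps(payload, indent=2))
```

A real agent would call `send(payload)` on a 10-second timer and would likely add retries and local buffering, so that a brief collector outage does not lose samples.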

Source: System Design School https://systemdesignschool.io/problems/realtime-monitoring-system

| | Push System | Pull System |
|---|---|---|
| Advantages | 1. Real-time data delivery. 2. Efficient for large, infrequent data. | 1. On-demand data transfer. 2. Receiver has control. |
| Disadvantages | 1. Potential for data overload. 2. Less flexibility for receiver. | 1. Not real-time. 2. Can be inefficient for large, infrequent data. |
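The difference between the two models comes down to which side initiates the transfer. The in-memory sketch below illustrates that under simplified assumptions; the `Agent` and `Collector` classes and their method names are hypothetical, and there is no networking, scheduling, or failure handling.

```python
class Agent:
    """Holds the latest sample for one server."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.latest = {}

    def collect(self, metrics):
        self.latest = dict(metrics)

    def push_to(self, collector):
        # Push: the agent decides when data leaves the server.
        collector.receive(self.server_id, self.latest)

    def read(self):
        # Pull: the agent only exposes its current sample on request.
        return self.latest


class Collector:
    """Stores the most recent sample per server."""

    def __init__(self):
        self.store = {}

    def receive(self, server_id, metrics):
        self.store[server_id] = dict(metrics)

    def scrape(self, agents):
        # Pull: the collector decides when to read from each agent.
        for agent in agents:
            self.receive(agent.server_id, agent.read())


agent = Agent("web-001")
agent.collect({"cpu_percent": 42.5})

push_collector = Collector()
agent.push_to(push_collector)    # sender-initiated

pull_collector = Collector()
pull_collector.scrape([agent])   # receiver-initiated

print(push_collector.store == pull_collector.store)
```

Both paths end with the same stored data. What differs is who controls the timing, which is the trade-off the table summarizes: in the push model the agents set the rate and can overload the receiver, while in the pull model the collector sets the rate and sees data only as fresh as its last scrape.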