Design a system that collects metrics (like CPU usage, latency) from thousands of servers, processes them, and allows for real-time dashboarding and alerting.

Scale: 10,000 servers, each emitting 10 core metrics (like CPU, memory, disk I/O, network) every 10 seconds
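
As a rough back-of-envelope check, the sketch below works out the ingest rate implied by the scale above. The variable names are illustrative only, and the per-day figure assumes every server reports continuously with no gaps.

```python
# Back-of-envelope ingest rate from the stated scale.
SERVERS = 10_000
METRICS_PER_SERVER = 10
INTERVAL_SECONDS = 10

# Each server emits 10 metrics every 10 seconds, i.e. 1 data point per second.
points_per_second = SERVERS * METRICS_PER_SERVER // INTERVAL_SECONDS
points_per_day = points_per_second * 24 * 60 * 60

print(points_per_second)  # 10000
print(points_per_day)     # 864000000
```

So the system needs to absorb on the order of 10,000 data points per second, or roughly 864 million per day.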

Features: dashboarding (real-time visualization) and alerting

Non-functional requirements: high availability, low query latency

API Endpoint

We install an agent on each server. The agent collects metrics locally and sends them to a central server.

POST /api/data:

{
  "server_id": String,
  "timestamp": Timestamp,
  "data": {
    // data
  }
}
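
A minimal sketch of how an agent might build this request, using only the Python standard library. The function names, the metric keys (`cpu_usage`, `memory_usage`), the host `collector.example`, and the choice of Unix epoch seconds for `timestamp` are all assumptions made for illustration; the notes above do not specify them.

```python
import json
import time
import urllib.request


def build_payload(server_id: str, metrics: dict) -> dict:
    """Build a request body in the shape described above."""
    return {
        "server_id": server_id,
        "timestamp": int(time.time()),  # assumption: Unix epoch seconds
        "data": metrics,                # assumption: metric name -> value
    }


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST /api/data request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/api/data",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server ID, metric values, and collector host.
payload = build_payload("server-0001", {"cpu_usage": 0.42, "memory_usage": 0.63})
request = build_request("http://collector.example", payload)
print(request.get_method(), request.full_url)
```

The request is only constructed here, not sent. A real agent would pass it to `urllib.request.urlopen(request, timeout=...)` inside a loop that runs every 10 seconds, with retry handling for network failures.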

Design

Source: System Design School https://systemdesignschool.io/problems/realtime-monitoring-system

Push or Pull

| | Push System | Pull System |
|---|---|---|
| Advantages | 1. Real-time data delivery. 2. Efficient for large, infrequent data. | 1. On-demand data transfer. 2. Receiver has control. |
| Disadvantages | 1. Potential for data overload. 2. Less flexibility for receiver. | 1. Not real-time. 2. Can be inefficient for large, infrequent data. |
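
To make the difference concrete, here is a small in-memory sketch of the two models. The `Agent` and `Collector` classes and their methods are hypothetical names invented for this illustration, and there is no real networking: the point is only to show who initiates the transfer in each case.

```python
class Collector:
    """Central server that stores (server_id, value) samples."""

    def __init__(self):
        self.samples = []

    def receive(self, server_id, value):
        # Push model: the agent calls this whenever it has a new sample.
        self.samples.append((server_id, value))

    def scrape(self, agents):
        # Pull model: the collector decides when to ask each agent.
        for agent in agents:
            self.samples.append((agent.server_id, agent.read_metric()))


class Agent:
    """Per-server agent exposing one metric (hypothetical CPU usage value)."""

    def __init__(self, server_id, cpu_usage):
        self.server_id = server_id
        self.cpu_usage = cpu_usage

    def read_metric(self):
        return self.cpu_usage

    def push_to(self, collector):
        # The agent initiates the transfer.
        collector.receive(self.server_id, self.read_metric())


agents = [Agent("server-1", 0.30), Agent("server-2", 0.75)]

# Push: each agent sends its own sample.
push_collector = Collector()
for agent in agents:
    agent.push_to(push_collector)

# Pull: the collector walks the list of agents and reads from each.
pull_collector = Collector()
pull_collector.scrape(agents)

print(push_collector.samples == pull_collector.samples)  # True
```

Both paths end with the same samples. What differs is control: with push, the sender sets the pace, so the receiver can be overloaded; with pull, the receiver sets the pace, so data is only as fresh as the last scrape.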