Design a system that collects metrics (like CPU usage, latency) from thousands of servers, processes them, and allows for real-time dashboarding and alerting.

Scale: 10,000 servers, each emitting 10 core metrics (like CPU, memory, disk I/O, network) every 10 seconds
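
The scale above translates into an ingest rate with a little arithmetic. The sketch below works it out; the 16-bytes-per-point figure is an assumption made here for a rough storage estimate and does not come from the problem statement.

```python
# Back-of-the-envelope ingest rate from the stated scale.
SERVERS = 10_000
METRICS_PER_SERVER = 10
INTERVAL_SECONDS = 10

points_per_second = SERVERS * METRICS_PER_SERVER // INTERVAL_SECONDS
points_per_day = points_per_second * 86_400

# Assumption (not from the problem statement): roughly 16 bytes per stored
# point (8-byte timestamp + 8-byte float), before compression and indexing.
BYTES_PER_POINT = 16
raw_bytes_per_day = points_per_day * BYTES_PER_POINT

print(f"{points_per_second:,} points/s")
print(f"{points_per_day:,} points/day")
print(f"{raw_bytes_per_day / 1e9:.1f} GB/day raw")
```

So the system has to absorb about 10,000 data points per second, or 864 million per day, which is what drives the choice of a queue in front of the database.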

Features: dashboarding (real-time visualization) and alerting

Non-functional requirements: high availability, low query latency

API Endpoint

We will install an agent on each server, which can collect data and send it back to a central server.

POST /api/data:

{
  "server_id": String,
  "timestamp": Timestamp,
  "data": {
    // data
  }
}
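
As a rough illustration of what an agent might send, the sketch below builds a request body in the shape shown above. The function name `build_payload`, the server ID, the metric names, and the choice of Unix epoch seconds for the timestamp are all assumptions for the example; the notes do not fix any of them.

```python
import json
import time


def build_payload(server_id: str, metrics: dict[str, float]) -> str:
    """Serialize one report in the shape of the POST /api/data body."""
    body = {
        "server_id": server_id,
        "timestamp": int(time.time()),  # assumption: Unix epoch seconds
        "data": metrics,                # assumption: metric name -> value
    }
    return json.dumps(body)


# Hypothetical server ID and metric names, for illustration only.
payload = build_payload("web-042", {"cpu_usage": 0.73, "memory_usage": 0.41})
print(payload)
```

Sending all metrics for one interval in a single request keeps the request count at one per server per interval instead of one per metric.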

Design

server agents
- gather key metrics such as CPU usage, memory usage, system logs, and web server logs
- the data collection interval shouldn't be too short: collection itself consumes resources such as CPU cycles, so collecting too frequently can noticeably degrade the performance of the server being monitored.
a data ingestion system (Kafka/RabbitMQ)
a stream processing system (Flink/Storm)
a time-series database (InfluxDB/TimescaleDB)
an alert notification system

Source: System Design School https://systemdesignschool.io/problems/realtime-monitoring-system
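
To make the flow through the components listed above concrete, here is a minimal in-process sketch: a `deque` stands in for the ingestion queue, a small class stands in for the stream processor, and a plain list stands in for the alert notification system. The class name, window size, and threshold are hypothetical, and the time-series database is left out entirely. A real deployment would use Kafka/RabbitMQ and Flink/Storm rather than anything like this.

```python
from collections import defaultdict, deque
from statistics import mean

WINDOW = 3            # hypothetical: number of recent samples to average
CPU_THRESHOLD = 0.90  # hypothetical alert threshold


class AlertingProcessor:
    """Stand-in for the stream processor: one sliding window per server."""

    def __init__(self, window=WINDOW, threshold=CPU_THRESHOLD):
        self.threshold = threshold
        self.windows = defaultdict(lambda: deque(maxlen=window))
        self.alerts = []  # stand-in for the alert notification system

    def process(self, event):
        window = self.windows[event["server_id"]]
        window.append(event["data"]["cpu_usage"])
        # Alert only once the window is full, so a single spike is ignored.
        if len(window) == window.maxlen and mean(window) > self.threshold:
            self.alerts.append(
                (event["server_id"], event["timestamp"], mean(window))
            )


# Stand-in for the ingestion queue: events in arrival order.
queue = deque(
    {"server_id": "web-042", "timestamp": t, "data": {"cpu_usage": cpu}}
    for t, cpu in [(0, 0.50), (10, 0.95), (20, 0.97), (30, 0.99)]
)

processor = AlertingProcessor()
while queue:
    processor.process(queue.popleft())

print(processor.alerts)
```

With these four samples, only the last one produces an alert: the window is full at the third sample but its average is still below the threshold, and the fourth sample pushes the average to about 0.97.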

Push or Pull

Push system
- Advantages: real-time data delivery; efficient for large, infrequent data
- Disadvantages: potential for data overload; less flexibility for the receiver
Pull system
- Advantages: on-demand data transfer; the receiver has control
- Disadvantages: not real-time; can be inefficient for large, infrequent data
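
The difference between the two models comes down to who initiates the transfer. The toy sketch below shows both against the same collector; `Collector`, `Agent`, and their method names are hypothetical, and the fixed CPU values stand in for real readings.

```python
class Collector:
    """Central server that ends up holding the samples in both models."""

    def __init__(self):
        self.samples = []

    def receive(self, sample):
        # Push model: an agent calls this whenever it has data.
        self.samples.append(sample)

    def scrape(self, agents):
        # Pull model: the collector decides when to ask each agent.
        for agent in agents:
            self.samples.append(agent.read())


class Agent:
    def __init__(self, server_id, cpu_usage):
        self.server_id = server_id
        self.cpu_usage = cpu_usage  # fixed value standing in for a reading

    def read(self):
        return {"server_id": self.server_id, "cpu_usage": self.cpu_usage}

    def push_to(self, collector):
        collector.receive(self.read())


agents = [Agent("web-001", 0.35), Agent("web-002", 0.80)]

push_collector = Collector()
for agent in agents:
    agent.push_to(push_collector)  # agents initiate the transfer

pull_collector = Collector()
pull_collector.scrape(agents)      # collector initiates the transfer

print(push_collector.samples == pull_collector.samples)
```

Both collectors end up with the same samples; what changes is which side controls the timing and therefore which side has to cope with overload or staleness.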