Two popular data brokers - Kafka, Flume
- Flume - specifically for log aggregation
- Kafka - general purpose
- MQTT - lightweight for IoT, not suitable for big data streams
Kafka (note: details are not testable)
- Implemented in Scala + some Java
- Use cases:
- LinkedIn
- Activity streams, operational metrics, data bus
- 400 nodes, 18k topics, 220B messages daily
- Netflix
- real-time monitoring, event processing
- See full list in https://student.cs.uwaterloo.ca/~cs451/slides/08 - Beyond Batch Processing.pdf
- Very fast
- Can get up to 2 million writes/second on 3 cheap machines
- This is with 3 producers on 3 different machines
- Achieved with fast writes - all data is eventually persisted to disk, but writes first go to the OS page cache (RAM) and are flushed asynchronously
- Fast reads - data is transferred from the page cache straight to the network socket (zero-copy, via sendfile); see the sketch below
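A minimal sketch of the zero-copy idea in Java (the file name, host, and port are hypothetical). `FileChannel.transferTo` lets the kernel move bytes from the page cache directly to a socket without copying them through user space - this maps to `sendfile` on Linux and is the mechanism behind Kafka's cheap reads:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Hypothetical log segment and destination; adjust for your setup.
        try (FileChannel file = FileChannel.open(Paths.get("segment.log"),
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```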

- Producers send messages to the cluster; the messages are divided among partitions
- Partitions are replicated on other nodes in cluster
- e.g. 2x replication - R = 2 replicas
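As a hedged sketch (the topic name, partition count, and broker address are made up), creating a topic whose 6 partitions are each replicated on 2 brokers (R = 2), using Kafka's Java AdminClient:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread across the cluster, each with 2 replicas (R = 2).
            NewTopic topic = new NewTopic("activity-events", 6, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```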

- Messages from a producer are assigned to partitions by hashing the message key (if one exists) or by round-robin balancing, then appended to the partition's log
- Batches of messages carry a unique identifier (producer ID + sequence number) - brokers log these and reject duplicates
- This is how one may achieve "write exactly once" semantics (Kafka's idempotent producer); see the producer sketch below
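A minimal producer sketch in Java (the topic name, broker address, keys, and values are hypothetical). `enable.idempotence` turns on the duplicate-rejection behavior described above; the keyed send always hashes to the same partition, while the keyless send is balanced across partitions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Brokers track (producer ID, sequence number) per batch and reject duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: partition chosen by hashing "user-42",
            // so this key always lands on the same partition.
            producer.send(new ProducerRecord<>("activity-events", "user-42", "clicked"));
            // Keyless record: balanced across partitions by the default partitioner.
            producer.send(new ProducerRecord<>("activity-events", "viewed"));
            producer.flush();
        }
    }
}
```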
Flume
- Distributed service for collecting, aggregating, and moving large amounts of log data; an agent wires sources to sinks through channels (example configuration below)
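A minimal sketch of a Flume agent configuration (the agent name, log path, and HDFS address are hypothetical), tailing a log file into HDFS through an in-memory channel:

```
# a1: a single Flume agent with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log (hypothetical path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write aggregated events to HDFS (hypothetical NameNode address)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/logs/
a1.sinks.k1.channel = c1
```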