Two popular data brokers - Kafka, Flume
- Flume - specifically for log aggregation
- Kafka - general purpose
- MQTT - lightweight for IoT, not suitable for big data streams
Kafka (note: details are not testable)
- Implemented in Scala + some Java
- Use cases:
- LinkedIn
- Activity streams, operational metrics, data bus
- 400 nodes, 18k topics, 220B messages daily
- Netflix
- real-time monitoring, event processing
- See full list in https://student.cs.uwaterloo.ca/~cs451/slides/08 - Beyond Batch Processing.pdf
- Very fast
- Can get up to 2 million writes/second on 3 cheap machines
- This is with 3 producers on 3 different machines
- Achieved with fast writes - all data is eventually persisted to disk, but writes first go to the OS page cache (RAM) and are flushed asynchronously
- Fast reads - data is transferred from the page cache straight to the network socket (zero-copy, via sendfile); see the sketch below
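A minimal sketch of the zero-copy idea in Java (the file name, host, and port are hypothetical). `FileChannel.transferTo` lets the kernel move bytes from the page cache directly to a socket without copying them through user space - this maps to `sendfile` on Linux and is the mechanism behind Kafka's cheap reads:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Hypothetical log segment and destination; adjust for your setup.
        try (FileChannel file = FileChannel.open(Paths.get("segment.log"),
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```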

- Producers send messages to the cluster; the messages are divided among partitions
- Partitions are replicated on other nodes in cluster
- e.g. 2x replication - R = 2 replicas
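As a hedged sketch (the topic name, partition count, and broker address are made up), creating a topic whose 6 partitions are each replicated on 2 brokers (R = 2), using Kafka's Java AdminClient:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread across the cluster, each with 2 replicas (R = 2).
            NewTopic topic = new NewTopic("activity-events", 6, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```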

- Messages from a producer are assigned to partitions by hashing the message key (if one exists) or by round-robin balancing, then appended to the partition's log
- Batches of messages carry a unique identifier (producer ID + sequence number) - brokers log these and reject duplicates
- This is how one may achieve "write exactly once" semantics (Kafka's idempotent producer); see the producer sketch below
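A minimal producer sketch in Java (the topic name, broker address, keys, and values are hypothetical). `enable.idempotence` turns on the duplicate-rejection behavior described above; the keyed send always hashes to the same partition, while the keyless send is balanced across partitions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Brokers track (producer ID, sequence number) per batch and reject duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: partition chosen by hashing "user-42",
            // so this key always lands on the same partition.
            producer.send(new ProducerRecord<>("activity-events", "user-42", "clicked"));
            // Keyless record: balanced across partitions by the default partitioner.
            producer.send(new ProducerRecord<>("activity-events", "viewed"));
            producer.flush();
        }
    }
}
```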
Flume
- Distributed service for collecting, aggregating, and moving large amounts of log data; an agent wires sources to sinks through channels (example configuration below)
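A minimal sketch of a Flume agent configuration (the agent name, log path, and HDFS address are hypothetical), tailing a log file into HDFS through an in-memory channel:

```
# a1: a single Flume agent with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log (hypothetical path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write aggregated events to HDFS (hypothetical NameNode address)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/logs/
a1.sinks.k1.channel = c1
```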