Background
Nowadays people use different compute systems for different kinds of workloads.
| | Batch Processing | OLAP | Streaming |
|---|---|---|---|
| Presto/Trino | ⭐️★★ | ⭐️⭐️⭐️ | ★★★ |
| Spark | ⭐️⭐️⭐️ | ⭐️★★ | ⭐️⭐️★ |
| Impala | ★★★ | ⭐️⭐️⭐️ | ★★★ |
| Flink | ⭐️★★ | ⭐️⭐️★ | ⭐️⭐️⭐️ |
Use cases
- Batch Processing: ETL jobs that may run for several hours or even days.
- OLAP: Short-running SQL queries that last from seconds to minutes.
- Streaming: Long-running jobs that continuously process streaming data with low latency.
Scores
★★★: Cannot handle this workload at all.
⭐️★★: Can handle some cases, but is not good at it.
⭐️⭐️★: Can handle most cases, but does not dominate the market.
⭐️⭐️⭐️: Dominates the market.
Problems
- Most of the above systems are written in Java, which is inefficient.
- They were designed for Hadoop, not for the cloud.
- No single system supports all workloads.
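The last point can be checked directly against the comparison table. The following is a minimal sketch, not part of the original design: it transcribes the star ratings as integers from 0 to 3, and the names `SCORES`, `systems_covering_all`, and `best_per_workload`, as well as the threshold of 2 ("can handle most cases"), are assumptions made for illustration.

```python
# Scores transcribed from the comparison table (number of filled stars):
# 0 = cannot handle at all, 1 = handles some cases,
# 2 = handles most cases, 3 = dominates the market.
SCORES = {
    "Presto/Trino": {"batch": 1, "olap": 3, "streaming": 0},
    "Spark": {"batch": 3, "olap": 1, "streaming": 2},
    "Impala": {"batch": 0, "olap": 3, "streaming": 0},
    "Flink": {"batch": 1, "olap": 2, "streaming": 3},
}


def systems_covering_all(scores, min_score=2):
    """Return systems scoring at least min_score on every workload.

    The default threshold of 2 is an assumption, not a figure from the document.
    """
    return [
        name
        for name, per_workload in scores.items()
        if all(score >= min_score for score in per_workload.values())
    ]


def best_per_workload(scores):
    """Map each workload to the sorted list of top-scoring systems."""
    workloads = next(iter(scores.values())).keys()
    result = {}
    for workload in workloads:
        top = max(per[workload] for per in scores.values())
        result[workload] = sorted(
            name for name, per in scores.items() if per[workload] == top
        )
    return result


if __name__ == "__main__":
    print(systems_covering_all(SCORES))
    print(best_per_workload(SCORES))
```

Under these assumptions, no system reaches the threshold on all three workloads, and each workload has a different top-scoring system, which is the gap a unified platform would have to close.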
Design
Design Goal
The goal is to design a unified platform for different computing workloads:
- Large scale batch processing