Spark Structured Streaming
A Spark engine for processing real-time streaming data
Input Sources
- Socket source : for learning and testing purposes only
- Rate source : for testing and benchmarking a Spark cluster (generates rows at a specified rate)
- File source : reads files placed in a directory as a stream (ordered by file modification time)
  - supported formats: text, CSV, JSON, ORC, Parquet, and more
  - files must be placed into the directory atomically, e.g. with a move command, so the reader never sees a partially written file
- Kafka source : requires Kafka broker version 0.10.0 or higher
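The atomic-placement rule for the file source can be illustrated with a minimal pure-Python sketch (not Spark code): write to a hidden temp file in the target directory, then rename it into place. On POSIX, `os.rename` within one filesystem is atomic, so a directory watcher never observes a half-written file. The function name and dot-prefix convention are illustrative assumptions.

```python
import os
import tempfile

def place_atomically(data: bytes, target_dir: str, name: str) -> str:
    # Write to a hidden temp file in the SAME directory (same filesystem),
    # so the final os.rename is atomic on POSIX. Spark's file source also
    # ignores files whose names start with "." while they are being written.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    final_path = os.path.join(target_dir, name)
    os.rename(tmp_path, final_path)  # atomic: file appears fully written
    return final_path
```

The same effect is achieved from the shell by writing the file elsewhere on the same filesystem and `mv`-ing it into the watched directory.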
Trigger Types
- unspecified (default) : micro-batch mode; the next micro-batch starts as soon as the previous one finishes (~100ms latency at best)
- Fixed interval :
  - previous batch duration < interval : wait until the interval boundary, then kick off the next batch
  - previous batch duration > interval : the next batch kicks off as soon as the previous one finishes
  - no new data : no micro-batch is kicked off
- One-time (deprecated) : process all available data in a single micro-batch, then stop on its own
- Available-now : similar to one-time, but processes the data in multiple micro-batches based on source options (e.g. maxFilesPerTrigger for the file source), then stops
- Continuous with fixed checkpoint interval : low (~1ms) end-to-end latency with at-least-once fault-tolerance guarantees
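The fixed-interval scheduling rule above can be sketched as a small pure-Python function (a model of the semantics, not Spark's implementation): the next batch starts at whichever comes later, the interval boundary or the end of the previous batch.

```python
def next_trigger_start(batch_start: float, batch_duration: float,
                       interval: float) -> float:
    """Model of fixed-interval trigger semantics (illustrative only).

    If the previous batch finishes before the next interval boundary,
    the next batch waits for the boundary; if it overruns the interval,
    the next batch starts immediately when the previous one finishes.
    """
    batch_end = batch_start + batch_duration
    boundary = batch_start + interval
    return max(batch_end, boundary)
```

For example, with a 10-second interval, a 3-second batch starting at t=0 triggers the next batch at t=10, while a 14-second batch triggers the next one immediately at t=14.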

Fault Tolerance
Spark's goal is end-to-end exactly-once semantics:
- do not miss any input records
- do not generate duplicate output records
Requirements
- track the read position (offset) using checkpointing and write-ahead logs
- replayable sources and idempotent sinks
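The offset-tracking idea can be sketched with a minimal pure-Python write-ahead offset log (an illustration of the mechanism, not Spark's checkpoint format): each batch's read position is appended to a durable log, so a restarted query resumes from the last recorded offset instead of missing or re-reading data.

```python
import json
import os

class OffsetLog:
    """Minimal write-ahead offset log (illustrative sketch).

    The planned read position is logged durably for each batch, so a
    restarted query can resume from the last committed offset.
    """

    def __init__(self, path: str):
        self.path = path

    def commit(self, batch_id: int, offset: dict) -> None:
        # Append one JSON line per batch; appending preserves history,
        # which lets a replayable source re-read exactly the same range.
        with open(self.path, "a") as f:
            f.write(json.dumps({"batch": batch_id, "offset": offset}) + "\n")

    def last(self):
        # On restart, recover the most recently committed batch, if any.
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            lines = f.read().splitlines()
        return json.loads(lines[-1]) if lines else None
```

Combined with a replayable source (re-read from the logged offset) and an idempotent sink (rewriting a batch has no extra effect), this log is what turns at-least-once replay into end-to-end exactly-once output.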