Spark Structured Streaming
A Spark engine for processing real-time streaming data
Input Sources
- Socket source : for learning and testing purposes only
- Rate source : for testing and benchmarking a Spark cluster (generates rows at a specified rate)
- File source : reads files placed in a directory as a stream (ordered by file modification time)
  - supported formats: text, CSV, JSON, ORC, Parquet, and more
  - files must be placed into the directory atomically, e.g. with a move command, so the reader never sees a partially written file
- Kafka source : requires Kafka broker version 0.10.0 or higher
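The atomic-placement rule for the file source can be illustrated with a minimal pure-Python sketch (not Spark code): write to a hidden temp file in the target directory, then rename it into place. On POSIX, `os.rename` within one filesystem is atomic, so a directory watcher never observes a half-written file. The function name and dot-prefix convention are illustrative assumptions.

```python
import os
import tempfile

def place_atomically(data: bytes, target_dir: str, name: str) -> str:
    # Write to a hidden temp file in the SAME directory (same filesystem),
    # so the final os.rename is atomic on POSIX. Spark's file source also
    # ignores files whose names start with "." while they are being written.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    final_path = os.path.join(target_dir, name)
    os.rename(tmp_path, final_path)  # atomic: file appears fully written
    return final_path
```

The same effect is achieved from the shell by writing the file elsewhere on the same filesystem and `mv`-ing it into the watched directory.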
Trigger Types
- unspecified (default) : micro-batch mode; the next micro-batch starts as soon as the previous one finishes (~100ms latency at best)
- Fixed interval :
  - previous batch duration < interval : wait until the interval boundary, then kick off the next batch
  - previous batch duration > interval : the next batch kicks off as soon as the previous one finishes
  - no new data : no micro-batch is kicked off
- One-time (deprecated) : process all available data in a single micro-batch, then stop on its own
- Available-now : similar to one-time, but processes the data in multiple micro-batches based on source options (e.g. maxFilesPerTrigger for the file source), then stops
- Continuous with fixed checkpoint interval : low (~1ms) end-to-end latency with at-least-once fault-tolerance guarantees
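The fixed-interval scheduling rule above can be sketched as a small pure-Python function (a model of the semantics, not Spark's implementation): the next batch starts at whichever comes later, the interval boundary or the end of the previous batch.

```python
def next_trigger_start(batch_start: float, batch_duration: float,
                       interval: float) -> float:
    """Model of fixed-interval trigger semantics (illustrative only).

    If the previous batch finishes before the next interval boundary,
    the next batch waits for the boundary; if it overruns the interval,
    the next batch starts immediately when the previous one finishes.
    """
    batch_end = batch_start + batch_duration
    boundary = batch_start + interval
    return max(batch_end, boundary)
```

For example, with a 10-second interval, a 3-second batch starting at t=0 triggers the next batch at t=10, while a 14-second batch triggers the next one immediately at t=14.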

Fault Tolerance
Spark's goal is end-to-end exactly-once semantics:
- do not miss any input records
- do not generate duplicate output records
Requirements
- track the read position (offset) using checkpointing and write-ahead logs
- replayable sources and idempotent sinks
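The offset-tracking idea can be sketched with a minimal pure-Python write-ahead offset log (an illustration of the mechanism, not Spark's checkpoint format): each batch's read position is appended to a durable log, so a restarted query resumes from the last recorded offset instead of missing or re-reading data.

```python
import json
import os

class OffsetLog:
    """Minimal write-ahead offset log (illustrative sketch).

    The planned read position is logged durably for each batch, so a
    restarted query can resume from the last committed offset.
    """

    def __init__(self, path: str):
        self.path = path

    def commit(self, batch_id: int, offset: dict) -> None:
        # Append one JSON line per batch; appending preserves history,
        # which lets a replayable source re-read exactly the same range.
        with open(self.path, "a") as f:
            f.write(json.dumps({"batch": batch_id, "offset": offset}) + "\n")

    def last(self):
        # On restart, recover the most recently committed batch, if any.
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            lines = f.read().splitlines()
        return json.loads(lines[-1]) if lines else None
```

Combined with a replayable source (re-read from the logged offset) and an idempotent sink (rewriting a batch has no extra effect), this log is what turns at-least-once replay into end-to-end exactly-once output.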