Stream processing frameworks include Apache Spark Streaming, Apache Storm, and Apache Flink.
With Spark Streaming, we often deal with discretized streams (DStreams). A StreamingContext (conventionally named ssc) is the entry point for creating them. A DStream has the RDD transforms plus its own transforms and actions.

Key concepts with Spark Streaming:
- DStream transformations (map, countByValue, reduce, join, etc.)
- Windowed transformations (window, countByValueAndWindow, etc.)
- Output operations:
  - saveAsHadoopFiles - save to HDFS
  - foreach - do anything with a batch of results

Example usage with Tweet streams:
Basic example: construct a DStream of tweets and count hashtags over the last second - the problem is that one second is a very small window. Instead, we can use .window to count over a 10-minute interval that slides every 1 second.
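In actual Spark code this windowed count would be expressed with the DStream API (e.g. a flatMap over hashtags followed by countByValueAndWindow). The plain-Python sketch below needs no Spark cluster and uses illustrative names throughout; it only shows the sliding-window semantics, naively recounting the whole window on every batch.

```python
from collections import Counter, deque

def hashtags(tweet):
    # Extract hashtag tokens from a tweet's text (illustrative tokenizer).
    return [w for w in tweet.split() if w.startswith("#")]

def windowed_counts(batches, window_len):
    # Yield hashtag counts over the last `window_len` batches,
    # recomputing the window from scratch each step (the naive approach).
    window = deque(maxlen=window_len)
    for batch in batches:
        window.append(batch)
        yield Counter(tag for b in window for tweet in b for tag in hashtags(tweet))

# Three 1-"second" batches of tweets, counted over a 2-batch window.
batches = [
    ["#spark is great", "loving #spark"],
    ["#flink vs #spark"],
    ["just tweeting"],
]
results = [dict(c) for c in windowed_counts(batches, window_len=2)]
```

Recounting the whole window every batch is wasteful for long windows, which motivates the incremental approach below.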
These windowed streams are fault tolerant - they are RDDs, and Spark handles recovery.
In general, incremental counting generalizes to many reduce operations. We need a function to “inverse reduce” (e.g. “subtract” for counting).
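A minimal sketch of the incremental idea in plain Python (names are illustrative, not Spark's API): maintain running counts, reduce in the batch entering the window, and "inverse reduce" (subtract) the batch leaving it, instead of recounting the entire window.

```python
from collections import Counter

def update_window(counts, entering, leaving):
    # Reduce: add counts for the batch entering the window.
    counts = counts + Counter(entering)
    # Inverse reduce: subtract counts for the batch leaving the window.
    # (Counter subtraction drops keys whose count reaches zero.)
    return counts - Counter(leaving)

# Maintain hashtag counts over a 2-batch sliding window.
batches = [["#a", "#b"], ["#a"], ["#c"], []]
window_len = 2
counts = Counter()
history = []
for i, batch in enumerate(batches):
    leaving = batches[i - window_len] if i >= window_len else []
    counts = update_window(counts, batch, leaving)
    history.append(dict(counts))
```

This is the same idea behind Spark's reduceByKeyAndWindow variant that takes an inverse-reduce function: each step costs work proportional to the batch size, not the window size.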
Spark Streaming is fast.