Stream processing frameworks include Apache Spark Streaming, Apache Storm, and Apache Flink.
With Spark Streaming, we often deal with discretized streams (DStreams). A StreamingContext (conventionally named ssc) is the entry point for creating them. A DStream has the RDD transforms plus its own transforms and actions.

Key concepts with Spark Streaming:
- DStream transformations (map, countByValue, reduce, join, etc.)
- Windowed transformations (window, countByValueAndWindow, etc.)
- Output operations:
  - saveAsHadoopFiles - save to HDFS
  - foreach - do anything with a batch of results

Example usage with Tweet streams:
Basic example: construct a DStream of tweets and count hashtags over the last second - the problem is that one second is a very small window. Instead, we can use .window to count over a 10-minute interval that slides every 1 second.
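In actual Spark code this windowed count would be expressed with the DStream API (e.g. a flatMap over hashtags followed by countByValueAndWindow). The plain-Python sketch below needs no Spark cluster and uses illustrative names throughout; it only shows the sliding-window semantics, naively recounting the whole window on every batch.

```python
from collections import Counter, deque

def hashtags(tweet):
    # Extract hashtag tokens from a tweet's text (illustrative tokenizer).
    return [w for w in tweet.split() if w.startswith("#")]

def windowed_counts(batches, window_len):
    # Yield hashtag counts over the last `window_len` batches,
    # recomputing the window from scratch each step (the naive approach).
    window = deque(maxlen=window_len)
    for batch in batches:
        window.append(batch)
        yield Counter(tag for b in window for tweet in b for tag in hashtags(tweet))

# Three 1-"second" batches of tweets, counted over a 2-batch window.
batches = [
    ["#spark is great", "loving #spark"],
    ["#flink vs #spark"],
    ["just tweeting"],
]
results = [dict(c) for c in windowed_counts(batches, window_len=2)]
```

Recounting the whole window every batch is wasteful for long windows, which motivates the incremental approach below.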
These windowed streams are fault tolerant - they are RDDs, and Spark handles recovery.
In general, incremental counting generalizes to many reduce operations. We need a function to “inverse reduce” (e.g. “subtract” for counting).
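A minimal sketch of the incremental idea in plain Python (names are illustrative, not Spark's API): maintain running counts, reduce in the batch entering the window, and "inverse reduce" (subtract) the batch leaving it, instead of recounting the entire window.

```python
from collections import Counter

def update_window(counts, entering, leaving):
    # Reduce: add counts for the batch entering the window.
    counts = counts + Counter(entering)
    # Inverse reduce: subtract counts for the batch leaving the window.
    # (Counter subtraction drops keys whose count reaches zero.)
    return counts - Counter(leaving)

# Maintain hashtag counts over a 2-batch sliding window.
batches = [["#a", "#b"], ["#a"], ["#c"], []]
window_len = 2
counts = Counter()
history = []
for i, batch in enumerate(batches):
    leaving = batches[i - window_len] if i >= window_len else []
    counts = update_window(counts, batch, leaving)
    history.append(dict(counts))
```

This is the same idea behind Spark's reduceByKeyAndWindow variant that takes an inverse-reduce function: each step costs work proportional to the batch size, not the window size.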
Spark Streaming is fast.