Comparing batch processing and stream processing:
- Batch Processing
- Involves processing all data at once
- Not a real-time operation
- Must wait a significant time for the task to complete
- Stream Processing
- Involves processing data as it arrives
- Close to real-time
- Expected to be low latency
Streaming is important for business intelligence:
- e.g. generating reports, creating monitoring dashboards, computing real-time statistics, checking for suspicious transactions, etc.
A typical data streaming pipeline looks like the following:
- Data sources (e.g. Twitter)
- Data ingestion systems
- Facilitate movement of data to the stream processing engine
- Stream processing engine
- Can write results to database
- Can push data to some application
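The pipeline stages above can be sketched as chained Python generators; the function names and sample events are hypothetical placeholders, and real systems would use an ingestion layer such as a message queue rather than a pass-through generator.

```python
# Minimal pipeline sketch: source -> ingestion -> processing engine -> sink.
# All names and data here are illustrative, not a real system's API.

def data_source():
    """Stand-in for an external source (e.g. a social-media feed)."""
    events = [
        {"user": "alice", "text": "streaming is fun"},
        {"user": "bob", "text": "batch jobs take a while"},
    ]
    for event in events:
        yield event

def ingest(source):
    """Ingestion layer: moves raw events toward the processing engine."""
    for event in source:
        yield event  # in practice, a message queue would sit here

def process(stream):
    """Processing engine: derive one result per incoming event."""
    for event in stream:
        yield {"user": event["user"], "words": len(event["text"].split())}

def sink(results):
    """Write results out, e.g. to a database or a dashboard."""
    return list(results)

output = sink(process(ingest(data_source())))
print(output)
```

Because each stage is a generator, events flow through one at a time rather than being collected into a full dataset first, which mirrors the streaming model.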

A data stream is a sequence of items (tuples) that is:
- Structured
- Ordered
- Can be ordered by timestamp or implicitly by arrival time
- Continuously arriving at high volume
- Arrival rates may not be uniform - there can be different peak times
- May not be possible to store or examine all data
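A data stream with these properties can be modeled as an iterator of structured tuples ordered by timestamp; the field names below are illustrative, not from any particular system.

```python
# A data stream as an iterator of timestamped tuples, ordered by
# an explicit timestamp field. Field names are hypothetical.
from typing import Iterator, NamedTuple

class Event(NamedTuple):
    timestamp: float  # explicit ordering key
    user: str
    value: int

def event_stream() -> Iterator[Event]:
    # A real stream would yield indefinitely at a non-uniform rate;
    # a short fixed list stands in for the unbounded sequence.
    yield Event(1.0, "alice", 10)
    yield Event(2.5, "bob", 20)
    yield Event(2.9, "alice", 5)

first = next(event_stream())
print(first.timestamp, first.user)  # → 1.0 alice
```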
Data streams are processed via:
- filter (selection), map, flatMap (projection/transformation)
- Group, aggregation, joins
- Note that these operations require having all data available ⇒ must be defined differently for a data stream
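The per-item operators can be sketched with Python's built-in `filter`, `map`, and `itertools.chain`; each applies to an element as it arrives, with no need for the full dataset.

```python
# filter / map / flatMap over a stream, one element at a time.
from itertools import chain

stream = iter([1, 2, 3, 4, 5])

evens = filter(lambda x: x % 2 == 0, stream)   # selection
doubled = map(lambda x: x * 2, evens)          # transformation
# flatMap: each element expands into zero or more output elements
result = list(chain.from_iterable([x, x + 1] for x in doubled))
print(result)  # → [4, 5, 8, 9]
```

Grouping, aggregation, and joins are the operators that break this model, since they would otherwise consume the unbounded stream forever.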
We must consider the semantics of:
- Aggregation/Grouping
- When to start?
- When to stop?
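One common answer to the start/stop question is windowing: the aggregate starts when a window opens and is emitted when the window closes. The sketch below uses tumbling (fixed-size, non-overlapping) windows keyed by event timestamp; the 10-second window size and the sample events are arbitrary choices for illustration.

```python
# Tumbling-window sum over timestamped events: each window covers a
# fixed, non-overlapping time span, so every aggregate has a clear
# start and end even though the stream itself is unbounded.
from collections import defaultdict

WINDOW_SIZE = 10  # seconds; arbitrary choice for this sketch

# (timestamp, value) tuples, ordered by timestamp
events = [(1, 5), (4, 3), (12, 7), (15, 1), (23, 2)]

sums = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    sums[window_start] += value  # aggregate closes when its window ends

print(dict(sums))  # → {0: 8, 10: 8, 20: 2}
```

Sliding and session windows answer the same start/stop question with overlapping or activity-based boundaries instead of fixed ones.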