Comparing batch processing and stream processing:
- Batch Processing
- Involves processing all data at once
- Not a real-time operation
- Must wait a significant time for the task to complete
- Stream Processing
- Involves processing data as it arrives
- Close to real-time
- Expected to be low latency
Streaming is important for business intelligence:
- e.g. generating reports, creating monitoring dashboards, computing real-time statistics, checking for suspicious transactions, etc.
A typical data streaming pipeline looks like the following:
- Data sources (e.g. Twitter)
- Data ingestion systems
- Facilitate movement of data to the stream processing engine
- Stream processing engine
- Can write results to database
- Can push data to some application
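The pipeline stages above can be sketched as chained Python generators; the function names and sample events are hypothetical placeholders, and real systems would use an ingestion layer such as a message queue rather than a pass-through generator.

```python
# Minimal pipeline sketch: source -> ingestion -> processing engine -> sink.
# All names and data here are illustrative, not a real system's API.

def data_source():
    """Stand-in for an external source (e.g. a social-media feed)."""
    events = [
        {"user": "alice", "text": "streaming is fun"},
        {"user": "bob", "text": "batch jobs take a while"},
    ]
    for event in events:
        yield event

def ingest(source):
    """Ingestion layer: moves raw events toward the processing engine."""
    for event in source:
        yield event  # in practice, a message queue would sit here

def process(stream):
    """Processing engine: derive one result per incoming event."""
    for event in stream:
        yield {"user": event["user"], "words": len(event["text"].split())}

def sink(results):
    """Write results out, e.g. to a database or a dashboard."""
    return list(results)

output = sink(process(ingest(data_source())))
print(output)
```

Because each stage is a generator, events flow through one at a time rather than being collected into a full dataset first, which mirrors the streaming model.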

A data stream is a sequence of items (tuples) that is:
- Structured
- Ordered
- Can be ordered by timestamp or implicitly by arrival time
- Continuously arriving at high volume
- Arrival rates may not be uniform - there can be different peak times
- May not be possible to store or examine all data
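A data stream with these properties can be modeled as an iterator of structured tuples ordered by timestamp; the field names below are illustrative, not from any particular system.

```python
# A data stream as an iterator of timestamped tuples, ordered by
# an explicit timestamp field. Field names are hypothetical.
from typing import Iterator, NamedTuple

class Event(NamedTuple):
    timestamp: float  # explicit ordering key
    user: str
    value: int

def event_stream() -> Iterator[Event]:
    # A real stream would yield indefinitely at a non-uniform rate;
    # a short fixed list stands in for the unbounded sequence.
    yield Event(1.0, "alice", 10)
    yield Event(2.5, "bob", 20)
    yield Event(2.9, "alice", 5)

first = next(event_stream())
print(first.timestamp, first.user)  # → 1.0 alice
```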
Data streams are processed via:
- filter (selection), map, flatMap (projection/transformation)
- Group, aggregation, joins
- Note that these operations require having all data available ⇒ must be defined differently for a data stream
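The per-item operators can be sketched with Python's built-in `filter`, `map`, and `itertools.chain`; each applies to an element as it arrives, with no need for the full dataset.

```python
# filter / map / flatMap over a stream, one element at a time.
from itertools import chain

stream = iter([1, 2, 3, 4, 5])

evens = filter(lambda x: x % 2 == 0, stream)   # selection
doubled = map(lambda x: x * 2, evens)          # transformation
# flatMap: each element expands into zero or more output elements
result = list(chain.from_iterable([x, x + 1] for x in doubled))
print(result)  # → [4, 5, 8, 9]
```

Grouping, aggregation, and joins are the operators that break this model, since they would otherwise consume the unbounded stream forever.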
We must consider the semantics of:
- Aggregation/Grouping
- When to start?
- When to stop?
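One common answer to the start/stop question is windowing: the aggregate starts when a window opens and is emitted when the window closes. The sketch below uses tumbling (fixed-size, non-overlapping) windows keyed by event timestamp; the 10-second window size and the sample events are arbitrary choices for illustration.

```python
# Tumbling-window sum over timestamped events: each window covers a
# fixed, non-overlapping time span, so every aggregate has a clear
# start and end even though the stream itself is unbounded.
from collections import defaultdict

WINDOW_SIZE = 10  # seconds; arbitrary choice for this sketch

# (timestamp, value) tuples, ordered by timestamp
events = [(1, 5), (4, 3), (12, 7), (15, 1), (23, 2)]

sums = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    sums[window_start] += value  # aggregate closes when its window ends

print(dict(sums))  # → {0: 8, 10: 8, 20: 2}
```

Sliding and session windows answer the same start/stop question with overlapping or activity-based boundaries instead of fixed ones.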