An extremely high volume of data is produced daily and needs to be stored.
- the only reason to delete data is if the cost of keeping it is too high
The price of storage has been decreasing over time.
Idea of data scale:
- Facebook generates 4 PB/day (4 million GB)
- 500 million new tweets per day (around 60 GB just for text)
- 720,000 hours of new YouTube videos per day
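A quick back-of-envelope check of the tweet figure above. The ~120 bytes of text per tweet is an assumed average for illustration, not a figure from the source:

```python
# Back-of-envelope: daily text volume of 500 million tweets.
# The 120 bytes/tweet average is an assumption for illustration.
tweets_per_day = 500_000_000
avg_bytes_per_tweet = 120  # assumed average text size

total_bytes = tweets_per_day * avg_bytes_per_tweet
total_gb = total_bytes / 10**9
print(f"{total_gb:.0f} GB/day")  # 60 GB/day, matching the estimate above
```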
Main sources of data:
- businesses
    - mainly needed for data-driven decision making and product design, as well as targeted advertising
    - business intelligence - encompasses data warehousing, data mining, analytics
    - knowing user behavior is important
        - e.g. items frequently found to be bought together ⇒ placed further apart in the store
- scientists
    - data-intensive eScience - modern experiments generate large amounts of data
        - e.g. sequencing a genome
        - e.g. the Large Hadron Collider - generates 1 PB/experiment
- people
    - for their own purposes
    - e.g. social media
We need to be able to process this data.
- Vertical scaling ⇒ use more RAM, disk, CPU (i.e. use a better computer)
    - this is expensive and limited
- Horizontal scaling ⇒ use more computers and distribute the computation
    - parallelization is hard to get right
    - often done in a data center - need fault tolerance and minimal communication between machines
    - preferred over vertical scaling - more cost-efficient, since hardware costs grow faster than linearly with performance
    - a cluster of low-end servers approaches the performance of an equivalently priced cluster of high-end servers
Rather than designing a distributed system ourselves each time, we use an existing one.
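A toy sketch of the horizontal-scaling idea on a single machine: splitting a word count across worker processes and merging the partial results. This is an illustration only (real clusters add fault tolerance and network communication on top); the function names are made up for this sketch:

```python
# Toy illustration of horizontal scaling: split a word count across
# several worker processes instead of using one bigger machine.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    # Each "node" counts words in its own shard of the data.
    return Counter(chunk.split())

def distributed_word_count(documents, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(count_words, documents)
    # Merging partial results is the step that costs communication in a
    # real cluster, which is why minimizing communication matters.
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    docs = ["big data big compute", "big cluster", "data data"]
    print(distributed_word_count(docs))
```

Each worker is independent, so adding more documents only requires adding more workers - the horizontal-scaling property the notes describe.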