An extremely high volume of data is produced daily and needs to be stored.
- the only reason to delete data is if the cost of keeping it is too high
The price of storage has been decreasing over time.
Idea of data scale:
- Facebook generates 4 PB/day (4 million GB)
- 500 million new tweets per day (around 60 GB just for text)
- 720,000 hours of new YouTube videos per day
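A quick back-of-envelope check of the tweet figure above. The ~120 bytes of text per tweet is an assumed average for illustration, not a figure from the source:

```python
# Back-of-envelope: daily text volume of 500 million tweets.
# The 120 bytes/tweet average is an assumption for illustration.
tweets_per_day = 500_000_000
avg_bytes_per_tweet = 120  # assumed average text size

total_bytes = tweets_per_day * avg_bytes_per_tweet
total_gb = total_bytes / 10**9
print(f"{total_gb:.0f} GB/day")  # 60 GB/day, matching the estimate above
```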
Main sources of data:
- businesses
    - mainly needed for data-driven decision making and product design, as well as targeted advertising
    - business intelligence - encompasses data warehousing, data mining, analytics
    - knowing user behavior is important
        - e.g. items frequently found to be bought together ⇒ placed further apart in the store
- scientists
    - data-intensive eScience - modern experiments generate large amounts of data
        - e.g. sequencing a genome
        - e.g. the Large Hadron Collider - generates 1 PB/experiment
- people
    - for their own purposes
    - e.g. social media
We need to be able to process this data.
- Vertical scaling ⇒ use more RAM, disk, CPU (i.e. use a better computer)
    - this is expensive and limited
- Horizontal scaling ⇒ use more computers and distribute the computation
    - parallelization is hard to get right
    - often done in a data center - need fault tolerance and minimal communication between machines
    - preferred over vertical scaling - more cost-efficient, since hardware costs grow faster than linearly with performance
    - a cluster of low-end servers approaches the performance of an equivalently priced cluster of high-end servers
Rather than designing a distributed system ourselves each time, we use an existing one.
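A toy sketch of the horizontal-scaling idea on a single machine: splitting a word count across worker processes and merging the partial results. This is an illustration only (real clusters add fault tolerance and network communication on top); the function names are made up for this sketch:

```python
# Toy illustration of horizontal scaling: split a word count across
# several worker processes instead of using one bigger machine.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    # Each "node" counts words in its own shard of the data.
    return Counter(chunk.split())

def distributed_word_count(documents, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(count_words, documents)
    # Merging partial results is the step that costs communication in a
    # real cluster, which is why minimizing communication matters.
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    docs = ["big data big compute", "big cluster", "data data"]
    print(distributed_word_count(docs))
```

Each worker is independent, so adding more documents only requires adding more workers - the horizontal-scaling property the notes describe.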