The first half of the final project will be:

Guidelines

How big should my data be?

Since the topic of this course is machine learning and data science at scale, you are expected to obtain a dataset that either does not fit into memory on a single machine or requires specialized hardware (e.g., GPUs) to compute on. Anything smaller than ~30GB is probably too small.
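As a rough way to check a candidate dataset against this rule of thumb, the sketch below sums the file sizes under a directory and compares the total to 30GB. The function names, the use of decimal gigabytes (10^9 bytes), and the idea of measuring on-disk size are assumptions made for illustration. Compressed files will usually be larger once loaded into memory, so treat the result as a lower bound.

```python
import os

GB = 10**9  # assumption: decimal gigabytes; the guideline does not say GB vs GiB
MIN_DATASET_BYTES = 30 * GB  # the ~30GB rule of thumb from the guideline


def dataset_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def is_big_enough(size_bytes, threshold=MIN_DATASET_BYTES):
    """Return True if size_bytes meets or exceeds the threshold."""
    return size_bytes >= threshold
```

For example, `is_big_enough(dataset_size_bytes("path/to/dataset"))` (with a hypothetical path) reports whether the on-disk size clears the threshold. A dataset that falls short may still qualify if it needs GPUs or similar hardware to process.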

Datasets

Other lists of datasets: