Project Part 1 Guidelines (Spring 2021)

The first half of the final project will be:

(Project A): ML Pipeline at Scale
- Select a dataset
- Perform all steps of the ML pipeline
- Visualize your results – we expect an answer to an interesting question you set out to solve!
- Pick at least 3 metrics that you will track:
  - accuracy, time, computation, communication, storage, ease-of-use, interpretability, …
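
One way to track a few of these metrics is to wrap each pipeline stage in a small timer. The sketch below is illustrative only. The helper names `track` and `accuracy` and the stage name `toy_stage` are hypothetical and not part of the course materials. Note that `tracemalloc` reports only memory allocated by the Python interpreter, so it would under-count memory used by native libraries or by remote workers.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track(metrics, stage):
    """Record wall-clock seconds and peak Python heap usage for one stage."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        metrics[stage] = {"seconds": elapsed, "peak_bytes": peak}


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy usage: time a stand-in stage, then attach an accuracy figure to it.
metrics = {}
with track(metrics, "toy_stage"):
    squares = [i * i for i in range(100_000)]
metrics["toy_stage"]["accuracy"] = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Collecting the results in one dictionary keyed by stage makes it straightforward to tabulate or plot the tracked metrics later.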

Guidelines

You must work in groups of 2-3 people.
After you submit the project proposal, you must get approval before you begin your project (the proposal serves as a sanity check to ensure that you have chosen a suitable project).
You cannot use any late days for the project deliverables.

How big should my data be?

Since the topic of this course is machine learning and data science at scale, you will be expected to obtain a dataset that either does not fit into memory on a single machine or requires some kind of interesting architecture (e.g., GPUs) to compute on. Anything less than ~30GB is probably too small.
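
As a quick sanity check on a candidate dataset, you could total its on-disk size and compare it against the rough ~30GB floor. This is a minimal sketch under stated assumptions: the function names are hypothetical, 1GB is taken as 10^9 bytes, and compressed files are counted at their compressed size, so the in-memory footprint may be much larger.

```python
import os

GB = 10**9  # assumption: decimal gigabytes
MIN_DATASET_BYTES = 30 * GB  # the rough ~30GB floor from the guideline


def dataset_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def is_large_enough(size_bytes, minimum=MIN_DATASET_BYTES):
    """Return True when the dataset meets the minimum size."""
    return size_bytes >= minimum
```

For example, `is_large_enough(dataset_size_bytes("/path/to/data"))` would report whether a local copy clears the threshold, where the path is a placeholder for wherever you stored the data.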

Datasets

Complete Public Reddit Comments Corpus (150GB compressed)
Page view statistics for Wikimedia projects (~2GB/day compressed)
Stack Exchange Data Dump (25GB)
Enron Emails (154.1GB)
Million Song Dataset (199GB)
Project Gutenberg: the text of over 42,000 free ebooks (742GB)
Dark Net Market archives, 2011-2015 (52GB compressed)

Other lists of datasets: