The first half of the final project will be:
- (Project A): ML Pipeline at Scale
  - Select a dataset
  - Perform all steps of the ML pipeline
  - Visualize your results – we expect an answer to an interesting question you set out to solve!
  - Pick at least 3 metrics that you will track:
    - accuracy, time, computation, communication, storage, ease-of-use, interpretability, …
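As a rough illustration of what tracking three metrics might look like, the sketch below records accuracy, wall-clock time, and storage for a toy majority-class baseline. The function name `track_metrics`, the toy labels, and the use of pickled size as a stand-in for storage are all assumptions made for this example, not project requirements.

```python
import pickle
import time
from collections import Counter


def track_metrics(train_labels, test_labels):
    """Fit a toy majority-class baseline and report three metrics."""
    start = time.perf_counter()
    # "Training": pick the most common label in the training set.
    majority_label = Counter(train_labels).most_common(1)[0][0]
    # "Inference": predict that label for every test example.
    predictions = [majority_label] * len(test_labels)
    elapsed_seconds = time.perf_counter() - start

    correct = sum(p == y for p, y in zip(predictions, test_labels))
    return {
        "accuracy": correct / len(test_labels),
        "time_seconds": elapsed_seconds,
        # Assumption: serialized model size approximates storage cost.
        "storage_bytes": len(pickle.dumps(majority_label)),
    }


metrics = track_metrics(train_labels=[0, 0, 1, 0], test_labels=[0, 1, 0, 0])
print(metrics)
```

In a real pipeline the same dictionary could be logged once per stage or per configuration, so that the trade-offs between metrics can be plotted at the end.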
Guidelines
- You must work in groups of 2-3 people.
- You must get approval to begin your project after you submit the project proposal (the project proposal serves as a sanity check to ensure that you have chosen a suitable project).
- You cannot use any late days for the project deliverables.
How big should my data be?
Since the topic of this course is machine learning and data science at scale, you will be expected to obtain a dataset that does not fit into memory on a single machine, or that requires some kind of interesting architecture (e.g., GPUs) to compute on. Anything less than ~30 GB is probably too small.
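One quick way to sanity-check a candidate dataset against the ~30 GB guideline is to total the file sizes under its directory. The sketch below is only an illustration: the helper names `dataset_size_bytes` and `is_large_enough` are hypothetical, and it assumes decimal gigabytes (10^9 bytes) and on-disk size, which can differ from in-memory size for compressed data.

```python
import os

# Assumption: "~30 GB" is read as 30 * 10^9 bytes of on-disk data.
MIN_DATASET_BYTES = 30 * 10**9


def dataset_size_bytes(root):
    """Sum the sizes of all regular files under the directory `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def is_large_enough(size_bytes, min_bytes=MIN_DATASET_BYTES):
    """Return True if the dataset meets the size guideline."""
    return size_bytes >= min_bytes


print(is_large_enough(50 * 10**9))  # a 50 GB dataset
print(is_large_enough(1 * 10**9))   # a 1 GB dataset
```

A size check like this is only a first filter; a smaller dataset may still qualify if it needs specialized hardware such as GPUs to process.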
Datasets
Other lists of datasets: