Reading: Hands-On ML - Chap 2: End-to-End ML Project

👉 List of all notes for this book. IMPORTANT UPDATE November 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).

<aside> 📔 Jupyter notebook for this chapter: on GitHub, on Colab, on Kaggle.

</aside>

Main steps

In this chapter you will work through an example project end to end.

Steps

Working with Real Data

A few places to get data

In this chapter, we use the California Housing Prices dataset (or download it from the author’s repository).

Fig 2-1. California housing prices

This data includes metrics such as the population, median income, and median housing price for each block group (called a “district” for short).

Look at the Big Picture

Your model should learn from this data → predict the median housing price in any district.

<aside> ☝ You should pull out this ML project checklist (Appendix A in the book) for each project.

</aside>

Frame the Problem

Ask questions to find the methods.

Question: What exactly is the business objective? (Finding a model isn’t the final goal.) → Business objective: decide whether it’s worth investing in a given area.

Fig 2-2. A machine learning pipeline for real estate investments
Question: What does the current solution look like (if any)? ← a reference for performance → Prices are currently estimated manually by experts. ← Their estimates were off by more than 30%.
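
To make the "off by more than 30%" reference concrete, here is a minimal sketch of how such a baseline could be measured. The function name `mean_relative_error` and all the numbers are made up for illustration; they do not come from the book or the dataset.

```python
def mean_relative_error(estimates, actuals):
    """Average of |estimate - actual| / actual over all districts."""
    if len(estimates) != len(actuals):
        raise ValueError("estimates and actuals must have the same length")
    errors = [abs(e - a) / a for e, a in zip(estimates, actuals)]
    return sum(errors) / len(errors)


# Made-up numbers for illustration only (not taken from the dataset).
expert_estimates = [260_000, 90_000, 410_000]
actual_prices = [200_000, 120_000, 300_000]

baseline = mean_relative_error(expert_estimates, actual_prices)
print(f"Expert baseline error: {baseline:.1%}")
```

A model is only worth deploying if its error on held-out districts is clearly below this kind of baseline.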

<aside> ☝ Pipeline = a sequence of data processing components is called a data pipeline. Each component can be handled by a different team, and because components are fairly self-contained, the whole system is quite robust.

</aside>
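
A rough sketch of the pipeline idea, assuming each component is just a function that takes a list of district records and returns a new list. The component names (`drop_missing_income`, `add_rooms_per_household`) and the toy records are hypothetical, chosen only to show how self-contained steps chain together.

```python
def drop_missing_income(rows):
    """Component 1: discard districts with no median income recorded."""
    return [r for r in rows if r["median_income"] is not None]


def add_rooms_per_household(rows):
    """Component 2: derive a new feature from two existing columns."""
    return [
        {**r, "rooms_per_household": r["total_rooms"] / r["households"]}
        for r in rows
    ]


def run_pipeline(rows, components):
    """Feed the output of each component into the next one."""
    for component in components:
        rows = component(rows)
    return rows


# Toy records for illustration only (not real districts).
districts = [
    {"median_income": 3.5, "total_rooms": 1000, "households": 200},
    {"median_income": None, "total_rooms": 800, "households": 160},
]

prepared = run_pipeline(districts, [drop_missing_income, add_rooms_per_household])
print(prepared)
```

Since each component only depends on the shape of its input, one step can be replaced or fixed without touching the others.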

Question: What kind of training supervision will the model need (supervised, unsupervised, semi-supervised, self-supervised, or reinforcement)? Is it classification, regression, or something else? Should it use batch learning or online learning?
- supervised ← model trained with labeled examples.
- multiple regression ← predict a value, use multiple features.
- univariate regression ← predict a single value for each district. If we wanted to predict multiple values per district → multivariate regression.
- Batch learning ← no continuous flow of data, no need to adjust to changing data rapidly, and the data is small enough to fit in memory.
<aside> ☝ If the data were huge → split the batch learning work across multiple servers (using the MapReduce technique) or use online learning instead.

</aside>
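
The framing above (multiple features in, one value out, batch vs. online training) can be sketched with a tiny hand-written linear model. This is an illustration under assumptions, not the book's code: the `LinearRegressor` class, its method names, and the synthetic data (target = 2·x1 + 3·x2 + 1) are all made up for this note.

```python
class LinearRegressor:
    """Multiple, univariate regression: several features in, one value out."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def fit_batch(self, X, y, epochs=2000):
        """Batch learning: every step uses the gradient over the full dataset."""
        n = len(X)
        for _ in range(epochs):
            # Errors are computed once per epoch, before any weight changes.
            errors = [self.predict(row) - target for row, target in zip(X, y)]
            for j in range(len(self.weights)):
                grad_j = sum(e * row[j] for e, row in zip(errors, X)) / n
                self.weights[j] -= self.learning_rate * grad_j
            self.bias -= self.learning_rate * sum(errors) / n
        return self

    def partial_fit(self, features, target):
        """Online learning: update the model from a single new example."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
        return self


# Synthetic data for illustration: target = 2*x1 + 3*x2 + 1 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in X]

# Batch: the whole dataset is available up front.
batch_model = LinearRegressor(n_features=2).fit_batch(X, y)

# Online: examples arrive one at a time (here we replay the same data).
online_model = LinearRegressor(n_features=2)
for _ in range(500):
    for features, target in zip(X, y):
        online_model.partial_fit(features, target)

print(batch_model.predict([2, 2]), online_model.predict([2, 2]))
```

For the housing data, the batch path is the natural choice because the full dataset fits in memory; the online path only pays off when data keeps arriving or is too large to load at once.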