<aside> ⚠️ This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
</aside>
<aside> 🚨 I've noticed that taking notes on this site while reading significantly extends the time it takes to finish the book. I've stopped noting everything as I did in previous chapters, and now keep reading with highlights and handwritten notes instead. I plan to return to the detailed style when I have more time.
</aside>
<aside> This book contains 1007 pages of readable content. If you read at a pace of 10 pages per day, it will take you approximately 3.3 months (without missing a day) to finish it. If you aim to complete it in 2 months, you'll need to read at least 17 pages per day.
</aside>
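The reading-pace figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the 30.4-day average month and the function names are my own assumptions, not something from the book.

```python
import math

TOTAL_PAGES = 1007      # readable pages, as stated in the note above
DAYS_PER_MONTH = 30.4   # assumption: average month length in days


def months_to_finish(pages_per_day: float) -> float:
    """Months needed to read the whole book at a fixed daily pace."""
    return TOTAL_PAGES / pages_per_day / DAYS_PER_MONTH


def pages_per_day_for(months: float) -> int:
    """Minimum whole number of pages per day needed to finish within `months`."""
    return math.ceil(TOTAL_PAGES / (months * DAYS_PER_MONTH))


print(round(months_to_finish(10), 1))  # 3.3
print(pages_per_day_for(2))            # 17
```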
<aside> Jupyter notebook for this chapter: on GitHub, on Colab, on Kaggle.
</aside>
In this chapter you will work through an example project end to end.
The project uses the California Housing Prices dataset (which can also be downloaded from the author's repository).
Fig 2-1. California housing prices
This data includes metrics such as the population, median income, and median housing price for each block group (called "district" for short).
Your model should learn from this data → predict the median housing price in any district.
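To make "learn from this data → predict the median housing price" concrete, here is a small sketch that separates each district into inputs (features) and the value to predict (label). The rows are made up for illustration, and the column names (`population`, `median_income`, `median_house_value`) are assumed rather than taken from this note.

```python
# Hypothetical district records; the values are invented for illustration only.
districts = [
    {"population": 1200, "median_income": 3.5, "median_house_value": 210000.0},
    {"population": 800, "median_income": 5.1, "median_house_value": 340000.0},
    {"population": 2500, "median_income": 2.2, "median_house_value": 120000.0},
]

TARGET = "median_house_value"  # assumed column name for the value to predict


def split_features_and_label(rows, target=TARGET):
    """Separate each row into model inputs (features) and the label."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


X, y = split_features_and_label(districts)
print(X[0])  # {'population': 1200, 'median_income': 3.5}
print(y)     # [210000.0, 340000.0, 120000.0]
```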
<aside> Pull out the ML project checklist (Appendix A in the book) for each project.
</aside>
Frame the problem by asking questions; the answers point to the methods to use.
Question: What exactly is the business objective? (Building a model isn't the final goal.) → Business objective: decide whether it's worth investing in a given area.
Fig 2-2. A machine learning pipeline for real estate investments
Question: What does the current solution look like (if any)? → It gives a reference for performance → Prices are currently estimated manually by experts → Their estimates were off by more than 30%.
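A figure like "off by more than 30%" is a percentage error, and it is the baseline a model would have to beat. Here is a minimal sketch of how such an error could be measured. The prices are invented for illustration and `percent_errors` is a hypothetical helper, not something from the book.

```python
def percent_errors(estimates, actuals):
    """Absolute error of each estimate, as a percentage of the actual value."""
    return [abs(e - a) * 100 / a for e, a in zip(estimates, actuals)]


# Made-up numbers: expert estimates versus the prices later observed.
expert_estimates = [130000.0, 280000.0, 400000.0]
actual_prices = [200000.0, 250000.0, 410000.0]

errors = percent_errors(expert_estimates, actual_prices)
print([round(err, 1) for err in errors])  # [35.0, 12.0, 2.4]
print(max(errors) > 30)                   # True: at least one estimate is off by more than 30%
```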
<aside> Pipeline = a sequence of data processing components (called a data pipeline). Each component can be handled by a separate team, which helps keep the whole process robust.
</aside>
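The idea of a pipeline as a sequence of components can be sketched with plain functions, each taking the previous component's output. The component names, the `households` field, and the sample rows are all hypothetical. In a real system each component would typically be a separate process reading from and writing to a data store.

```python
def drop_incomplete(rows):
    """Component 1: discard rows that have a missing value."""
    return [row for row in rows if None not in row.values()]


def add_people_per_household(rows):
    """Component 2: derive a new attribute from two existing ones."""
    return [
        {**row, "people_per_household": row["population"] / row["households"]}
        for row in rows
    ]


def run_pipeline(data, components):
    """Feed the data through each component in order."""
    for component in components:
        data = component(data)
    return data


# Made-up input rows for illustration.
raw = [
    {"population": 1000, "households": 400},
    {"population": None, "households": 300},
    {"population": 900, "households": 300},
]

result = run_pipeline(raw, [drop_incomplete, add_people_per_household])
print([row["people_per_household"] for row in result])  # [2.5, 3.0]
```

Because each component only depends on the shape of its input, one can be replaced or fixed without touching the others.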
Question: What kind of training supervision will the model need (supervised, unsupervised, semi-supervised, self-supervised, or reinforcement)? Is it a classification task, a regression task, or something else? Should it use batch learning or online learning?
<aside> If the data were huge → split the batch learning work across multiple servers (using the MapReduce technique) or use online learning instead.
</aside>
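As a rough illustration of the MapReduce idea, the sketch below fits a one-feature linear regression by letting each shard of data compute partial sums (map) and then adding those sums together (reduce). Everything here is an assumption for illustration: the shards are tiny made-up lists, the function names are hypothetical, and the map step runs sequentially instead of on separate servers.

```python
from functools import reduce


def map_shard(shard):
    """Map step: summarise one shard of (x, y) pairs as partial sums."""
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    return (n, sx, sy, sxx, sxy)


def reduce_stats(a, b):
    """Reduce step: combine the partial sums of two shards."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(stats):
    """Least-squares slope and intercept from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Made-up shards, as if stored on two servers; the points lie on y = 2x + 1.
shards = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

total = reduce(reduce_stats, map(map_shard, shards))
slope, intercept = fit_line(total)
print(slope, intercept)  # 2.0 1.0
```

Only the five partial sums travel between machines, so the full dataset never has to fit on a single server.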