<aside> ⚠️ This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
</aside>
<aside> 🚨 I've noticed that taking notes on this site while reading the book significantly extends the time it takes to finish. So I've stopped noting everything as in previous chapters, and instead continue reading while highlighting/hand-writing notes. I plan to return to the detailed style when I have more time.
</aside>
<aside> ✊ This book contains 1007 pages of readable content. If you read at a pace of 10 pages per day, it will take you approximately 3.3 months (without missing a day) to finish it. If you aim to complete it in 2 months, you'll need to read at least 17 pages per day.
</aside>
<aside> 📔 Jupyter notebook for this chapter: on Github, on Colab, on Kaggle.
</aside>
We use the MNIST dataset in this chapter = 70K small images of handwritten digits. ← the “Hello world” of ML.
Download from OpenML.org. ← use [sklearn.datasets.fetch_openml](<https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html>)
```python
from sklearn.datasets import fetch_openml

# data contains images -> a dataframe isn't suitable, so as_frame=False
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
```
[sklearn.datasets](<https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets>) contains 3 types of functions:

- fetch_* functions such as fetch_openml() ← to download real-life datasets.
- load_* functions ← to load small toy datasets (bundled with scikit-learn, no download needed).
- make_* functions ← to generate fake datasets.

70K images, 784 features. Each image = 28x28 pixels (28 × 28 = 784).
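For example, the three families can be tried side by side (load_iris and make_moons are just illustrative picks; fetch_* is skipped here since it downloads data):

```python
from sklearn.datasets import load_iris, make_moons

# load_* : a small toy dataset shipped with scikit-learn (no download)
iris = load_iris()
print(iris.data.shape)  # (150, 4)

# make_* : generate a synthetic (fake) dataset
X_fake, y_fake = make_moons(n_samples=200, noise=0.1, random_state=42)
print(X_fake.shape)  # (200, 2)
```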
Plot an image:

```python
import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
```
```python
y[0]  # '5' — labels are stored as strings
```
MNIST from fetch_openml()
is already split into a training set (first 60K, already shuffled) and test set (last 10K).
```python
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
```
The training set is already shuffled ← good for cross-validation (all folds will be similar).
Let’s simplify the problem - “detect only the number 5” ← binary classifier (2 classes, 5 or non-5).
A good classifier to start with is the stochastic gradient descent (SGD, or stochastic GD) classifier ← [SGDClassifier](<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html>)
← deals with training instances independently, one at a time ← handles large datasets efficiently, well suited for online learning.
```python
from sklearn.linear_model import SGDClassifier

y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])  # array([ True]) — correctly detects the 5
```
Evaluating a classifier is often significantly trickier than evaluating a regressor!
Use [cross_val_score()](<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>)
← uses k-fold cross-validation.
```python
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# ~95% accuracy on each of the 3 folds
```
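Under the hood, cross_val_score with cv=3 does roughly the following (a simplified sketch using StratifiedKFold and clone; it runs on small synthetic data here so it is self-contained, the variable names are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for (X_train, y_train_5): a learnable binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=600) > 0

clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3)  # preserves class ratios in each fold

accuracies = []
for train_idx, test_idx in skfolds.split(X_demo, y_demo):
    clone_clf = clone(clf)  # fresh, unfitted copy for each fold
    clone_clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = clone_clf.predict(X_demo[test_idx])
    accuracies.append((y_pred == y_demo[test_idx]).mean())

print(accuracies)  # one accuracy score per fold
```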
Wow, we get ~95% accuracy with SGD, but is that good? → Let’s try [DummyClassifier](<https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html>)
← it classifies every single image into the most frequent class (non-5); evaluate it with cross_val_score
→ over 90% accuracy! Why? Because only about 10% of the images are 5s! ← If you always guess that an image is not a 5, you’re right about 90% of the time!
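The baseline can be reproduced on any similarly imbalanced data (a sketch on synthetic labels with ~10% positives, mirroring the 5 / non-5 ratio; the data here is made up, not MNIST):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the MNIST labels: ~10% positives ("5s").
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 10))
y_demo = rng.random(1000) < 0.1

# Always predicts the most frequent class (here: non-5).
dummy_clf = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy_clf, X_demo, y_demo, cv=3, scoring="accuracy")
print(scores)  # roughly 0.90 per fold — accuracy alone is misleading on skewed data
```

This is why accuracy is generally not the preferred metric for classifiers on skewed datasets.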