<aside> ⚠️ This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
</aside>
<aside> 🚨 I've noticed that taking notes on this site while reading the book significantly extends the time it takes to finish. So I've stopped noting everything as in previous chapters, and instead continue reading while highlighting/hand-writing notes. I plan to return to the detailed style when I have more time.
</aside>
<aside> ✊ This book contains 1007 pages of readable content. If you read at a pace of 10 pages per day, it will take you approximately 3.3 months (without missing a day) to finish it. If you aim to complete it in 2 months, you'll need to read at least 17 pages per day.
</aside>
<aside> 📔 Jupyter notebook for this chapter: on Github, on Colab, on Kaggle.
</aside>
We use the MNIST dataset in this chapter = 70K small images of handwritten digits. ← the “Hello world” of ML.
Download from OpenML.org. ← use [sklearn.datasets.fetch_openml](<https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html>)
```python
from sklearn.datasets import fetch_openml

# data contains images -> a dataframe isn't suitable, so as_frame=False
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
```
[sklearn.datasets](<https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets>) contains 3 types of functions:

- fetch_* functions such as fetch_openml() ← to download real-life datasets.
- load_* functions ← to load small toy datasets (bundled with scikit-learn, no download needed).
- make_* functions ← to generate fake datasets.

70K images, 784 features. Each image = 28x28 pixels (28 × 28 = 784).
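For example, the three families can be tried side by side (load_iris and make_moons are just illustrative picks; fetch_* is skipped here since it downloads data):

```python
from sklearn.datasets import load_iris, make_moons

# load_* : a small toy dataset shipped with scikit-learn (no download)
iris = load_iris()
print(iris.data.shape)  # (150, 4)

# make_* : generate a synthetic (fake) dataset
X_fake, y_fake = make_moons(n_samples=200, noise=0.1, random_state=42)
print(X_fake.shape)  # (200, 2)
```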
Plot an image:

```python
import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
```
```python
y[0]  # '5' — labels are stored as strings
```
MNIST from fetch_openml()
is already split into a training set (first 60K, already shuffled) and test set (last 10K).
```python
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
```
The training set is already shuffled ← good for cross-validation (all folds will be similar).
Let’s simplify the problem - “detect only the number 5” ← binary classifier (2 classes, 5 or non-5).
A good classifier to start with is the stochastic gradient descent (SGD, or stochastic GD) classifier ← [SGDClassifier](<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html>)
← deals with training instances independently, one at a time ← handles large datasets efficiently, well suited for online learning.
```python
from sklearn.linear_model import SGDClassifier

y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])  # array([ True]) — correctly detects the 5
```
Evaluating a classifier is often significantly trickier than evaluating a regressor!
Use [cross_val_score()](<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>)
← uses k-fold cross-validation.
```python
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# ~95% accuracy on each of the 3 folds
```
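Under the hood, cross_val_score with cv=3 does roughly the following (a simplified sketch using StratifiedKFold and clone; it runs on small synthetic data here so it is self-contained, the variable names are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for (X_train, y_train_5): a learnable binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=600) > 0

clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3)  # preserves class ratios in each fold

accuracies = []
for train_idx, test_idx in skfolds.split(X_demo, y_demo):
    clone_clf = clone(clf)  # fresh, unfitted copy for each fold
    clone_clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = clone_clf.predict(X_demo[test_idx])
    accuracies.append((y_pred == y_demo[test_idx]).mean())

print(accuracies)  # one accuracy score per fold
```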
Wow, we get ~95% accuracy with SGD, but is that good? → Let’s try [DummyClassifier](<https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html>)
← it classifies every single image into the most frequent class (non-5); evaluate it with cross_val_score
→ over 90% accuracy! Why? Because only about 10% of the images are 5s! ← If you always guess that an image is not a 5, you’re right about 90% of the time!
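The baseline can be reproduced on any similarly imbalanced data (a sketch on synthetic labels with ~10% positives, mirroring the 5 / non-5 ratio; the data here is made up, not MNIST):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the MNIST labels: ~10% positives ("5s").
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 10))
y_demo = rng.random(1000) < 0.1

# Always predicts the most frequent class (here: non-5).
dummy_clf = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy_clf, X_demo, y_demo, cv=3, scoring="accuracy")
print(scores)  # roughly 0.90 per fold — accuracy alone is misleading on skewed data
```

This is why accuracy is generally not the preferred metric for classifiers on skewed datasets.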