What is Unsupervised Learning?

Unsupervised learning is the second most widely used form of machine learning after supervised learning. The key difference: data has no output labels (Y).

Supervised Learning Unsupervised Learning
Data includes labels (X → Y) Data has no labels (just X)
Algorithm learns "right answers" Algorithm finds patterns on its own
Example: tumor size + diagnosis Example: tumor size + age, no diagnosis

Instead of predicting a specific output, unsupervised learning finds structure, patterns, or something interesting in the data without being told what to look for.

Clustering

Clustering is a type of unsupervised learning that groups unlabeled data into clusters based on similarities the algorithm discovers itself.

Application 1: Google News

Google News processes hundreds of thousands of articles daily and groups related stories together automatically.

For a story about panda twins born at a zoo, the algorithm noticed articles sharing words like "panda," "twin," and "zoo" and clustered them together. No human tells the algorithm which words matter, it figures this out on its own. This would be impossible to do manually given how many stories and topics exist each day.

Application 2: DNA/Genetic Analysis

DNA microarray data shows gene expression levels across many individuals:

A clustering algorithm can group people into types (Type 1, Type 2, Type 3) based on genetic similarities. Researchers don't define the types in advance, the algorithm discovers them.

Three Types of Unsupervised Learning

1. Clustering

Groups similar data points together.

2. Anomaly Detection

Identifies unusual events or outliers in data.