Unsupervised learning is the second most widely used form of machine learning after supervised learning. The key difference: data has no output labels (Y).
| Supervised Learning | Unsupervised Learning |
|---|---|
| Data includes labels (X → Y) | Data has no labels (just X) |
| Algorithm learns "right answers" | Algorithm finds patterns on its own |
| Example: tumor size + diagnosis | Example: tumor size + age, no diagnosis |
Instead of predicting a specific output, unsupervised learning finds structure, patterns, or something interesting in the data without being told what to look for.
Clustering is a type of unsupervised learning that groups unlabeled data into clusters based on similarities the algorithm discovers itself.
Google News processes hundreds of thousands of articles daily and groups related stories together automatically.
For a story about panda twins born at a zoo, the algorithm noticed articles sharing words like "panda," "twin," and "zoo" and clustered them together. No human tells the algorithm which words matter, it figures this out on its own. This would be impossible to do manually given how many stories and topics exist each day.
DNA microarray data shows gene expression levels across many individuals:
A clustering algorithm can group people into types (Type 1, Type 2, Type 3) based on genetic similarities. Researchers don't define the types in advance, the algorithm discovers them.
Groups similar data points together.
Identifies unusual events or outliers in data.