Given a set of points with a notion of distance between the points, we want to group the points into some number of clusters such that:
- Members of a cluster are close/similar to each other
- Members of different clusters are dissimilar
Typically:
- Points are in some high-dimensional space
- Similarity is defined with some distance measure
- e.g. Euclidean distance, cosine distance, Jaccard distance, edit distance, etc.
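As a quick illustration, the three most common measures can be computed with a few lines of standard-library Python. The points and sets below are made up for the example:

```python
import math

# Toy points, chosen for illustration; q is 2*p, so they point the same direction.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

def euclidean(a, b):
    # Straight-line distance in k-dimensional space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction, ignores magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def jaccard_distance(a, b):
    # For sets: 1 - |intersection| / |union|.
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

print(euclidean(p, q))                         # sqrt(14) ≈ 3.742
print(cosine_distance(p, q))                   # 0.0 — same direction
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
```

Note that cosine distance is 0 here even though the Euclidean distance is not: which measure is "right" depends on what similarity means for the application.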

Note that clustering is an unsupervised ML method:
- No labels on the data (we are not told in advance which points belong together)
Clustering is a hard problem due to dimensionality ("the curse of dimensionality").
- in 2D - problem looks easy
- Clustering small amounts of data is easy
- Many applications involve many dimensions (e.g. 10-10,000 dimensions)
- High dimensional spaces look different
- e.g. in higher dimensions, almost all pairs of points are at about the same distance
- e.g. Clustering galaxies
- Given a catalog of 2 billion “sky objects” - represent objects by their radiation in 7 dimensions
- Want to cluster into similar objects (e.g. galaxies, nearby stars, quasars, etc.)
- e.g. Music CDs
- Want to categorize music CDs (clusters of similar CDs)
- Can represent CDs as high-dimensional objects (values in dimension are 0/1)
- e.g. based on customers, we have $(x_1, x_2, ..., x_k)$, where $x_i = 1$ iff the $i$th customer bought the CD
- e.g. Clustering Documents
- We represent documents by vector $(x_1, x_2, ..., x_k)$ where $x_i = 1$ iff the $i$th word (in some order) appears in the document
- It does not matter if $k$ is essentially unbounded (no need to limit the set of words)
- Idea - documents with similar sets of words may be about the same topic
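The claim that "almost all pairs of points are at about the same distance" in high dimensions can be checked empirically. The sketch below (a made-up experiment, not from the notes) samples random points in the unit cube and measures how much pairwise Euclidean distances vary relative to their mean:

```python
import math
import random

random.seed(0)

def pairwise_distance_stats(dim, n_points=100):
    # Sample n_points uniformly in the dim-dimensional unit cube,
    # then compute all pairwise Euclidean distances.
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    # Relative spread: standard deviation as a fraction of the mean distance.
    return mean, math.sqrt(var) / mean

for dim in (2, 10, 1000):
    mean, rel_spread = pairwise_distance_stats(dim)
    print(f"dim={dim:5d}  mean distance={mean:.2f}  relative spread={rel_spread:.3f}")
```

The relative spread shrinks as the dimension grows: in 1000 dimensions the distances concentrate tightly around their mean, so "near" and "far" become much less informative than they are in 2D.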
We must quantify the distance between objects.
- e.g. with documents - can treat a document as its set of words or n-grams, i.e. compare documents as sets
- Can measure similarity with Jaccard distance
- Can measure similarity with cosine distance
- Can measure similarity by Euclidean distance
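Putting the document representation and the Jaccard distance together, a minimal sketch (with made-up document texts) looks like this:

```python
# Each document is reduced to its set of words; documents with similar
# word sets get a small Jaccard distance. Texts below are invented examples.
docs = {
    "d1": "clustering groups similar points into clusters",
    "d2": "clustering puts similar points in the same cluster",
    "d3": "quasars are distant sky objects",
}

def word_set(text):
    return set(text.lower().split())

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

sets_ = {name: word_set(text) for name, text in docs.items()}
names = list(sets_)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(jaccard_distance(sets_[a], sets_[b]), 2))
```

Here d1 and d2 share words like "clustering", "similar", and "points", so their distance is well below 1, while d1 and d3 share no words and are at distance 1.0 - matching the intuition that documents with similar word sets may be about the same topic.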