Given a set of points with a notion of distance between the points, we want to group the points into some number of clusters such that:
- Members of a cluster are close/similar to each other
- Members of different clusters are dissimilar
Typically:
- Points are in some high-dimensional space
- Similarity is defined with some distance measure
- e.g. Euclidean distance, cosine distance, Jaccard distance, edit distance, etc.
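As a quick illustration, the three most common measures can be computed with a few lines of standard-library Python. The points and sets below are made up for the example:

```python
import math

# Toy points, chosen for illustration; q is 2*p, so they point the same direction.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

def euclidean(a, b):
    # Straight-line distance in k-dimensional space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction, ignores magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def jaccard_distance(a, b):
    # For sets: 1 - |intersection| / |union|.
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

print(euclidean(p, q))                         # sqrt(14) ≈ 3.742
print(cosine_distance(p, q))                   # 0.0 — same direction
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
```

Note that cosine distance is 0 here even though the Euclidean distance is not: which measure is "right" depends on what similarity means for the application.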

Note that clustering is an unsupervised ML method:
- No labels on the data (we are not told in advance which points belong together)
Clustering is a hard problem due to dimensionality ("the curse of dimensionality").
- in 2D - problem looks easy
- Clustering small amounts of data is easy
- Many applications involve many dimensions (e.g. 10-10,000 dimensions)
- High dimensional spaces look different
- e.g. in higher dimensions, almost all pairs of points are at about the same distance
- e.g. Clustering galaxies
- Given a catalog of 2 billion “sky objects” - represent objects by their radiation in 7 dimensions
- Want to cluster into similar objects (e.g. galaxies, nearby stars, quasars, etc.)
- e.g. Music CDs
- Want to categorize music CDs (clusters of similar CDs)
- Can represent CDs as high-dimensional objects (values in dimension are 0/1)
- e.g. based on customers, we have $(x_1, x_2, ..., x_k)$, where $x_i = 1$ iff the $i$th customer bought the CD
- e.g. Clustering Documents
- We represent documents by vector $(x_1, x_2, ..., x_k)$ where $x_i = 1$ iff the $i$th word (in some order) appears in the document
- It does not matter if $k$ is essentially unbounded (no need to limit the set of words)
- Idea - documents with similar sets of words may be about the same topic
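The claim that "almost all pairs of points are at about the same distance" in high dimensions can be checked empirically. The sketch below (a made-up experiment, not from the notes) samples random points in the unit cube and measures how much pairwise Euclidean distances vary relative to their mean:

```python
import math
import random

random.seed(0)

def pairwise_distance_stats(dim, n_points=100):
    # Sample n_points uniformly in the dim-dimensional unit cube,
    # then compute all pairwise Euclidean distances.
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    # Relative spread: standard deviation as a fraction of the mean distance.
    return mean, math.sqrt(var) / mean

for dim in (2, 10, 1000):
    mean, rel_spread = pairwise_distance_stats(dim)
    print(f"dim={dim:5d}  mean distance={mean:.2f}  relative spread={rel_spread:.3f}")
```

The relative spread shrinks as the dimension grows: in 1000 dimensions the distances concentrate tightly around their mean, so "near" and "far" become much less informative than they are in 2D.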
We must quantify the distance between objects.
- e.g. with documents - can treat a document as its set of words or n-grams, i.e. compare documents as sets
- Can measure similarity with Jaccard distance
- Can measure similarity with cosine distance
- Can measure similarity by Euclidean distance
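Putting the document representation and the Jaccard distance together, a minimal sketch (with made-up document texts) looks like this:

```python
# Each document is reduced to its set of words; documents with similar
# word sets get a small Jaccard distance. Texts below are invented examples.
docs = {
    "d1": "clustering groups similar points into clusters",
    "d2": "clustering puts similar points in the same cluster",
    "d3": "quasars are distant sky objects",
}

def word_set(text):
    return set(text.lower().split())

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

sets_ = {name: word_set(text) for name, text in docs.items()}
names = list(sets_)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(jaccard_distance(sets_[a], sets_[b]), 2))
```

Here d1 and d2 share words like "clustering", "similar", and "points", so their distance is well below 1, while d1 and d3 share no words and are at distance 1.0 - matching the intuition that documents with similar word sets may be about the same topic.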