Similarity and MinHashing

Many problems involve finding a “similar set”:

e.g. Near Neighbors
- Given a set $S$ of objects, distance $D(a, b)$ from object $a$ to $b$, and max distance threshold $t$, we want to find all unordered pairs $(a, b)$ such that $D(a, b) \leq t$
e.g. Scene Completion
- Given an incomplete image (i.e. portion is missing), can we “repair” it with similar images?
- Want to find images with similar features
e.g. Pages with similar words
e.g. Customers who purchased similar products
- Find products with similar customer sets

Given high-dimensional datapoints $x_1, x_2, ..., x_n$ and some distance function $d(x_1, x_2)$:

Want to find all pairs of datapoints $(x_i, x_j)$ such that $d(x_i, x_j) < s$
Naïve approach of computing all pairs of distances is $O(n^2)$
- Not scalable
It is possible to achieve this in $O(n)$
Core technique used is applicable to machine learning
- Can create signatures for objects

Recall that we can find identical items in a collection of $n$ via hash table

Locality Sensitive Hashing (LSH)

Suppose that for objects $X, X'$, we have that $d(X, X') = c$.

Given $X, h(X)$: