Remember to standardize and visualize the data first to understand its patterns.
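A minimal standardization sketch using sklearn's StandardScaler (the toy data here is invented; the manual z-scoring in the Sklearn Code section below does the same thing, except StandardScaler divides by the population std, i.e. ddof=0):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[158, 58], [160, 59], [163, 61]], dtype=float)  # toy (height, weight) rows
scaler = StandardScaler()           # z-score: subtract the mean, divide by the std
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance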

Hyperparameters: k (number of neighbors) and the distance metric.

Rule of thumb: choose $k=\sqrt{n}$, where $n$ is the number of training samples.

supp-k-nearest-neighbor-square-root-of-n-full.pdf
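A quick sketch of this heuristic, rounded to an odd integer so binary votes cannot tie (see the tie-breaking notes below):

import math

n = 18                   # number of training samples
k = round(math.sqrt(n))  # k ≈ sqrt(n) -> 4
if k % 2 == 0:
    k += 1               # prefer an odd k for binary classification -> 5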

Cons: works poorly with categorical attributes; struggles in high dimensions, where distances become less informative.

Speed up:

  1. Reduce the dimension of the training data (e.g. Principal Component Analysis); see the sketch below

supp-PCA-full.pdf
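A minimal sketch of PCA before k-NN, with invented toy data (the shapes and n_components=2 are arbitrary choices here):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))    # 100 samples, 10 features (toy data)
y_train = rng.integers(0, 2, size=100)

pca = PCA(n_components=2)               # project onto the top 2 principal components
X_low = pca.fit_transform(X_train)      # fit the projection on the training data only

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_low, y_train)
X_query = pca.transform(rng.normal(size=(1, 10)))  # apply the same projection to queries
print(model.predict(X_query))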

Tie breaking

  1. k-neighbor selection: if several training points are equidistant at the k-th position, the tie must be broken when choosing the neighbors
  2. majority voting
    1. binary classification: use an odd value of k
    2. randomly choose among the tied candidate classes
    3. weight votes by distance: ties become much less likely (see the sketch below)
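Distance-weighted voting is built into sklearn via weights='distance'; a minimal sketch with invented toy data:

from sklearn.neighbors import KNeighborsClassifier

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=4, weights='distance')  # closer neighbors get larger votes
model.fit(X, y)
print(model.predict([[0.9]]))  # -> [0]: the two nearby class-0 points outweigh the farther class-1 points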

Sklearn Code

import numpy as np
import sklearn.preprocessing
import sklearn.neighbors

height = np.array([158, 158, 158, 160, 160, 163, 163, 160, 163, 165, 165, 165, 168, 168, 168, 170, 170, 170])
h_mean = np.mean(height); h_std = np.std(height, ddof=1)  # ddof=1: sample standard deviation
height = (height - h_mean) / h_std
weight = np.array([58, 59, 63, 59, 60, 60, 61, 64, 64, 61, 62, 65, 62, 63, 66, 63, 64, 68])
w_mean = np.mean(weight); w_std = np.std(weight, ddof=1)
weight = (weight - w_mean) / w_std
size_label = np.array(['M','M','M','M','M','M','M','L','L','L','L','L','L','L','L','L','L','L'])

encoder = sklearn.preprocessing.LabelEncoder()
label = encoder.fit_transform(size_label)
features = list(zip(height, weight))

def predict(k):
  model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
  model.fit(features, label)  # features.shape=(n_samples, n_features), or (n_samples, n_samples) if metric='precomputed'
  # standardize the query with the same training statistics before predicting
  query = [[(153 - h_mean) / h_std, (58 - w_mean) / w_std]]
  predicted = model.predict(query)

  predicted = encoder.inverse_transform(predicted)
  return predicted
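Example usage (k=3 is an arbitrary choice):

print(predict(3))  # -> ['M'] for the standardized (153, 58) query above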

Vectorization Code

$\text{dists}_{i,j}=\sqrt{\sum_k (x_{i,k}-x'_{j,k})^2},\quad \mathbf{x}_i \in \text{Test},\ \mathbf{x}'_j \in \text{Train}$

The code vectorizes this via the expansion $\|\mathbf{x}_i-\mathbf{x}'_j\|^2 = \|\mathbf{x}_i\|^2 + \|\mathbf{x}'_j\|^2 - 2\,\mathbf{x}_i\cdot\mathbf{x}'_j$, which avoids an explicit loop over all test-train pairs.

# distances: all test-train pairs at once (CS231n-style class method; X is the test set)
dists = np.sqrt(
  np.linalg.norm(X, axis=1, keepdims=True)**2   # ||x_i||^2, shape (num_test, 1)
  + np.linalg.norm(self.X_train, axis=1)**2     # ||x'_j||^2, shape (num_train,)
  - 2 * X @ self.X_train.T)                     # x_i . x'_j, shape (num_test, num_train)

# predict
closest_y = self.y_train[np.argsort(dists, axis=1)[:, :k]]  # labels of the k nearest neighbors
for i in np.arange(num_pred):
  y_pred[i] = np.argmax(np.bincount(closest_y[i]))  # majority vote
# bincount counts occurrences of each value in an array of non-negative ints
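The same idea as a self-contained function (the name knn_predict and the toy data are hypothetical, not from the original class):

import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    # pairwise Euclidean distances via the expansion above
    dists = np.sqrt(
        np.linalg.norm(X_test, axis=1, keepdims=True)**2
        + np.linalg.norm(X_train, axis=1)**2
        - 2 * X_test @ X_train.T)
    closest_y = y_train[np.argsort(dists, axis=1)[:, :k]]  # labels of the k nearest neighbors
    return np.array([np.argmax(np.bincount(row)) for row in closest_y])

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.5, 0.2]]), k=3))  # -> [0]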

Demo using a Voronoi diagram (the 1-NN decision boundary partitions the plane into Voronoi cells)

http://vision.stanford.edu/teaching/cs231n-demos/knn/

Similarity search (nearest-neighbor retrieval at scale)

GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.
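A minimal faiss sketch, assuming faiss is installed (IndexFlatL2 is exact brute-force L2 search; the random data is invented):

import numpy as np
import faiss

d = 64                                              # vector dimension
xb = np.random.random((1000, d)).astype('float32')  # database vectors
xq = np.random.random((5, d)).astype('float32')     # query vectors

index = faiss.IndexFlatL2(d)  # exact L2 (brute-force) index
index.add(xb)                 # index the database vectors
D, I = index.search(xq, 4)    # distances and ids of the 4 nearest neighbors per query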