remember to standardize and visualize the features first to understand the data pattern
hyperparameters: k, distance metric
rule of thumb: choose $k=\sqrt{n}$, where $n$ is the number of training samples
supp-k-nearest-neighbor-square-root-of-n-full.pdf
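For example, for the 18-sample dataset below, $\sqrt{18}\approx 4.2$; rounding up to an odd $k=5$ also avoids voting ties in a two-class problem.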
Cons: works poorly with categorical attributes; struggles in high dimensions (curse of dimensionality).
Speed up: vectorize the distance computation (see below), or use a tree-based index such as a KD-tree or ball tree; a sketch of the latter follows.
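A minimal sketch of the tree-based option via scikit-learn's algorithm parameter (the n_neighbors value here is just an assumption):
from sklearn.neighbors import KNeighborsClassifier
# 'kd_tree' (or 'ball_tree') builds an index so queries avoid brute-force scans over all training points
fast_model = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')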
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier

height = np.array([158, 158, 158, 160, 160, 163, 163, 160, 163, 165, 165, 165, 168, 168, 168, 170, 170, 170])
h_mean = np.mean(height); h_std = np.std(height, ddof=1)  # ddof=1: sample standard deviation
height = (height - h_mean) / h_std
weight = np.array([58, 59, 63, 59, 60, 60, 61, 64, 64, 61, 62, 65, 62, 63, 66, 63, 64, 68])
w_mean = np.mean(weight); w_std = np.std(weight, ddof=1)
weight = (weight - w_mean) / w_std
size_label = np.array(['M','M','M','M','M','M','M','L','L','L','L','L','L','L','L','L','L','L'])
encoder = LabelEncoder()
label = encoder.fit_transform(size_label)
features = list(zip(height, weight))
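To see the pattern before fitting anything, a minimal plotting sketch (assumes matplotlib is available):
import matplotlib.pyplot as plt
# scatter the standardized features, colored by encoded class, to eyeball separability
plt.scatter(height, weight, c=label)
plt.xlabel('height (standardized)'); plt.ylabel('weight (standardized)')
plt.show()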
def predict(k):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(features, label)  # features.shape=(n_samples, n_features), or (n_samples, n_samples) if metric='precomputed'
    # the query must be standardized with the *training* statistics, same as the training data
    query = [[(153 - h_mean) / h_std, (58 - w_mean) / w_std]]
    predicted = model.predict(query)
    return encoder.inverse_transform(predicted)
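Usage sketch (the expected label is my reading of this dataset: the (153 cm, 58 kg) query sits below every training sample, so its nearest neighbors are all 'M' points):
print(predict(5))  # -> ['M']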
$\text{dists}_{i,j}=\sum_k (x_{i,k}-x'_{j,k})^2=\|\mathbf{x}_i\|^2+\|\mathbf{x}'_j\|^2-2\,\mathbf{x}_i\cdot\mathbf{x}'_j,\quad \mathbf{x}_i \in \text{Test},\ \mathbf{x}'_j \in \text{Train}$ (the code below takes the square root of this expansion to get Euclidean distances)
# compute all test-train distances at once via the expansion above (no Python loops)
dists = np.sqrt(
    np.linalg.norm(X, axis=1, keepdims=True)**2
    + np.linalg.norm(self.X_train, axis=1)**2
    - 2 * X @ self.X_train.T
)
# predict: take the labels of the k nearest training points, then majority-vote
closest_y = self.y_train[np.argsort(dists, axis=1)[:, :k]]
num_pred = dists.shape[0]
y_pred = np.zeros(num_pred, dtype=self.y_train.dtype)
for i in range(num_pred):
    y_pred[i] = np.argmax(np.bincount(closest_y[i]))
# bincount: Count number of occurrences of each value in array of non-negative ints.
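To illustrate the vote with hypothetical neighbor labels:
votes = np.array([1, 0, 1, 1, 0])        # encoded labels of the k=5 nearest neighbors (made-up values)
np.bincount(votes)                       # array([2, 3]): 2 votes for class 0, 3 for class 1
np.argmax(np.bincount(votes))            # 1, the majority class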
Interactive demo: http://vision.stanford.edu/teaching/cs231n-demos/knn/