
Machine Learning Theory (1): The k-Nearest Neighbors Algorithm

小白学视觉

Preface


KNN (k-Nearest Neighbors) is conceptually simple and uses almost no advanced mathematics, which makes it a very practical entry point to machine learning. It lets us explain many of the details that arise when applying machine-learning algorithms, and it gives a fairly complete picture of the workflow of a machine-learning application.


First, a rough sketch of the idea behind KNN. Suppose we have two classes of data: one class drawn as red points and the other as blue points. Together these points form our training set. Now a new data point arrives, drawn in green. How should we classify it?


Generally, we first fix a value for k. When a new data point arrives, we compute the distance between it and every point in the training set, usually the Euclidean distance.
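
For reference, the Euclidean distance between two n-dimensional points a and b is

d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)

which is exactly what sqrt(np.sum((x_new - x) ** 2)) computes in the code later in this post.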


We then select the k points with the smallest distances. k is usually chosen to be odd, which makes the subsequent vote easy to decide. Among these k points, the class that occurs most often determines the class of the new data point.
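
Put together, the whole procedure fits in a few lines. Here is a minimal sketch (the rest of this post builds exactly this up step by step, then wraps it in a reusable class):

import numpy as np
from collections import Counter

def knn_predict(k, x_train, y_train, x_new):
    # Euclidean distance from x_new to every training point.
    distances = np.sqrt(np.sum((x_train - x_new) ** 2, axis=1))
    # Labels of the k nearest points, then a majority vote.
    topk_y = y_train[np.argsort(distances)[:k]]
    return Counter(topk_y).most_common(1)[0][0]

Calling knn_predict(5, x_train, y_train, x_new) returns the predicted label for x_new.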




KNN Basics


1. First, create the dataset x_train, y_train and a new data point x_new, and visualize them with matplotlib.


import numpy as np
import matplotlib.pyplot as plt

raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

x_new = np.array([8.0936, 3.3657])

plt.scatter(x_train[y_train==0, 0], x_train[y_train==0, 1], color='g')
plt.scatter(x_train[y_train==1, 0], x_train[y_train==1, 1], color='r')
plt.scatter(x_new[0], x_new[1], color='b')
plt.show()


(Figure: the kNN process.)


2. Compute the distances:


from math import sqrt

distances = []
for x in x_train:
    d = sqrt(np.sum((x_new - x) ** 2))
    distances.append(d)

# The loop above can be written in a single line:
# distances = [sqrt(np.sum((x_new - x) ** 2)) for x in x_train]


Output:


[10.888422144185997, 11.825242797930196, 15.18734646375067, 11.660703691887552, 12.89974598548359, 12.707715895864213, 9.398411207752083, 15.62480440229573, 12.345673749536719, 14.394770082568183]


Sort the distances; np.argsort returns the index positions after sorting:


nearest = np.argsort(distances)


Output: array([6, 0, 3, 1, 8, 5, 4, 9, 2, 7], dtype=int64)


Take the k nearest points, say k = 5:


k = 5
topk_y = [y_train[i] for i in nearest[:k]]
topk_y


Output: [1, 0, 0, 0, 1]


From this output we can see that, among the 5 training points nearest to the new data point, three belong to class 0 and two belong to class 1. By majority rule, the new point therefore belongs to class 0!


Voting


from collections import Counter

Counter(topk_y)


Output: Counter({0: 3, 1: 2})


votes = Counter(topk_y)
votes.most_common(1)
y_new = votes.most_common(1)[0][0]
y_new


Output: 0


With that, we have completed a basic kNN!



Writing a kNN Function by Hand


kNN is a machine learning algorithm with no training step; its training dataset can be regarded, approximately, as the model itself.


import numpy as np
from math import sqrt
from collections import Counter

def kNN_classifier(k, x_train, y_train, x_new):
    assert 1 <= k <= x_train.shape[0], "k must be valid"
    assert x_train.shape[0] == y_train.shape[0], \
        "the size of x_train must be equal to the size of y_train"
    assert x_train.shape[1] == x_new.shape[0], \
        "the feature number of x_new must be equal to x_train"

    distances = [sqrt(np.sum((x_new - x) ** 2)) for x in x_train]
    nearest = np.argsort(distances)

    topk_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topk_y)

    return votes.most_common(1)[0][0]


Let's test it:


raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

x_new = np.array([8.0936, 3.3657])

y_new = kNN_classifier(5, x_train, y_train, x_new)
print(y_new)



Using the KNN in sklearn


import numpy as np
from sklearn.neighbors import KNeighborsClassifier

raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

x_new = np.array([8.0936, 3.3657])

knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(x_train, y_train)
y_new = knn_classifier.predict(x_new.reshape(1, -1))
print(y_new[0])



Writing an Object-Oriented KNN by Hand


import numpy as np
from math import sqrt
from collections import Counter

class KNNClassifier():
    def __init__(self, k):
        assert 1 <= k, "k must be valid"
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert self.k <= x_train.shape[0], \
            "the size of x_train must be at least k"
        self._x_train = x_train
        self._y_train = y_train
        return self

    def predict(self, x_new):
        x_new = x_new.reshape(1, -1)
        assert self._x_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert x_new.shape[1] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"
        y_new = [self._predict(x) for x in x_new]
        return np.array(y_new)

    def _predict(self, x):
        assert x.shape[0] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
        nearest = np.argsort(distances)
        topk_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topk_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k


Let's test it:


raw_data_x = [[3.3935, 2.3313],
              [3.1101, 1.7815],
              [1.3438, 3.3684],
              [3.5823, 4.6792],
              [2.2804, 2.8670],
              [7.4234, 4.6965],
              [5.7451, 3.5340],
              [9.1722, 2.5111],
              [7.7928, 3.4241],
              [7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

x_new = np.array([8.0936, 3.3657])

knn_clf = KNNClassifier(6)
knn_clf.fit(x_train, y_train)
y_new = knn_clf.predict(x_new)
print(y_new[0])



Splitting the Dataset


import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    assert x.shape[0] == y.shape[0], "the size of x must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be valid"

    if seed is not None:
        np.random.seed(seed)

    shuffle_idx = np.random.permutation(len(x))

    test_size = int(len(x) * test_ratio)
    test_idx = shuffle_idx[:test_size]
    train_idx = shuffle_idx[test_size:]

    x_train = x[train_idx]
    y_train = y[train_idx]
    x_test = x[test_idx]
    y_test = y[test_idx]

    return x_train, y_train, x_test, y_test



Testing KNN on sklearn's Iris Dataset


import numpy as np
from sklearn import datasets
from knn_clf import KNNClassifier

iris = datasets.load_iris()
x = iris.data
y = iris.target

x_train, y_train, x_test, y_test = train_test_split(x, y)
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(x_train, y_train)
y_predict = my_knn_clf.predict(x_test)
print(sum(y_predict == y_test))
print(sum(y_predict == y_test) / len(y_test))

# We can also use the dataset-splitting method that ships with sklearn.
# Note that it returns the arrays in a different order:
# x_train, x_test, y_train, y_test.
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2, random_state=666)
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(x_train, y_train)
y_predict = my_knn_clf.predict(x_test)
print(sum(y_predict == y_test))
print(sum(y_predict == y_test) / len(y_test))



Testing KNN on sklearn's Handwritten Digits Dataset


First, let's get to know the handwritten digits dataset.


import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
digits.keys()
print(digits.DESCR)

# Pull out the feature matrix and the labels.
x = digits.data
y = digits.target
y.shape
digits.target_names
y[:100]
x[:10]

some_digit = x[666]
y[666]
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=plt.cm.binary)
plt.show()


Next, let's give it a try.


from sklearn import datasets
from shuffle_dataset import train_test_split
from knn_clf import KNNClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, y_train, x_test, y_test = train_test_split(x, y, test_ratio=0.2)
my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(x_train, y_train)
y_predict = my_knn_clf.predict(x_test)
print(sum(y_predict == y_test) / len(y_test))


Let's wrap the accuracy calculation in a function so it is convenient to call.


def accuracy_score(y_true, y_predict):
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    return sum(y_true == y_predict) / len(y_true)


Next, let's fold it into the KNNClassifier class as a score method.


import numpy as np
from math import sqrt
from collections import Counter
from metrics import accuracy_score

class KNNClassifier():
    def __init__(self, k):
        assert 1 <= k, "k must be valid"
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert self.k <= x_train.shape[0], \
            "the size of x_train must be at least k"
        self._x_train = x_train
        self._y_train = y_train
        return self

    def predict(self, x_new):
        assert self._x_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert x_new.shape[1] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"
        y_new = [self._predict(x) for x in x_new]
        return np.array(y_new)

    def _predict(self, x):
        assert x.shape[0] == self._x_train.shape[1], \
            "the feature number of x must be equal to x_train"
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
        nearest = np.argsort(distances)
        topk_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topk_y)
        return votes.most_common(1)[0][0]

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k


In fact, all of this is already packaged in sklearn.


from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(x_train, y_train)
knn_classifier.score(x_test, y_test)



Hyperparameters


k


When is the hyperparameter k optimal? The simplest approach is to try a range of values and keep the one with the best score:


from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(x_train, y_train)
    score = knn_clf.score(x_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score

print("best k =", best_k)
print("best score =", best_score)


Voting Schemes



Consider the figure above (omitted here): the three balls nearest to the green ball are red ball #1, purple ball #3, and blue ball #4. If we decide purely by majority vote among the green ball's k nearest neighbors, the result is currently a three-way tie.


Even when the vote is not tied, the red ball is still the closest to the green one, which suggests giving closer neighbors more influence. A common choice is to weight each neighbor by the inverse of its distance. Suppose the three distances are 1, 3, and 4:


Red: 1. Purple + blue: 1/3 + 1/4 = 7/12.


The purple and blue weights together are still less than the red weight, so the green ball is ultimately assigned to the red class. Weighted voting also effectively resolves ties. The voting scheme is therefore another kNN hyperparameter.
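
To make the weighted vote concrete, here is a minimal sketch of inverse-distance voting, using the hypothetical labels and distances from the example above (red at distance 1, purple at 3, blue at 4):

from collections import Counter

# Hypothetical neighbors from the figure: (label, distance)
neighbors = [("red", 1.0), ("purple", 3.0), ("blue", 4.0)]

# Each neighbor votes with weight 1 / distance instead of weight 1.
votes = Counter()
for label, dist in neighbors:
    votes[label] += 1.0 / dist

print(votes.most_common(1)[0][0])  # red, since 1 > 1/3 + 1/4 = 7/12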


sklearn's kNN already takes this into account: KNeighborsClassifier(n_neighbors=k, weights=?) exposes a weights parameter with two common values, "uniform" (plain majority vote) and "distance" (inverse-distance weighted vote).


from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_method = method
            best_k = k
            best_score = score

print("best method =", best_method)
print("best k =", best_k)
print("best score =", best_score)


p


If we vote by distance, there are many distance metrics to choose from: the Euclidean distance, the Manhattan distance, and more generally the Minkowski distance.
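
For reference, the Minkowski distance between two points a and b is

D(a, b) = (sum_i |a_i - b_i|^p)^(1/p)

where p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. This exponent is exactly the p parameter of KNeighborsClassifier that the search below tunes.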


from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_p = p
            best_k = k
            best_score = score

print("best p =", best_p)
print("best k =", best_k)
print("best score =", best_score)

