Reposted from: ChallengeHub
A. Supervised Learning
1. EDA (Exploratory Data Analysis)
2. K-Nearest Neighbors (KNN)
B. Unsupervised Learning
For supervised learning, see the supervised learning installment of this comprehensive machine learning tutorial.
Unsupervised learning: it works with unlabeled data and discovers the hidden patterns in that data. For example, suppose we have orthopedic patient data without labels; we do not know which patients are normal and which are abnormal.

1. KMeans Clustering

Let's try our first unsupervised method, KMeans clustering.
KMeans clustering: the algorithm iteratively assigns each data point to one of K groups based on the provided features, clustering the points by feature similarity.
KMeans(n_clusters=2): n_clusters=2 means two clusters are created.
First, look at the distribution of the data:
```python
# As you can see, there are no labels in the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('column_2C_weka.csv')
plt.scatter(data['pelvic_radius'], data['degree_spondylolisthesis'])
plt.xlabel('pelvic_radius')
plt.ylabel('degree_spondylolisthesis')
plt.show()
```
Then look at how KMeans clustering partitions the same data:
```python
# KMeans Clustering
data2 = data.loc[:, ['degree_spondylolisthesis', 'pelvic_radius']]
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans.fit(data2)
labels = kmeans.predict(data2)
plt.scatter(data['pelvic_radius'], data['degree_spondylolisthesis'], c=labels)
plt.xlabel('pelvic_radius')
plt.ylabel('degree_spondylolisthesis')
plt.show()
```
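To see what the model actually learned, it helps to overlay the fitted centroids on the scatter plot; `cluster_centers_` is the standard attribute of a fitted sklearn KMeans. A minimal sketch (not in the original post):

```python
# Overlay the two learned centroids on the scatter plot.
# Column order in data2 is [degree_spondylolisthesis, pelvic_radius].
centers = kmeans.cluster_centers_
plt.scatter(data2['pelvic_radius'], data2['degree_spondylolisthesis'], c=labels, alpha=0.5)
plt.scatter(centers[:, 1], centers[:, 0], c='red', marker='x', s=200)  # mind the column order
plt.xlabel('pelvic_radius')
plt.ylabel('degree_spondylolisthesis')
plt.show()
```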
2. Evaluating the Clustering

We have split the data into two groups. Is this the correct clustering? To evaluate it, we will use a cross tabulation table.
```python
# cross tabulation table
df = pd.DataFrame({'labels': labels, 'class': data['class']})
ct = pd.crosstab(df['labels'], df['class'])
print(ct)
```
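The crosstab can also be collapsed into a single number. A rough sketch (purity here is my own shorthand, not something the original post computes): take each cluster's majority class count, sum them, and divide by the number of samples.

```python
# Purity: fraction of points that fall in their cluster's majority class.
purity = ct.max(axis=1).sum() / ct.values.sum()
print('cluster purity: {:.3f}'.format(purity))
```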
One problem above is that we knew in advance how many classes the data contains. But what if we don't? Then this becomes a hyperparameter problem, much like choosing a hyperparameter in KNN or regression.
Lower inertia is better, but not at the cost of too many clusters; balancing the two, we pick the elbow of the inertia curve.

```python
# inertia: plot the elbow curve
inertia_list = np.empty(8)
for i in range(1, 8):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data2)
    inertia_list[i] = kmeans.inertia_
plt.plot(range(1, 8), inertia_list[1:], '-o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
```
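The elbow is read off by eye; the silhouette score from `sklearn.metrics` gives a more quantitative criterion (values near 1 mean well-separated clusters). A sketch along the same lines:

```python
from sklearn.metrics import silhouette_score

# Silhouette score for K = 2..7; higher is better.
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k).fit_predict(data2)
    print(k, silhouette_score(data2, labels_k))
```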
3. Standardization

Standardization matters for both supervised and unsupervised learning: it removes the effect of differing feature scales. (`data3` below is the feature matrix with the class column dropped; the original post does not show its definition.)

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

data3 = data.drop('class', axis=1)  # numeric features only; not defined in the original post
scaler = StandardScaler()
kmeans = KMeans(n_clusters=2)
pipe = make_pipeline(scaler, kmeans)
pipe.fit(data3)
labels = pipe.predict(data3)
df = pd.DataFrame({'labels': labels, 'class': data['class']})
ct = pd.crosstab(df['labels'], df['class'])
print(ct)
```
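A quick way to see why this matters: the raw columns have very different spreads, and after `StandardScaler` every column has mean ≈ 0 and standard deviation ≈ 1.

```python
# Raw features live on very different scales ...
print(data3.std())
# ... StandardScaler brings them all to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data3)
print(scaled.mean(axis=0).round(2), scaled.std(axis=0).round(2))
```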
4. Hierarchical Clustering

Hierarchical clustering repeatedly merges the closest clusters into a tree; the dendrogram below is drawn on a 20-sample slice for readability.

```python
from scipy.cluster.hierarchy import linkage, dendrogram

merg = linkage(data3.iloc[200:220, :], method='single')
dendrogram(merg, leaf_rotation=90, leaf_font_size=6)
plt.show()
```
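The dendrogram only visualizes the merge order. To get flat cluster labels out of the linkage, SciPy's `fcluster` can cut the tree at a chosen number of clusters; a sketch on the same 20-sample slice:

```python
from scipy.cluster.hierarchy import fcluster

# Cut the tree so that at most 2 flat clusters remain.
flat_labels = fcluster(merg, t=2, criterion='maxclust')
print(flat_labels)
```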
5. t-SNE

t-SNE embeds the high-dimensional samples in two dimensions while preserving local neighborhood structure. (`color_list` below colors each point by its true class; the original post does not show its definition.)

```python
from sklearn.manifold import TSNE

model = TSNE(learning_rate=100)
transformed = model.fit_transform(data2)
x = transformed[:, 0]
y = transformed[:, 1]
color_list = ['red' if c == 'Abnormal' else 'green' for c in data['class']]  # color by true class
plt.scatter(x, y, c=color_list)
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.show()
```
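t-SNE is sensitive to its hyperparameters, so it is worth comparing a couple of settings side by side. A sketch, with learning rates 50 and 200 chosen arbitrarily for illustration:

```python
# Compare two t-SNE embeddings with different learning rates.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, lr in zip(axes, [50, 200]):
    emb = TSNE(learning_rate=lr).fit_transform(data2)
    ax.scatter(emb[:, 0], emb[:, 1], c=color_list)
    ax.set_title('learning_rate = {}'.format(lr))
plt.show()
```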
6. PCA

transform(): applies the learned transformation; it can also be applied to a test set.

```python
# PCA
from sklearn.decomposition import PCA

model = PCA()
model.fit(data3)
transformed = model.transform(data3)
print('Principal components:', model.components_)
```
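The raw component vectors are hard to read on their own; `explained_variance_ratio_` reports the fraction of total variance each component captures:

```python
# Fraction of total variance captured by each principal component.
print('variance ratios:', model.explained_variance_ratio_)
```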
```python
# PCA variance
scaler = StandardScaler()
pca = PCA()
pipeline = make_pipeline(scaler, pca)
pipeline.fit(data3)
plt.bar(range(pca.n_components_), pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()
```
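A common rule of thumb is to keep enough components to explain, say, 95% of the variance. sklearn's PCA accepts a float `n_components` for exactly this; a sketch on the scaled data (the 95% threshold is an arbitrary choice):

```python
# Cumulative variance curve ...
cum_var = np.cumsum(pca.explained_variance_ratio_)
print('cumulative variance:', cum_var.round(3))

# ... or let PCA pick the number of components for 95% variance directly.
pca95 = PCA(n_components=0.95)
pca95.fit(StandardScaler().fit_transform(data3))
print('components needed:', pca95.n_components_)
```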
Step 2: intrinsic dimension: the number of features needed to approximate the essential structure of the data. PCA identifies the intrinsic dimension no matter how many features the samples have: it is the number of components with significant variance. Here we keep two components:

```python
# apply PCA with two components
pca = PCA(n_components=2)
pca.fit(data3)
transformed = pca.transform(data3)
x = transformed[:, 0]
y = transformed[:, 1]
plt.scatter(x, y, c=color_list)
plt.show()
```
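Putting the pieces together, one natural follow-up (my own sketch, not from the original post) is to cluster in the 2-dimensional PCA space and check the result against the true classes as before:

```python
# Scale -> project to 2 PCA components -> KMeans with 2 clusters.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), KMeans(n_clusters=2))
pca_labels = pipe.fit_predict(data3)
print(pd.crosstab(pca_labels, data['class']))
```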