The KS Test and Its Applications in Machine Learning

What Is the KS Test

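The Kolmogorov–Smirnov test, or KS test for short, is a nonparametric hypothesis test in statistics. It checks whether a single sample follows a given distribution, or whether two samples follow the same distribution. In the one-sample case, we want to test whether a sample $x_1, \dots, x_n$ follows a distribution with cumulative distribution function $F(x)$. Let $F_n(x)$ denote the empirical distribution function of the sample. We construct the KS statistic:

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

As the figure below illustrates, the KS statistic is the largest gap between the empirical distribution function and the cumulative distribution function of the target distribution:

[Figure: the empirical distribution function and the target CDF, with their largest vertical gap marked]

At the 95% confidence level, the critical value of the KS statistic is given, approximately and for large $n$, by $D_{n,0.05} \approx 1.36/\sqrt{n}$. If the KS statistic computed from the sample is smaller than $D_{n,0.05}$, we accept the null hypothesis (more precisely, we fail to reject it). Otherwise, we reject the null hypothesis.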
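To make the definition concrete, here is a minimal sketch that computes the one-sample KS statistic directly with NumPy and compares it with scipy.stats.kstest. The helper name ks_statistic_one_sample is made up for illustration, and the 1.36/sqrt(n) threshold is the usual large-sample approximation, not an exact critical value.

```python
import numpy as np
from scipy.stats import kstest, norm


def ks_statistic_one_sample(sample, cdf):
    """Return sup_x |F_n(x) - F(x)| for a 1-D sample and a CDF callable (hypothetical helper)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    f = cdf(x)
    # The empirical CDF jumps at every sorted point, so the largest gap
    # can sit just after a jump (d_plus) or just before it (d_minus).
    d_plus = np.max(np.arange(1, n + 1) / n - f)
    d_minus = np.max(f - np.arange(0, n) / n)
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)
sample = rng.normal(size=200)

d_manual = ks_statistic_one_sample(sample, norm.cdf)
d_scipy = kstest(sample, 'norm').statistic
critical_95 = 1.36 / np.sqrt(len(sample))  # large-sample approximation

print("manual:", d_manual, "scipy:", d_scipy, "95% critical value:", critical_95)
```

The two statistics should agree up to floating-point error. Whether d_manual falls below critical_95 depends on the random sample.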

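The Two-Sample KS Test

The same idea lets us test whether two samples follow the same distribution. Here the KS statistic is the largest gap between the empirical distribution functions of the two samples:

$$D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|$$

where $n$ and $m$ are the two sample sizes. In this case the critical value at the 95% confidence level is, approximately and for large samples,

$$D_{n,m,0.05} \approx 1.36 \sqrt{\frac{n+m}{nm}}$$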
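The two-sample statistic can be sketched the same way: evaluate both empirical distribution functions at every pooled data point and take the largest absolute difference. The helper names ks_statistic_two_sample and critical_value_95 are illustrative, and the critical value uses the large-sample approximation 1.36 * sqrt((n + m) / (n * m)).

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_statistic_two_sample(a, b):
    """Largest gap between the empirical CDFs of two 1-D samples (hypothetical helper)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    # ECDF at each pooled point: fraction of the sample that is <= that point.
    ecdf_a = np.searchsorted(a, grid, side='right') / len(a)
    ecdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))


def critical_value_95(n, m):
    """Approximate 95% critical value for sample sizes n and m (large-sample formula)."""
    return 1.36 * np.sqrt((n + m) / (n * m))


rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=50)

d_manual = ks_statistic_two_sample(a, b)
d_scipy = ks_2samp(a, b).statistic
print("manual:", d_manual, "scipy:", d_scipy, "critical:", critical_value_95(100, 50))
```

For sample sizes 100 and 50 the approximate critical value is about 0.236.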

Note: the KS test is only suitable for continuous distributions.




    
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kstest, ks_2samp
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

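How to Run the KS Test in Python

Python's scipy.stats module provides functions for the KS test.

One-Sample Test

The function is scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx'). Its two most important parameters are: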

rvs : str, array or callable
    If a string, it should be the name of a distribution in scipy.stats.
    If an array, it should be a 1-D array of observations of random variables.
    If a callable, it should be a function to generate random variables; it is required to have a keyword argument size.
cdf : str or callable
    If a string, it should be the name of a distribution in scipy.stats.
    If rvs is a string then cdf can be False or the same as rvs.
    If a callable, that callable is used to calculate the cdf.

Returns:
    statistic : float
        KS test statistic, either D, D+ or D-.
    pvalue :  float
        One-tailed or two-tailed p-value.

x = np.random.randn(100)
kstest(x, 'norm')

KstestResult(statistic=0.14648390717722642, pvalue=0.024536061749414313)

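We generate 100 standard normal random numbers and test them against the standard normal distribution. In the run shown above, the KS statistic is about 0.146 and the p-value about 0.025. That p-value is below 0.05, so this particular sample would in fact be rejected at the 5% level. This is a false rejection: when the null hypothesis is true, roughly 5% of runs end this way. In most runs the p-value is above 0.05, and we then accept that the sample follows a normal distribution.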

x = np.random.exponential(size=100)

kstest(x, 'norm')

KstestResult(statistic=0.505410956721057, pvalue=3.4967106846361894e-24)

kstest(x, 'expon')

KstestResult(statistic=0.09854002120537766, pvalue=0.2685899206780503)

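We generate 100 exponentially distributed random numbers. The KS test rejects the hypothesis that they follow a normal distribution (p ≈ 3.5e-24) and accepts the hypothesis that they follow an exponential distribution (p ≈ 0.27).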
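Passing a distribution name such as 'norm' compares the sample with that distribution's default parameters, which for 'norm' means the standard normal. The sketch below shows how the args parameter of kstest can supply other parameters, assuming for illustration that the true mean 5 and standard deviation 2 are known in advance. If the parameters were instead estimated from the same sample, the reported p-value would no longer be valid for the plain KS test.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)  # N(5, 2^2), not standard normal

# Compare with the standard normal N(0, 1): the fit is very poor.
against_standard = kstest(x, 'norm')

# Compare with N(5, 2^2): args is forwarded to the CDF as (loc, scale).
against_shifted = kstest(x, 'norm', args=(5.0, 2.0))

print(against_standard)
print(against_shifted)
```

The statistic against N(5, 2^2) should be much smaller than the one against N(0, 1).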

Two-Sample Test

The function is scipy.stats.ks_2samp(data1, data2, alternative='two-sided', mode='auto'). Parameters:

data1, data2 : sequence of 1-D ndarrays
    Two arrays of sample observations assumed to be drawn from a continuous distribution; sample sizes can be different.

Returns:
    statistic : float
        KS statistic.
    pvalue : float
        Two-tailed p-value.

x = np.random.randn(100)
y = np.random.randn(50)
ks_2samp(x, y)

Ks_2sampResult(statistic=0.11, pvalue=0.804177768619009)

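The statistic 0.11 is below the approximate 95% critical value $1.36\sqrt{(100+50)/(100 \cdot 50)} \approx 0.236$, and the p-value is about 0.80. We therefore accept the null hypothesis and conclude that x and y follow the same distribution.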

x = np.random.randn(100)
y = np.random.exponential(size=50)
ks_2samp(x, y)

Ks_2sampResult(statistic=0.59, pvalue=3.444644569583488e-11)

Here we reject the hypothesis that x and y follow the same distribution.

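Applications of the KS Test in Machine Learning

Application 1: Checking Whether a Feature Has the Same Distribution in the Training and Test Sets

Feature shift comes up often in machine learning tasks: the distribution of the online data does not match the distribution of the offline data, and the model generalizes poorly as a result. One tool for checking whether two datasets share the same distribution is the KS test.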

X, y = datasets.make_classification(n_samples=10000, n_features=5,
                                    n_informative=2, n_redundant=2, random_state=2020)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.4, random_state=2020)

for i in range(5):
    print(ks_2samp(X_train[:, i], X_test[:, i]))

Ks_2sampResult(statistic=0.013083333333333334, pvalue=1.0)
Ks_2sampResult(statistic=0.013083333333333334, pvalue=1.0)
Ks_2sampResult(statistic=0.008916666666666666, pvalue=1.0)
Ks_2sampResult(statistic=0.012916666666666667, pvalue=1.0)
Ks_2sampResult(statistic=0.013583333333333333, pvalue=1.0)

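We randomly generate a dataset with 5 features and 10,000 samples, split it into training and test sets, and compare the training and test distributions of each feature. Every feature passes the KS test here, which is no surprise because the split is random.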
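For contrast, here is a minimal sketch of what the check looks like when one feature really has shifted. The helper drifted_features and the 0.5 offset added to feature 1 are made up for illustration; the data is synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(train, test, alpha=0.05):
    """Return (column, statistic, p-value) for columns whose KS p-value is below alpha."""
    flagged = []
    for j in range(train.shape[1]):
        result = ks_2samp(train[:, j], test[:, j])
        if result.pvalue < alpha:
            flagged.append((j, result.statistic, result.pvalue))
    return flagged


rng = np.random.default_rng(2020)
train = rng.normal(size=(6000, 3))
test = rng.normal(size=(4000, 3))
test[:, 1] += 0.5  # simulate a shift in feature 1 only

flagged = drifted_features(train, test)
for column, statistic, pvalue in flagged:
    print(f"feature {column}: statistic={statistic:.3f}, pvalue={pvalue:.3g}")
```

Feature 1 should be flagged. Because each feature is tested separately at level alpha, an unshifted feature can also be flagged by chance now and then, so with many features a stricter threshold may be worth considering.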

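Application 2: Checking Whether a Binary Classifier Separates Positive and Negative Samples Well

In credit scoring, the KS statistic is used to measure how well a binary classifier separates positive and negative samples. On the test set, take the model's predicted probabilities for the samples with y_true=1 as data1 and its predicted probabilities for the samples with y_true=0 as data2, then compute the KS statistic between the two distributions. As an example, we fit a logistic regression (lr) on the data above and plot the distributions of the predicted probabilities for the positive and negative test samples.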

lr = LogisticRegression(solver='liblinear')

lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

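# Predicted P(y=1) for the test samples with y_true=1, sorted for plotting
data1 = np.sort(lr.predict_proba(X_test[y_test==1])[:, 1])

# Predicted P(y=1) for the test samples with y_true=0, sorted for plotting
data2 = np.sort(lr.predict_proba(X_test[y_test==0])[:, 1])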

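plt.figure(figsize=(8, 4))

# Empirical CDF of data1, drawn in black as horizontal steps
last, i = 0, 0
while i < len(data1):
    plt.plot([last, data1[i]], [i/len(data1), i/len(data1)], 'k')
    last = data1[i]
    i += 1

# Empirical CDF of data2, drawn in red as horizontal steps
last, i = 0, 0
while i < len(data2):
    plt.plot([last, data2[i]], [i/len(data2), i/len(data2)], 'r')
    last = data2[i]
    i += 1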

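The largest gap between these two curves is the KS statistic we are after. The larger the gap, the better the model distinguishes positive samples from negative ones.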

ks_2samp(data1, data2)

Ks_2sampResult(statistic=0.9219219219219219, pvalue=0.0)

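In the output above, the KS statistic is even above 0.9. As a rule of thumb, a KS statistic above 0.6 already indicates fairly strong classification ability. The exact value depends on the fitted model and on the arrays passed as data1 and data2, so a rerun may print a different number.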
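In credit scoring the same number is often read off the ROC curve as the largest value of TPR minus FPR. The sketch below checks that this agrees with ks_2samp when data1 holds the predicted P(y=1) for the positives and data2 the predicted P(y=1) for the negatives. It rebuilds the dataset and model with the same settings as above so that it runs on its own; the variable names are illustrative, and the equality assumes the model ranks positives above negatives on the whole.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=5, n_informative=2,
                           n_redundant=2, random_state=2020)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=2020)

model = LogisticRegression(solver='liblinear').fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted P(y=1) for every test sample

# KS statistic between the score distributions of positives and negatives
pos_scores = scores[y_test == 1]
neg_scores = scores[y_test == 0]
ks_from_samples = ks_2samp(pos_scores, neg_scores).statistic

# The same quantity from the ROC curve: the largest gap between TPR and FPR
fpr, tpr, _ = roc_curve(y_test, scores, drop_intermediate=False)
ks_from_roc = np.max(tpr - fpr)

print("ks_2samp:", ks_from_samples, "max(TPR - FPR):", ks_from_roc)
```

Reading the statistic off the ROC curve also yields the score threshold at which the gap is largest, which can be useful when choosing a cutoff.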
