【机器学习】详解高斯混合模型

当我们谈论高斯混合模型（后面，简写为 GMM）时，了解 KMeans 算法的工作原理是必不可少的。因为 GMM 与 KMeans 非常相似，所以它更可能是 KMeans 的概率版本。这种概率特征使 GMM 可以应用于 KMeans 无法解决的许多复杂问题。

总之，KMeans 有以下限制：

1.它假设类簇是球形的并且大小相同，这在大多数现实世界的场景中是无效的。

2.它是一种硬聚类方法。这意味着每个数据点只分配给一个簇。

由于这些限制，我们在进行机器学习项目时应该了解 KMeans 的替代方案。在本文中，我们将探索 KMeans 聚类的最佳替代方案之一，称为高斯混合模型。

通过这篇文章，我们将了解以下几点。

1.高斯混合模型 (GMM) 算法的工作原理——通俗易懂。

2.GMM 背后的数学。

3.从头开始使用 Python 实现 GMM。

高斯混合模型 (GMM) 算法的工作原理

正如我前面提到的，我们可以将 GMM 称之为概率版本的 KMeans，因为 KMeans 和 GMM 的初始化和训练过程是相同的。但是，KMeans 使用基于距离的方法，而 GMM 使用概率方法。GMM 中有一个主要假设：数据集由多个高斯分布生成，换句话说，是高斯分布的混合。

上述分布通常称为多模型分布。每个峰代表我们数据集中不同的高斯分布或聚类。但问题是，

我们如何估计这些分布？

在回答这个问题之前，让我们先创建一些高斯分布。请注意，我正在生成多元正态分布；它是单变量正态分布的更高维扩展。

让我们定义数据点的均值和协方差。使用均值和协方差，我们可以生成如下分布。

# Set the mean and covariancemean1 = [0, 0]mean2 = [2, 0]cov1 = [[1, .7], [.7, 1]]cov2 = [[.5, .4], [.4, .5]]
# Generate data from the mean and covariancedata1 = np.random.multivariate_normal(mean1, cov1, size=1000)data2 = np.random.multivariate_normal(mean2, cov2, size=1000)

让我们绘制数据

plt.figure(figsize=(10,6))
plt.scatter(data1[:,0],data1[:,1])plt.scatter(data2[:,0],data2[:,1])
sns.kdeplot(data1[:, 0], data1[:, 1], levels=20, linewidth=10, color='k', alpha=0.2)sns.kdeplot(data2[:, 0], data2[:, 1], levels=20, linewidth=10, color='k', alpha=0.2)
plt.grid(False)plt.show()

正如您在这里看到的，我们使用均值和协方差矩阵生成了随机高斯分布。如果反转这个过程怎么样？这正是 GMM 正在做的事情。但是如何实现呢？

因为，一开始，我们对于类簇以及相关的均值和协方差矩阵没有任何了解。

好吧，它是根据以下步骤发生的，

1.为给定的数据集确定聚类的数量（为此，我们可以使用领域知识或其他方法，例如 BIC/AIC）。假设我们有 1000 个数据点，我们将组数设置为 2。

2.初始化每个集群的均值、协方差和权重参数。（我们将在后面的部分中对此进行更多探讨）

3.使用期望最大化算法执行以下操作，

期望步骤（E步）：计算每个数据点属于每个分布的概率，然后使用参数的当前估计评估似然函数
最大化步骤（M步骤）：更新之前的均值、协方差和权重参数，以最大化E步骤中找到的期望似然函数
重复这些步骤，直到模型收敛。

有了这些信息，就结束了对 GMM 算法的非数学解释。

GMM 背后的数学

GMM 的核心在于上一节中描述的期望最大化 (EM) 算法。让我们演示一下 EM 算法在 GMM 中的应用。

步骤01：初始化均值、协方差和权重参数

1.mean (μ): 随机初始化。

2.协方差 (Σ)：随机初始化

3.权重（混合系数）(π)：每个类的分数是指特定数据点属于每个类的可能性。一开始，这对所有类簇都是相同的。假设我们用三个分量拟合 GMM。在这种情况下，每个组件的权重参数可能设置为 1/3，从而导致概率分布为 (1/3, 1/3, 1/3)。

步骤02：期望步骤（E步骤）

对于每个数据点𝑥𝑖：

使用以下等式计算数据点属于簇 (𝑐) 的概率。k 分布的个数。

其中𝜋_𝑐是高斯分布c的混合系数（有时称为权重），它在前一阶段被初始化，𝑁（𝒙|𝝁，𝚺）描述了高斯分布的概率密度函数（PDF），均值为𝜇和关于数据点 x 的协方差 Σ；我们可以如下表示。

E-step 使用模型参数的当前估计值计算这些概率。这些概率通常称为高斯分布的“责任”。它们由变量 r_ic 表示，其中 i 是数据点的索引，c 是高斯分布的索引。责任衡量第 c 个高斯分布生成第 i 个数据点的可能性。这里使用条件概率，更具体地说，是贝叶斯定理。

让我们举一个简单的例子。假设我们有 100 个数据点，需要将它们聚类成两组。我们可以这样写 r_ic(i=20,c=1) 。其中 i 代表数据点的索引，c 代表我们正在考虑的簇的索引。

请注意，在开始时，𝜋_𝑐 初始化为每个簇 c = 1,2,3,..,k 相等。在我们的例子中，𝜋_1 = 𝜋_2 = 1/2。

E-step 的结果是数值集合，集合中每个值代表每个数据点和每个高斯分布的责任大小。这些责任在 M 步中用于更新模型参数的估计。

步骤03：最大化步骤（M步）

在此步骤中，算法使用高斯分布的责任（在 E 步中计算）来更新模型参数的估计值。

M-step按如下方式更新参数的估计：

1.使用上面的公式4 更新 πc（混合系数）。

2.使用上面公式 5 更新 μc。

3.然后使用公式6 更新 Σc。

这个更新的估计在下一个 E-step 中用于计算数据点的新的责任值。

依此类推，此过程将重复直到算法收敛，也就是模型参数从一次迭代到下一次迭代没有显着变化。

让我们将以上内容总结成一个简单的图表，

在编码方面不用担心; 每个方程式只需要一行代码就能实现。让我们开始使用 Python 从头开始实现 GMM。

从头开始使用 Python 实现 GMM。

首先，让我们创建一个假数据集。在本节中，我将为一维数据集实现 GMM。

import numpy as np
n_samples = 100mu1, sigma1 = -5, 1.2 mu2, sigma2 = 5, 1.8 mu3, sigma3 = 0, 1.6 
x1 = np.random.normal(loc = mu1, scale = np.sqrt(sigma1), size = n_samples)x2 = np.random.normal(loc = mu2, scale = np.sqrt(sigma2), size = n_samples)x3 = np.random.normal(loc = mu3, scale = np.sqrt(sigma3), size = n_samples)
X = np.concatenate((x1,x2,x3))

让我们创建一个辅助函数来绘制我们的数据。

from scipy.stats import norm
def plot_pdf(mu,sigma,label,alpha=0.5,linestyle='k--',density=True):    """    Plot 1-D data and its PDF curve.
    """    # Compute the mean and standard deviation of the data
    # Plot the data        X = norm.rvs(mu, sigma, size=1000)        plt.hist(X, bins=50, density=density, alpha=alpha,label=label)
    # Plot the PDF    x = np.linspace(X.min(), X.max(), 1000)    y = norm.pdf(x, mu, sigma)    plt.plot(x, y, linestyle)

绘制生成的数据如下。请注意，我没有绘制数据本身，而是绘制了每个样本的概率密度。

plot_pdf(mu1,sigma1,label=r"$\mu={} \ ; \ \sigma={}$".format(mu1,sigma1))plot_pdf(mu2,sigma2,label=r"$\mu={} \ ; \ \sigma={}$".format(mu2,sigma2))plot_pdf(mu3,sigma3,label=r"$\mu={} \ ; \ \sigma={}$".format(mu3,sigma3))plt.legend()plt.show()

让我们构建上一节中描述的每个步骤，

步骤 01：初始化均值、协方差和权重

def random_init(n_compenents):        """Initialize means, weights and variance randomly       and plot the initialization    """        pi = np.ones((n_compenents)) / n_compenents    means = np.random.choice(X, n_compenents)    variances = np.random.random_sample(size=n_compenents)
    plot_pdf(means[0],variances[0],'Random Init 01')    plot_pdf(means[1],variances[1],'Random Init 02')    plot_pdf(means[2],variances[2],'Random Init 03')        plt.legend()    plt.show()        return means,variances,pi

步骤02：期望步骤（E步骤）

def step_expectation(X,n_components,means,variances):    """E Step        Parameters    ----------    X : array-like, shape (n_samples,)        The data.    n_components : int        The number of clusters    means : array-like, shape (n_components,)        The means of each mixture component.    variances : array-like, shape (n_components,)        The variances of each mixture component.            Returns    -------    weights : array-like, shape (n_components,n_samples)    """    weights = np.zeros((n_components,len(X)))    for j in range(n_components):        weights[j,:] = norm(loc=means[j],scale=np.sqrt(variances[j])).pdf(X)    return weights

在这个函数之后，我们介绍了我们在 E 步骤中讨论的前两个方程。在这里，我们为当前模型参数均值和方差生成了高斯分布。我们通过使用 scipy 的 stat 模块实现了这一点。之后，我们使用pdf方法计算属于每个簇的每个数据点的可能性。

步骤03：最大化步骤（M步）

def step_maximization(X,weights,means,variances,n_compenents,pi):    """M Step        Parameters    ----------    X : array-like, shape (n_samples,)        The data.    weights : array-like, shape (n_components,n_samples)        initilized weights array    means : array-like, shape (n_components,)        The means of each mixture component.    variances : array-like, shape (n_components,)        The variances of each mixture component.    n_components : int        The number of clusters    pi: array-like (n_components,)        mixture component weights            Returns    -------    means : array-like, shape (n_components,)        The means of each mixture component.    variances : array-like, shape (n_components,)        The variances of each mixture component.    """    r = []    for j in range(n_compenents):  
        r.append((weights[j] * pi[j]) / (np.sum([weights[i] * pi[i] for i in range(n_compenents)], axis=0)))
        #5th equation above        means[j] = np.sum(r[j] * X) / (np.sum(r[j]))                #6th equation above        variances[j] = np.sum(r[j] * np.square(X - means[j])) / (np.sum(r[j]))                #4th equation above        pi[j] = np.mean(r[j])
    return variances,means,pi

让我们实现训练循环。

def train_gmm(data,n_compenents=3,n_steps=50, plot_intermediate_steps_flag=True):    """ Training step of the GMM model        Parameters    ----------    data : array-like, shape (n_samples,)        The data.    n_components : int        The number of clusters    n_steps: int        number of iterations to run    """        #intilize model parameters at the start    means,variances,pi = random_init(n_compenents)
    for step in range(n_steps):        #perform E step        weights = step_expectation(data,n_compenents,means,variances)        #perform M step        variances,means,pi = step_maximization(X, weights, means, variances, n_compenents, pi)
    plot_pdf(means,variances)

当我们开始模型训练时，我们会根据我们设置的n_steps参数做E步和M步。但在实际用例中，你会更频繁地使用 GMM 的 scikit-learn 版本。在那里您可以找到其他参数，例如

tol：定义模型的停止标准。当下限平均增益低于 tol 参数时，EM 迭代将停止。init_params：用于初始化权重的方法

好吧，让我们看看我们手工制作的 GMM 表现如何。

在上图中，红色虚线表示原始分布，而其他图表示学习到的分布。第 30 次迭代后，我们可以看到我们的模型在这个玩具数据集上表现良好。

总结

本文旨在对高斯混合模型进行全面的指导；但是，我们鼓励读者尝试不同的机器学习算法，因为没有一种最佳算法可以很好地解决所有问题。此外，我们选择的机器学习算法的复杂性也值得注意。 GMM 的一个常见问题是它不能很好地扩展到大型数据集。





    
往期精彩回顾



    




适合初学者入门人工智能的路线及资料下载



    
(图文+视频)机器学习入门系列下载
机器学习及深度学习笔记等资料打印
《统计学习方法》的代码复现专辑

机器学习交流qq群955171419，加入微信群请扫码