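Sentiment Analysis of iPhone Reviews (with Python Source Code and Review Data)

WeChat official account 大数据挖掘DT数据分析 (ID: datadw)

First, scrape the review data from the web pages. Each page holds ten reviews, and the result is written to a single txt file.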

Data link

Reply to the official account datadw with the keyword “苹果” to get the data.

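The first approach uses existing dictionaries.

Prepare four dictionaries: stop words, negation words, degree adverbs, and sentiment words. They are available the same way:

Reply to the official account datadw with the keyword “苹果” to get them.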

Import the data and segment it into words
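
The original listing for this step is not available here, so the following is only a minimal sketch of how loading a word list and segmenting a review might look. The helper names `load_word_set` and `tokenize` are hypothetical, the one-word-per-line file layout is an assumption about the dictionaries, and `jieba` is assumed as the segmenter (it is imported only if no other `cut` function is passed).

```python
def load_word_set(path):
    """Read one word per line from a UTF-8 text file into a set (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def tokenize(review, stopwords, cut=None):
    """Segment one review, then drop stop words and whitespace-only tokens."""
    if cut is None:
        import jieba  # third-party segmenter, imported only when it is needed
        cut = jieba.lcut
    return [w for w in cut(review) if w.strip() and w not in stopwords]


# Demo on an already-segmented string, so no segmenter has to be installed.
stopwords = {"的", "，"}
tokens = tokenize("手机 的 屏幕 很 好 ，", stopwords, cut=str.split)
print(tokens)
```

On real reviews one would leave `cut` unset and build `stopwords` with `load_word_set` from the stop-word dictionary file.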

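Next, compute the score. Note that degree adverbs and negation words only modify the sentiment word that follows them, which is the first weakness of this method. The second is that it cannot tell when a nominally negative word is actually used as praise. The third is that longer sentences tend to get higher scores, so the score should probably be divided by the total number of words.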
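
The scoring rule described above can be sketched as follows. This is not the author's code: the function name `sentiment_score`, the toy dictionaries, and the choice to let pending modifiers carry over to the next sentiment word are all assumptions made for illustration. The `normalize` flag applies the suggested division by the number of words.

```python
def sentiment_score(words, sentiment, degree, negations, normalize=False):
    """Sum sentiment values, each scaled by the modifiers that precede it."""
    score = 0.0
    weight = 1.0  # product of the modifiers seen since the last sentiment word
    for word in words:
        if word in negations:
            weight = -weight              # a negation flips the sign
        elif word in degree:
            weight *= degree[word]        # a degree adverb scales the strength
        elif word in sentiment:
            score += weight * sentiment[word]
            weight = 1.0                  # modifiers apply to one sentiment word only
    if normalize and words:
        score /= len(words)               # reduce the bias toward long reviews
    return score


# Toy dictionaries (hypothetical values, not taken from the real dictionaries)
sentiment = {"好": 1.0, "差": -1.0}
degree = {"非常": 2.0}
negations = {"不"}

raw = sentiment_score(["屏幕", "非常", "好"], sentiment, degree, negations)
negated = sentiment_score(["电池", "不", "好"], sentiment, degree, negations)
per_word = sentiment_score(["屏幕", "非常", "好"], sentiment, degree, negations,
                           normalize=True)
print(raw, negated, per_word)
```

A modifier that comes after the sentiment word, or at the end of the review, never takes effect in this sketch, which is the first weakness mentioned above.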

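Sorting and plotting the scores shows that positive scores are in the majority and negative ones are fewer, which agrees with the ratings on the original page. However, half of the 1-star reviews get a positive score, and a quarter of the 5-star reviews get a negative one. The dictionary approach depends heavily on the quality of the dictionaries, and the weaknesses listed above can also bias the scores, so the next step is to try word2vec.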
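
A check of this kind, comparing the sign of each score with the star rating, could look like the sketch below. The function name `share_by_sign` is hypothetical and the numbers are made-up toy data, not the results reported above.

```python
from collections import defaultdict


def share_by_sign(scores, stars):
    """Per star rating, return (share with score > 0, share with score < 0)."""
    counts = defaultdict(lambda: [0, 0, 0])  # positive, negative, total
    for score, star in zip(scores, stars):
        entry = counts[star]
        entry[0] += int(score > 0)
        entry[1] += int(score < 0)
        entry[2] += 1
    return {star: (pos / total, neg / total)
            for star, (pos, neg, total) in counts.items()}


# Toy data for illustration only
scores = [0.5, -1.0, -2.0, -0.5, 2.0, 3.0]
stars = [1, 1, 1, 1, 5, 5]
shares = share_by_sign(scores, stars)
print(shares)
```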

The word vectors are trained as follows:

import logging

from gensim.models import KeyedVectors, word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the corpus: a stream of already-segmented, whitespace-separated words
sentences = word2vec.Text8Corpus("corpus.csv")
# Train 400-dimensional vectors (vector_size is called size in gensim < 4.0).
# Skip-gram, which predicts the surrounding words from a word, requires sg=1;
# the default sg=0 used here trains CBOW.
model = word2vec.Word2Vec(sentences, vector_size=400)
# Save the model for reuse
model.save("corpus.model")

# Reload the model and export its vectors in the binary word2vec format
model = word2vec.Word2Vec.load("corpus.model")
model.wv.save_word2vec_format("corpus.model.bin", binary=True)
# Load only the word vectors; model[word] now returns the vector of a word
model = KeyedVectors.load_word2vec_format("corpus.model.bin", binary=True)

Load the ratings

Convert each review into a vector. Two reviews fail to convert and are removed.

import numpy as np


def getWordVecs(wordList):
    """Look up the vector of every word the model knows; skip unknown words."""
    vecs = []
    for word in wordList:
        try:
            vecs.append(model[word])
        except KeyError:
            continue
    return np.array(vecs, dtype='float')


def buildVecs(reviews):
    """Represent each review by the mean of its word vectors."""
    vectors, kept = [], []
    for i, words in enumerate(reviews):
        wordVecs = getWordVecs(words)
        # A review with no known word yields no vector and is skipped
        if len(wordVecs) != 0:
            vectors.append(wordVecs.mean(axis=0))
            kept.append(i)
    return np.array(vectors), kept


X, kept = buildVecs(t)
# Reviews 327 and 408 failed. Keep only the labels of the reviews that
# produced a vector, so that X and y stay aligned.
y = np.array([y[i] for i in kept])
Reduce the dimensionality with PCA and classify with an SVM

import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Plot the PCA spectrum
pca = PCA(n_components=400)
pca.fit(X)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
plt.show()

# Keep the first 100 principal components and split the reduced data
X_reduced = PCA(n_components=100).fit_transform(X)
X_reduced_train, X_reduced_test, y_reduced_train, y_reduced_test = train_test_split(
    X_reduced, y, test_size=0.33, random_state=42)

# Train the SVM and measure accuracy on the held-out third
clf = SVC(C=2, probability=True)
clf.fit(X_reduced_train, y_reduced_train)
pred = clf.predict(X_reduced_test)
scores = []
scores.append(metrics.accuracy_score(y_reduced_test, pred))
print(scores)

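After dimensionality reduction the accuracy is 0.83, about the same as the 0.823 obtained with an MLP neural network; the MLP code follows. A word2vec-based approach depends on the vocabulary size of the training corpus. I printed some of the words that failed, shown below; they do not appear in the corpus, so the vector representation loses part of the information.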
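
The check for words that are missing from the corpus can be sketched as below. This is an illustration rather than the author's code: the function name `out_of_vocabulary` is hypothetical, and `vocabulary` stands for any container that supports `in`, here a plain set of toy words.

```python
from collections import Counter


def out_of_vocabulary(tokenized_reviews, vocabulary):
    """Count the tokens that have no entry in the embedding vocabulary."""
    missing = Counter()
    for words in tokenized_reviews:
        missing.update(w for w in words if w not in vocabulary)
    return missing


# Toy vocabulary and reviews for illustration only
vocabulary = {"手机", "屏幕", "好"}
reviews = [["手机", "屏幕", "好"], ["续航", "好"], ["续航", "给力"]]
missing = out_of_vocabulary(reviews, vocabulary)
print(missing.most_common())
```

Words that appear often in this list are the ones whose absence costs the sentence vectors the most information.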

Original post: http://blog.csdn.net/Jemila/article/details/62887907?locationNum=7&fps=1
