How do machine learning models handle unseen data and unseen labels?

Prasanth Regupathy • 4 years ago • 599 views

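I am trying to solve a text classification problem. I have a limited number of labels, and they capture only some of the categories my text data can fall into. If an incoming text does not fit any of the labels, it should be tagged as "other". In the example below, I built a text classifier that labels text as either "breakfast" or "italian". In the test set, I included a couple of texts that do not fit either of the labels I used for training. This is the challenge I am facing. Ideally, I want the model to say "other" for "i like hiking" and "everyone should understand maths". How can I do this?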

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])

y_train_text = ["breakfast","breakfast","italian","italian","italian",
                "italian","breakfast","italian","breakfast","italian"]

X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])

classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)

['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
 [0.52943091 0.47056909]
 [0.52669142 0.47330858]
 [0.42787443 0.57212557]
 [0.4        0.6       ]]
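
A note on the last row of the output: [0.4, 0.6] looks like the class prior rather than evidence from the text. 4 of the 10 training texts are "breakfast" and 6 are "italian", and none of the words in "everyone should understand maths" occur in the training vocabulary, so the classifier has nothing but the prior to go on. The following is a minimal check of that reading, re-creating the pipeline from the question; it is a sketch for inspection, not a fix.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Training data copied from the question.
X_train = ["coffee is my favorite drink",
           "i like to have tea in the morning",
           "i like to eat italian food for dinner",
           "i had pasta at this restaurant and it was amazing",
           "pizza at this restaurant is the best in nyc",
           "people like italian food these days",
           "i like to have bagels for breakfast",
           "olive oil is commonly used in italian cooking",
           "sometimes simple bread and butter works for breakfast",
           "i liked spaghetti pasta at this italian restaurant"]
y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())]).fit(X_train, y_train_text)

clf = classifier.named_steps['clf']

# Class prior learned from label frequencies (fit_prior=True is the default).
prior = np.exp(clf.class_log_prior_)

# A text with no in-vocabulary words becomes an all-zero feature row,
# so its posterior collapses to the prior.
unseen = classifier.predict_proba(["everyone should understand maths"])[0]

print(clf.classes_)  # label order of the probability columns
print(prior)
print(unseen)
```

So a probability of 0.6 for "italian" here says only that "italian" is the more frequent training label.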

I consider the "other" category to be noise, so I cannot model it as a class of its own.
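
One possible direction, given that "other" cannot be trained as a class, is to add a rule on top of the classifier: if a text contains too few words the model has ever seen, return "other" instead of a class label. The sketch below assumes the pipeline from the question. The helper `predict_with_other` and its parameters `min_known_tokens` and `other_label` are hypothetical names introduced for illustration and are not part of scikit-learn.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Training and test data copied from the question.
X_train = ["coffee is my favorite drink",
           "i like to have tea in the morning",
           "i like to eat italian food for dinner",
           "i had pasta at this restaurant and it was amazing",
           "pizza at this restaurant is the best in nyc",
           "people like italian food these days",
           "i like to have bagels for breakfast",
           "olive oil is commonly used in italian cooking",
           "sometimes simple bread and butter works for breakfast",
           "i liked spaghetti pasta at this italian restaurant"]
y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]
X_test = ['this is an amazing italian place. i can go there every day',
          'i like this place. i get great coffee and tea in the morning',
          'bagels are great here',
          'i like hiking',
          'everyone should understand maths']

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())]).fit(X_train, y_train_text)


def predict_with_other(pipeline, texts, min_known_tokens=1, other_label="other"):
    """Hypothetical helper: fall back to `other_label` when a text has
    fewer than `min_known_tokens` in-vocabulary word occurrences."""
    # Raw counts of words that exist in the training vocabulary.
    counts = pipeline.named_steps['vectorizer'].transform(texts)
    known = np.asarray(counts.sum(axis=1)).ravel()
    # Cast to object so the fallback label is never truncated.
    labels = pipeline.predict(texts).astype(object)
    labels[known < min_known_tokens] = other_label
    return labels


result = predict_with_other(classifier, X_test)
for text, label in zip(X_test, result):
    print(f"{label:10s} {text}")
```

With `min_known_tokens=1` this catches "everyone should understand maths", which shares no words with the training data. It does not catch "i like hiking", because "like" is in the training vocabulary. Raising `min_known_tokens` or dropping generic words before counting might help, but either choice would need tuning on held-out data. A plain cutoff on the maximum probability does not look sufficient on its own here: in the output above, "i like hiking" scores 0.57 while the genuine coffee-and-tea example scores only 0.53.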

Article URL: http://www.python88.com/topic/38089
 