Using counts and tfidf as features with scikit-learn

Posted by wz3gfoph on 2023-01-02 in Other

I'm trying to use both counts and tfidf as features for a multinomial Naive Bayes model.

text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)

tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer)]).fit(text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)

But I get an error when using FeatureUnion with tfidf:

TypeError: no supported conversion for types: (dtype('S18413'),)

Any idea why this happens? Is it not possible to have both counts and tfidf as features?

Source: https://stackoverflow.com/questions/27260799/using-counts-and-tfidf-as-features-with-scikit-learn

1 Answer

nlejzf6q1#

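The error doesn't come from FeatureUnion; it comes from TfidfTransformer.
Use TfidfVectorizer instead of TfidfTransformer. The transformer expects a matrix of term counts (for example, the output of CountVectorizer) as input, not raw text, hence the TypeError.
Also, your test sentences are too small for testing tf-idf: with only two documents, no term can reach min_df=3. Try a larger corpus; here's an example: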

from nltk.corpus import brown

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB

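# Get more text from NLTK: the first 100 sentences of the Brown corpus.
# (Requires the corpus locally; run nltk.download("brown") once beforehand.)
text = [" ".join(i) for i in brown.sents()[:100]]
# Arbitrary placeholder labels, just so the classifier has two classes to fit.
labels = ['yes']*50 + ['no']*50
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
# TfidfVectorizer tokenizes the raw text itself, then applies tf-idf weighting.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# FeatureUnion fits both vectorizers on the text and stacks their outputs side by side.
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)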
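If you would rather keep TfidfTransformer, one option is to give it the count matrix it expects by chaining it after its own CountVectorizer in a Pipeline, and to use that pipeline as one branch of the FeatureUnion. The following is a minimal sketch of that idea, not part of the original answer; the six-sentence corpus, the labels, and the variable names are made up for illustration, and both vectorizers use default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Tiny made-up corpus and labels, for illustration only.
text = [
    "free prize click now",
    "win a free prize today",
    "claim your free money now",
    "meeting moved to monday",
    "lunch on monday at noon",
    "notes from the monday meeting",
]
labels = [1, 1, 1, 0, 0, 0]

# Branch 1: raw term counts.
counts_branch = CountVectorizer()

# Branch 2: TfidfTransformer fed with counts, not raw text.
tfidf_branch = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
])

# FeatureUnion passes the same raw text to each branch and
# concatenates the resulting matrices column-wise.
union = FeatureUnion([("counts", counts_branch), ("tfidf", tfidf_branch)])
features = union.fit_transform(text)

classifier = MultinomialNB()
classifier.fit(features, labels)

# New text must go through the same fitted union before prediction.
prediction = classifier.predict(union.transform(["free prize now"]))
print(features.shape, prediction)
```

Because both branches build the same vocabulary here, the combined matrix has twice as many columns as there are distinct terms: one block of raw counts followed by one block of tf-idf weights.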
