Python WordCloud: problem with displaying bigrams

Posted by j2cgzkjk on 2023-08-02 in Python


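I want to build a word cloud from scraped Twitter data. The problem is that the word "states" appears 214 times and "state" 64 times. The combination "United States" appears in only one tweet. Nevertheless, my word cloud is built around this combination instead of the correct single words.
My code for generating the word cloud: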

import re  # needed by the cleaning steps below

raw_tweets = []

STOPWORDS = [
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
    'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
    'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'would', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
    'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
    'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
    'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
    'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
    'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
    'very', 't', 'can', 'will', 'just', 'don', 'should', 'now'
]

for tweet in df['Tweet']:
    raw_tweets.append(tweet)

raw_string = ''.join(raw_tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)

words = no_special_characters.split(" ")
words = [w for w in words if len(w) > 2]
words = [w.lower() for w in words]

import numpy as np
import matplotlib.pyplot as plt
import re
from PIL import Image
from wordcloud import WordCloud
from IPython.display import Image as im

mask = np.array(Image.open('Logo_location')) 

wc = WordCloud(background_color="white", max_words=2000, mask=mask, stopwords=STOPWORDS, relative_scaling=1)
wc.generate(','.join(words))

f = plt.figure(figsize=(13,13))
plt.imshow(wc, interpolation='bilinear')
plt.title('Twitter Generated Cloud', size=30)
plt.axis("off")
plt.show()

Generated word cloud: [image]


Source: https://stackoverflow.com/questions/76794846/wordcloud-problem-with-displaying-bigrams

1 Answer


1sbrub3j #1

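"The word "states" appears 214 times and "state" 64 times. The combination "United States" appears in only one tweet. Nevertheless, my word cloud is built around this combination."
You are generating a word cloud, not a key-phrase cloud. With this particular mask the two words just happen to land side by side; with a different mask the result *may* differ. Also, you are calling .split(" ") on tweets that were joined with join(""), so the output is already a list of words. (I strongly recommend join(" ") instead; otherwise the end of one tweet fuses with the start of the next.)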
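The effect of the separator is easy to check on a toy list. A minimal sketch, using two made-up tweets rather than the asker's data:

```python
# Two made-up tweets, used only to illustrate the separator issue.
tweets = ["I live in the States", "United States of America"]

fused = "".join(tweets)    # no separator: "States" and "United" merge
spaced = " ".join(tweets)  # one space keeps the tweet boundary intact

print(fused.split(" "))
# ['I', 'live', 'in', 'the', 'StatesUnited', 'States', 'of', 'America']
print(spaced.split(" "))
# ['I', 'live', 'in', 'the', 'States', 'United', 'States', 'of', 'America']
```

With the empty separator, one occurrence of "States" is lost to the fused token "StatesUnited", which skews the counts.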
Your current code does not produce two-word phrases such as "United States". If you want to include them, you can build them from adjacent words:

phrases = [words[i]+' '+words[i+1] for i in range(0, len(words)-1)]

If you want to exclude phrases that occur only once:

from collections import Counter

# Count whole bigrams so that partial substring matches are not counted
phrase_counts = Counter(phrases)
unique_phrases = set(phrases)
repeated_phrases = [p for p in unique_phrases if phrase_counts[p] > 1]

Combined, for the input:

tweets = ["I live in the states", "Stack Overflow", "United States of America", "Stack" ,"United States", "Overflow", "State", "States", "States of America"]

The output will be (order may vary):

repeated_phrases = ['states of', 'united states', 'of america']
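To reproduce this output, the steps above can be put together in one script. This is a sketch under one assumption not stated in the answer: the tweets are only lower-cased and split on whitespace, with no stop-word or length filtering (otherwise "of" would be dropped). It counts whole bigrams with collections.Counter and sorts the result so the order is stable.

```python
from collections import Counter

tweets = ["I live in the states", "Stack Overflow", "United States of America",
          "Stack", "United States", "Overflow", "State", "States",
          "States of America"]

# Assumption: plain lower-casing and whitespace splitting, no filtering.
words = " ".join(tweets).lower().split()

# Adjacent word pairs (bigrams), including pairs that span two tweets.
phrases = [f"{a} {b}" for a, b in zip(words, words[1:])]

# Keep only bigrams seen more than once; sort for a stable order.
counts = Counter(phrases)
repeated_phrases = sorted(p for p, n in counts.items() if n > 1)
print(repeated_phrases)
# ['of america', 'states of', 'united states']
```

Bigrams that span two tweets, such as "states stack", occur only once here and are filtered out. On real data it may be safer to build the bigrams tweet by tweet.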

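Finally, if you combine words and repeated_phrases and then generate a word cloud, the output will include both "State" and "United States". You will probably want to experiment with the threshold for repeated phrases: 1 is too low for real data, but it works for my short example.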
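One caveat when combining the two lists: WordCloud.generate() tokenizes its input string again, so phrases joined into a single string would be split back into single words. A possible way around this is to count words and repeated phrases yourself and pass the counts to generate_from_frequencies(). The sketch below is an assumption about how the pieces could fit together, not the answerer's code; build_frequencies and min_count are hypothetical names, and the word list is made up.

```python
from collections import Counter

def build_frequencies(words, min_count=2):
    """Count single words plus bigrams that occur at least min_count times."""
    word_counts = Counter(words)
    bigram_counts = Counter(f"{a} {b}" for a, b in zip(words, words[1:]))
    frequencies = dict(word_counts)
    frequencies.update({p: n for p, n in bigram_counts.items() if n >= min_count})
    return frequencies

# Made-up word list for illustration.
words = "united states of america united states state".split()
frequencies = build_frequencies(words)
print(frequencies["united states"], frequencies["state"])  # 2 1

# With the wordcloud package installed, the counts can be drawn directly:
# from wordcloud import WordCloud
# wc = WordCloud(background_color="white").generate_from_frequencies(frequencies)
```

Words that belong to a kept phrase are still counted on their own, so "states" and "united states" both appear. Raising min_count drops rarer phrases.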
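Edit: the documentation mentions the collocations parameter, which generates bigrams for the given input. You can also pass words as wc.generate(" ".join(words)), which generates bigrams by default, but you will still get many visually small, meaningless bigrams such as "of states".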
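For reference, a sketch of how these options might be set. collocations and collocation_threshold are WordCloud constructor parameters; collocation_threshold assumes a reasonably recent wordcloud release, and the values below are arbitrary. The helper name make_cloud is hypothetical.

```python
def make_cloud(text, bigrams=True, threshold=30):
    """Build a word cloud, optionally letting wordcloud detect bigrams itself."""
    from wordcloud import WordCloud  # imported lazily; requires the wordcloud package

    wc = WordCloud(
        background_color="white",
        collocations=bigrams,             # False: single words only
        collocation_threshold=threshold,  # higher: fewer, stronger bigrams
    )
    return wc.generate(text)

# Usage (requires the wordcloud package and a list called words):
# wc_single = make_cloud(" ".join(words), bigrams=False)
# wc_strict = make_cloud(" ".join(words), threshold=50)
```

Setting collocations=False is the quickest way to get a cloud of single words only, which is what the question expected.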

Posted 2023-08-02
