Using wordcloud on a column of words in Python

xbp102n0  posted on 2021-07-09  in  Spark

I am working with a PySpark DataFrame that looks like this:

+-------------+-----+-----+------+
|        words|    A|    B|     C|
+-------------+-----+-----+------+
|        write|  1.0|2.083| 2.083|
|        trade|0.485|4.148| 2.012|
|        elite|0.333|5.969| 1.988|
|         mark|  0.5|3.897| 1.949|
|         quot|0.439|4.227| 1.856|
|     prostate| 0.25|7.416| 1.854|
|         maya| 0.25|7.416| 1.854|
|    lafayette|0.222|8.109|   1.8|
|       detail|  1.0|1.789| 1.789|
|        punta|  0.2|8.802|  1.76|
|scorbutically|  0.2|8.802|  1.76|
+-------------+-----+-----+------+

df.dtypes

[('words', 'string'),
 ('A', 'double'),
 ('B', 'double'),
 ('C', 'double')]

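I would like to make a word cloud of the words column, based on the values in column C: words with a higher value in column C should appear larger, reflecting how frequently they are used.
Is this possible?
Any suggestions?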

9w11ddsr (answer #1)

You can try:

import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import ChainMap
import pyspark.sql.functions as F

wordcloud = WordCloud(background_color="white")

# Build one {word: C} map per row, collect the maps to the driver,
# and merge them into a single plain dict.
words = dict(ChainMap(*df.select(F.create_map('words', 'C')).rdd.map(lambda x: x[0]).collect()))

# {'scorbutically': 1.76, 'punta': 1.76, 'detail': 1.789, 'lafayette': 1.8, 'maya': 1.854, 'prostate': 1.854, 'quot': 1.856, 'mark': 1.949, 'elite': 1.988, 'trade': 2.012, 'write': 2.083}

# Each word is drawn at a size proportional to its value in column C.
plt.imshow(wordcloud.generate_from_frequencies(words))

plt.show()
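
The word-to-weight dictionary can also be built in plain Python from (word, weight) pairs, without ChainMap. The sketch below is only an illustration: the helper name rows_to_frequencies is hypothetical, and its handling of missing values, non-positive weights, and repeated words is an assumption that neither the question nor the answer specifies.

```python
from typing import Dict, Iterable, Optional, Tuple


def rows_to_frequencies(
    rows: Iterable[Tuple[Optional[str], Optional[float]]]
) -> Dict[str, float]:
    """Turn (word, weight) pairs into a {word: weight} dict.

    Assumptions (not from the original answer): pairs with a missing word
    or a non-positive weight are skipped, and a repeated word keeps its
    largest weight.
    """
    freqs: Dict[str, float] = {}
    for word, weight in rows:
        if word is None or weight is None or weight <= 0:
            continue
        # Keep the larger weight if the word was already seen.
        freqs[word] = max(weight, freqs.get(word, weight))
    return freqs


# A few (words, C) rows copied from the question's table, as plain tuples.
sample_rows = [("write", 2.083), ("trade", 2.012), ("elite", 1.988), ("mark", 1.949)]
frequencies = rows_to_frequencies(sample_rows)
print(frequencies)
```

With a real DataFrame, the pairs could come from df.select('words', 'C').collect(), since the collected Row objects unpack like tuples, and the resulting dict can be passed to generate_from_frequencies just like the words dict above. Either way, all rows are collected to the driver, so this only suits vocabularies that fit in memory.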
