Counting distinct words in a Pandas DataFrame

7rtdyuoh · asked 2023-06-20

I have a Pandas DataFrame with a column that contains text. I want to get a list of the unique words that appear across the entire column (splitting on spaces only).

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

df = pd.DataFrame(r1, columns=['text'])

The output should look like this:

['my','nickname','is','ft.jgt','someone','going','to','place']

Getting a count as well wouldn't hurt, but it isn't required.


igetnqfo1#

Use a set to build a sequence of unique elements. First do some cleanup on df to lowercase the strings and split them:

df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to set.update to collect the unique values. Use apply to do so:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

set(['someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'])

Or, to get counts as well, use a Counter() (from the comments):

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
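
For the sample df above, the resulting counts would look like this (a sketch of the expected output; Counter display order may vary):

Counter({'is': 2, 'my': 2, 'nickname': 1, 'ft.jgt': 1, 'someone': 1, 'going': 1, 'to': 1, 'place': 1})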

kninwzqo2#

If you want to do it straight from the DataFrame:

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

df = pd.DataFrame(r1, columns=['text'])

df.text.apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64
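
Note that the top-level pd.value_counts function is deprecated in recent pandas (2.1+); an equivalent sketch using Series.value_counts instead:

df.text.apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0)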

If you want more flexible tokenization, use nltk and its tokenize module, as sketched below.
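
A minimal sketch (assuming nltk is installed along with its 'punkt' tokenizer data, via nltk.download('punkt')):

from collections import Counter
from nltk import word_tokenize

# word_tokenize splits punctuation off words, unlike str.split
Counter(word_tokenize(" ".join(df['text'].str.lower())))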


n9vozmp43#

Use collections.Counter:

>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
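
Note that this is case-sensitive, so 'My' and 'my' are counted separately. A minimal variant that lowercases first, to match the output asked for in the question (display order may vary):

>>> Counter(" ".join(r1).lower().split(" "))
Counter({'my': 2, 'is': 2, 'nickname': 1, 'ft.jgt': 1, 'someone': 1, 'going': 1, 'to': 1, 'place': 1})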

1tu0hz3e4#

Building on @Ofir Israel's answer, specific to Pandas:

from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result

will give you what you want. It converts the text column's Series values into a list, splits on spaces, and counts the occurrences.


0kjbasz65#

uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)
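
For the sample r1 this yields the eight words from the question, in arbitrary order since sets are unordered:

print(uniqueWords)  # e.g. ['my', 'nickname', 'is', 'ft.jgt', 'someone', 'going', 'to', 'place']
print(count)        # 8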

pcww981p6#

Here are timings of three of the proposed solutions (skipping the conversion to a list) on a 92816-row DataFrame:

from collections import Counter
results = set()

%timeit -n 10 set(" ".join(df['description'].values.tolist()).lower().split(" "))

323 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df['description'].str.lower().str.split(" ").apply(results.update)

316 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 Counter(" ".join(df['description'].str.lower().values.tolist()).split(" "))

365 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

len(list(set(" ".join(df['description'].values.tolist()).lower().split(" "))))

13561

len(results)

13561

len(Counter(" ".join(df['description'].str.lower().values.tolist()).split(" ")).items())

13561
I also tried the pandas unique approach, but it took far longer and used more than 25 GB of RAM, sending my 32 GB laptop into swap. All the others were fast. I would use solution 1 as a one-liner, or solution 3 if word counts are needed.


yfwxisqw7#

TL;DR

Use collections.Counter to get the counts of unique words in a DataFrame column (excluding stopwords).

Given:

$ cat test.csv 
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces $144.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...

Code:

from collections import Counter
from string import punctuation

import pandas as pd

from nltk.corpus import stopwords
from nltk import word_tokenize

stoplist = set(stopwords.words('english') + list(punctuation))

df = pd.read_csv("test.csv", sep='\t')

texts = df['Description'].str.lower()

word_counts = Counter(word_tokenize('\n'.join(texts)))

word_counts.most_common()

[out]:

[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]
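
Note that stoplist is defined in the code above but never applied, which is why tokens such as '&', '$', and ',' still appear in the output. A minimal sketch that actually filters them out:

word_counts = Counter(w for w in word_tokenize('\n'.join(texts)) if w not in stoplist)
word_counts.most_common()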

bzzcjhmw8#

I haven't seen this approach here yet. It is pure pandas, using explode() (pd.Series.explode() here). explode turns each element of a list into its own row that shares the index of the original row.

# Get all unique words
df['text'].str.split().explode().unique()

# Get all unique words with frequency counts
df['text'].str.split().explode().value_counts()

r1 = ['My nickname is ft.jgt','Someone is going to my place']

df = pd.DataFrame(r1,columns=['text'])
>>> df['text'].str.split().explode().value_counts()
text
is          2
My          1
nickname    1
ft.jgt      1
Someone     1
going       1
to          1
my          1
place       1
Name: count, dtype: int64
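
To match the lowercase output asked for in the question, the same approach composes with str.lower(); a minimal sketch:

df['text'].str.lower().str.split().explode().unique().tolist()
# ['my', 'nickname', 'is', 'ft.jgt', 'someone', 'going', 'to', 'place']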

vlf7wbxs9#

If the DataFrame has columns 'a', 'b', 'c', and so on, and you want to count the distinct words in each column, you can use

Counter(dataframe['a']).items()
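
Note that Counter(dataframe['a']) counts distinct cell values rather than distinct words. If each column holds text and you want per-column word counts, a minimal sketch along the lines of the earlier answers (the column names are placeholders):

from collections import Counter

for col in ['a', 'b', 'c']:
    print(col, Counter(" ".join(dataframe[col]).split()))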
