Counting distinct words in a Pandas DataFrame

7rtdyuoh asked on 2023-06-20 · 9 answers

I have a Pandas DataFrame in which one column contains text. I want to get a list of the unique words appearing across the whole column (whitespace is the only delimiter).

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

df = pd.DataFrame(r1, columns=['text'])

The output should look like this:

['my','nickname','is','ft.jgt','someone','going','to','place']

Getting a count as well wouldn't hurt, but it's not required.


igetnqfo 1#

Use set to create a sequence of unique elements.
Do some cleanup on df first to get lowercase strings and split them:

df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to set.update to collect the unique values. Use apply to do so:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

{'someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'}

Or use a Counter() instead, as suggested in the comments:

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)
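
For the sample data above, this prints the merged counts (Counter's repr lists the most common entries first):

Counter({'my': 2, 'is': 2, 'nickname': 1, 'ft.jgt': 1, 'someone': 1, 'going': 1, 'to': 1, 'place': 1})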

kninwzqo 2#

If you want to do it straight from the DataFrame:

import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

df = pd.DataFrame(r1, columns=['text'])

df.text.apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)

My          1
Someone     1
ft.jgt      1
going       1
is          2
my          1
nickname    1
place       1
to          1
dtype: float64
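
Note: recent pandas releases deprecate the top-level pd.value_counts; the same count-then-sum logic works with the Series.value_counts method (a sketch, using the df from above):

# Equivalent using Series.value_counts instead of the deprecated
# top-level pd.value_counts
df.text.apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0)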

If you want more flexible tokenization, use nltk and its tokenize module.
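
A minimal sketch of that nltk route (assumes nltk is installed and its punkt tokenizer data has been downloaded):

from collections import Counter
from nltk import word_tokenize  # pip install nltk; nltk.download('punkt')

# Tokenize each row with nltk instead of a plain whitespace split,
# then merge the per-row token lists into one Counter
counts = Counter()
for line in df['text'].str.lower():
    counts.update(word_tokenize(line))
print(counts.most_common())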


n9vozmp4 3#

Use collections.Counter:

>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
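
Note that this counts 'My' and 'my' separately; lowercasing first gives the output the question asked for:

>>> Counter(" ".join(r1).lower().split(" "))
Counter({'my': 2, 'is': 2, 'nickname': 1, 'ft.jgt': 1, 'someone': 1, 'going': 1, 'to': 1, 'place': 1})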

1tu0hz3e 4#

Building on @Ofir Israel's answer, specific to pandas:

from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result

This gives you what you want: it converts the text column's values to a list, joins and splits on spaces, and counts the occurrences.


0kjbasz6 5#

uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)
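
With the r1 from the question, this yields the eight expected words (set order is arbitrary):

print(uniqueWords)  # e.g. ['my', 'nickname', 'is', 'ft.jgt', 'someone', 'going', 'to', 'place']
print(count)        # 8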

pcww981p 6#

Here are the timings of three of the proposed solutions (skipping the conversion to a list) on a 92,816-row DataFrame:

from collections import Counter
results = set()

%timeit -n 10 set(" ".join(df['description'].values.tolist()).lower().split(" "))

323 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df['description'].str.lower().str.split(" ").apply(results.update)

316 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 Counter(" ".join(df['description'].str.lower().values.tolist()).split(" "))

365 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

len(list(set(" ".join(df['description'].values.tolist()).lower().split(" "))))

13561

len(results)

13561

len(Counter(" ".join(df['description'].str.lower().values.tolist()).split(" ")).items())

13561
I also tried the pandas unique approach, but it took far longer and used more than 25 GB of RAM, sending my 32 GB laptop into swap.
All the others were fast. I would use solution 1 as a one-liner, or solution 3 if word counts are needed.


yfwxisqw 7#

TL;DR

Use collections.Counter to get the counts of the unique words in a DataFrame column (excluding stopwords).
Given:

$ cat test.csv 
Description
crazy mind california medical service data base...
california licensed producer recreational & medic...
silicon valley data clients live beyond status...
mycrazynotes inc. announces $144.6 million expans...
leading provider sustainable energy company prod ...
livefreecompany founded 2005, listed new york stock...

Code:

from collections import Counter
from string import punctuation

import pandas as pd

from nltk.corpus import stopwords
from nltk import word_tokenize

stoplist = set(stopwords.words('english') + list(punctuation))

df = pd.read_csv("test.csv", sep='\t')

texts = df['Description'].str.lower()

word_counts = Counter(word_tokenize('\n'.join(texts)))

word_counts.most_common()

[out]:

[('...', 6), ('california', 2), ('data', 2), ('crazy', 1), ('mind', 1), ('medical', 1), ('service', 1), ('base', 1), ('licensed', 1), ('producer', 1), ('recreational', 1), ('&', 1), ('medic', 1), ('silicon', 1), ('valley', 1), ('clients', 1), ('live', 1), ('beyond', 1), ('status', 1), ('mycrazynotes', 1), ('inc.', 1), ('announces', 1), ('$', 1), ('144.6', 1), ('million', 1), ('expans', 1), ('leading', 1), ('provider', 1), ('sustainable', 1), ('energy', 1), ('company', 1), ('prod', 1), ('livefreecompany', 1), ('founded', 1), ('2005', 1), (',', 1), ('listed', 1), ('new', 1), ('york', 1), ('stock', 1)]
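
Note that stoplist is built but never applied above, which is why tokens such as '&', '$' and ',' appear in the output; filtering against it drops the English stopwords and single-character punctuation (a sketch):

# Keep only tokens that are not in the stopword/punctuation set
filtered = Counter(tok for tok in word_tokenize('\n'.join(texts)) if tok not in stoplist)
filtered.most_common()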

bzzcjhmw 8#

I haven't seen this approach here yet: it's pure pandas, using pd.DataFrame.explode(). explode turns each element of a list into its own row, sharing the index of the original row.

# Get all unique words
df['text'].str.split().explode().unique()

# Get all unique words with frequency counts
df['text'].str.split().explode().value_counts()
For example, with the question's data:

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])
df['text'].str.split().explode().value_counts()
text
is          2
My          1
nickname    1
ft.jgt      1
Someone     1
going       1
to          1
my          1
place       1
Name: count, dtype: int64
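
To match the lowercase output the question asked for, lowercase before splitting so that 'My' and 'my' collapse into one token:

# Same explode pattern, normalized to lowercase first
df['text'].str.lower().str.split().explode().value_counts()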

vlf7wbxs 9#

If the DataFrame has columns 'a', 'b', 'c', etc., and you want to count the distinct words in each column, you can use

Counter(dataframe['a']).items()
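
Note that this counts distinct cell values in column 'a', not words inside each cell. If each column holds free text, splitting first gives per-column word counts (a sketch; the column names are hypothetical):

from collections import Counter

# Per-column word counts, assuming whitespace-separated text in each cell
per_column = {col: Counter(" ".join(dataframe[col].astype(str)).lower().split())
              for col in ['a', 'b', 'c']}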
