pandas: count the number of words per row

Posted by icnyk63a on 2023-04-04 in Other


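I'm trying to create a new column in a DataFrame that contains the word count for the respective row. I'm looking for the total number of words, not the frequency of each distinct word. I assumed there would be a simple, quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I'm stuck. I've tried the solutions put forward in the linked SO posts, but I get lots of attribute errors back.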

words = df['col'].split()
df['totalwords'] = len(words)

results in

AttributeError: 'Series' object has no attribute 'split'

and

f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)

results in

AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')

pandas

Source: https://stackoverflow.com/questions/49984905/count-number-of-words-per-row

6 Answers


nwnhqdif1#

`str.split` + `str.len`

str.len works nicely for any non-numeric column.

df['totalwords'] = df['col'].str.split().str.len()
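As a minimal sketch of how this behaves, the snippet below runs the same expression on a small made-up frame that includes a missing value. The sample data and the `totalwords_int` column name are assumptions for illustration only. A missing entry comes through as NaN, so filling and casting is one way to get an integer column.

```python
import pandas as pd

# Hypothetical sample data, including one missing value.
df = pd.DataFrame({"col": ["This is one sentence", "and another", None]})

# str.split() with no arguments splits on runs of whitespace;
# str.len() then returns the length of each resulting list.
df["totalwords"] = df["col"].str.split().str.len()

# The missing row is NaN here; fill it and cast if integers are wanted.
df["totalwords_int"] = df["totalwords"].fillna(0).astype(int)

print(df)
```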

`str.count`

If your words are separated by single spaces, you can simply count the spaces and add 1.

df['totalwords'] = df['col'].str.count(' ') + 1
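The single-space assumption matters. The sketch below, on made-up strings, compares the space-counting approach with `str.split` to show where the two disagree: repeated spaces, leading or trailing spaces, and empty strings.

```python
import pandas as pd

# Hypothetical inputs: clean, double space, padded, and empty.
s = pd.Series(["one two", "one  two", " padded ", ""])

by_spaces = s.str.count(" ") + 1      # counts spaces, then adds 1
by_split = s.str.split().str.len()    # counts whitespace-separated tokens

print(pd.DataFrame({"text": s, "by_spaces": by_spaces, "by_split": by_split}))
```

Only the first row agrees, so `str.count(' ') + 1` is best kept for data that is known to be normalized.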

List comprehension

This is faster than you might think!

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
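One caveat: the comprehension calls `x.split()` on every element, so it raises an error if the column contains missing values. A hedged variant, assuming that non-string entries should count as 0 words (a choice made here for illustration, not something the answer specifies):

```python
import pandas as pd

# Hypothetical sample data with a missing value in the middle.
df = pd.DataFrame({"col": ["This is one sentence", None, "and another"]})

# Count words only for real strings; None/NaN entries are treated as 0 words.
df["totalwords"] = [
    len(x.split()) if isinstance(x, str) else 0
    for x in df["col"].tolist()
]

print(df)
```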


olqngx592#

Here is a way to do it using .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

Example

Given this df:

>>> df
                    col
0  This is one sentence
1           and another

After applying .apply():

df['number_of_words'] = df.col.apply(lambda x: len(x.split()))

>>> df
                    col  number_of_words
0  This is one sentence                4
1           and another                2

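Note: as pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed matters, it is better to use one of @c s's methods.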


cxfofazt3#

Here is one way, using pd.Series.str.split and pd.Series.map:

df['word_count'] = df['col'].str.split().map(len)

The above assumes that df['col'] is a Series of strings.
Example:

df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})

df['word_count'] = df['col'].str.split().map(len)

print(df)

#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2


omtl5h9j4#

Using list and map, with the data from cold:

list(map(lambda x : len(x.split()),df.col))
Out[343]: [4, 3, 2]


wtzytmuj5#

You can also map the split and len methods onto the strings in the DataFrame column:

df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]

Here are some preliminary benchmarks of the answers given here. map seems to do well on a very large Series:

df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 
                   'one banana', 'fruits']*100000, 
                  columns=['col'])
>>> df.shape
(600000, 1)

>>> %timeit df['word_count'] = df['col'].str.split().str.len()
761 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = df['col'].str.count(' ').add(1)
691 ms ± 71.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = [len(x.split()) for x in df['col'].tolist()]
405 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = df['col'].apply(lambda x: len(x.split()))
450 ms ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = df['col'].str.split().map(len)
657 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = list(map(lambda x : len(x.split()), df['col'].tolist()))
435 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
329 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
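The `%timeit` lines above need IPython. As a rough sketch of how a similar comparison could be run as a plain script, the snippet below uses the standard `timeit` module on a smaller made-up frame. The `candidates` and `results` names are assumptions for illustration, and the timings it prints will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Smaller than the 600,000-row frame above so the sketch runs quickly.
df = pd.DataFrame({"col": ["one apple", "banana", "box of oranges"] * 10_000})

# A few of the approaches from the answers, wrapped as zero-argument callables.
candidates = {
    "str.split + str.len": lambda: df["col"].str.split().str.len(),
    "list comprehension": lambda: [len(x.split()) for x in df["col"].tolist()],
    "nested map": lambda: [*map(len, map(str.split, df["col"].tolist()))],
}

# Sanity check: every approach should produce the same counts.
results = {name: list(fn()) for name, fn in candidates.items()}
assert all(r == results["nested map"] for r in results.values())

for name, fn in candidates.items():
    # Best of 3 repeats, 3 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f"{name:22s} {seconds * 1000:8.2f} ms per run")
```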


w80xi6nr6#

You can use a simple regular expression with the str.count() method built into pandas:

df['total_words'] = df['col'].str.count(r'\w+')

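The \w character class matches any word character: a letter, a digit, or an underscore. For ASCII text it is equivalent to the character range [A-Za-z0-9_].
The + symbol means one or more repetitions.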

If you want words to consist of letters only, use the following regular expression instead:

df['total_words'] = df['col'].str.count('[A-Za-z]+')
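The regex patterns and whitespace splitting can disagree on what a "word" is. The sketch below, on made-up strings, puts the three counts side by side. Hyphens and apostrophes break `\w+` matches, while digits and underscores break `[A-Za-z]+` matches.

```python
import pandas as pd

# Hypothetical inputs with a hyphen, apostrophes, a digit, and an underscore.
s = pd.Series(["well-known fact", "it's 3 o'clock", "snake_case name"])

by_split = s.str.split().str.len()        # whitespace-separated tokens
by_word_chars = s.str.count(r"\w+")       # runs of letters, digits, underscores
by_letters = s.str.count(r"[A-Za-z]+")    # runs of ASCII letters only

print(pd.DataFrame({
    "text": s,
    "split": by_split,
    r"\w+": by_word_chars,
    "[A-Za-z]+": by_letters,
}))
```

Which count is "right" depends on the data, so it is worth checking a few representative rows before picking a pattern.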
