Counting the length of character blocks in text, skipping the first block, using Python or pandas [closed]

Posted by ia2d9nvy on 2023-06-20 in Python

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed 10 days ago.

dataframe                        col1
 index0   '......kkkkkkk......kkkkkkkkk.....kkkkkkkkkkkk'
 index1   '......kkkkkkkkk.........kkkkkkk.......kkkkkkkk...

     
  result        1st block of 'k'      2nd block of '.'
 index0             7                         6
 index1             9                         9

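I have text like this, and I want to count the length of the first block of 'k' and the length of the second block of '.'. For index0 that gives 7 and 6; for index1 it gives 9 and 9.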

vdzxcuhz #1

Updated question (Python only)

You can use a simple list comprehension with re.finditer:

import re

a = 'ooooooookkkkkkkoooookkkkkkkkkooooooooookkkkkkkkkkkk'

out = [len(m.group()) for m in re.finditer(r'k+', a)]

Or with itertools.groupby:

from itertools import groupby

out = [len(list(g)) for k, g in groupby(a, key=lambda x: x == 'k') if k]

Output:

[7, 9, 12]
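
Both snippets above only measure the 'k' blocks, while the question also asks for the second block of '.'. A minimal sketch of one way to cover both, assuming that a "block" means a maximal run of one repeated character: the helper names `run_lengths` and `nth_run_length` are hypothetical, and `row0` is built to match the block sizes of the question's index0 row.

```python
from itertools import groupby

def run_lengths(text):
    """Return (character, run length) pairs for each block of repeated characters."""
    return [(char, sum(1 for _ in group)) for char, group in groupby(text)]

def nth_run_length(text, char, n):
    """Length of the n-th (1-based) block of `char`, or 0 if there is no such block."""
    lengths = [size for c, size in run_lengths(text) if c == char]
    return lengths[n - 1] if 0 < n <= len(lengths) else 0

# Same block sizes as the question's index0 row: 6 dots, 7 k, 6 dots, 9 k, 5 dots, 12 k
row0 = '.' * 6 + 'k' * 7 + '.' * 6 + 'k' * 9 + '.' * 5 + 'k' * 12

first_k = nth_run_length(row0, 'k', 1)     # first block of 'k'
second_dot = nth_run_length(row0, '.', 2)  # second block of '.'
print(first_k, second_dot)  # 7 6
```

Returning 0 for a missing block is a design choice made here for illustration; raising an error or returning None may suit other data better.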

Original question (pandas)

Assuming an input like this:

col
0   ......kkkkkkk.........kkkkkkkkk.............kkkkkkkkkkkk
1      ......kkk..kkkk............kkkkkkkkk.........kkkkkkkk

You can use str.extractall and str.len, then optionally unstack:

df['col'].str.extractall('(k+)')[0].str.len().unstack('match', fill_value=0)

Output:

match  0  1   2  3
0      7  9  12  0
1      3  4   9  8
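
The same extractall pattern can probably be reused for the '.' blocks to get the two columns the question asks for. The following is a sketch under assumptions rather than a definitive solution: `block_lengths` is a hypothetical helper, the column names `first_k` and `second_dot` are made up for illustration, and the rows only mimic the input above (the 'k' block sizes match the output shown, while the dot counts are illustrative).

```python
import pandas as pd

# Rows modelled on the input above; dot counts are illustrative
row0 = '.' * 6 + 'k' * 7 + '.' * 9 + 'k' * 9 + '.' * 13 + 'k' * 12
row1 = '.' * 6 + 'k' * 3 + '.' * 2 + 'k' * 4 + '.' * 12 + 'k' * 9 + '.' * 9 + 'k' * 8
df = pd.DataFrame({'col': [row0, row1]})

def block_lengths(series, char_class):
    """One column per match number; rows with fewer blocks are padded with 0."""
    pattern = f'({char_class}+)'
    return (series.str.extractall(pattern)[0]
                  .str.len()
                  .unstack('match', fill_value=0))

k_blocks = block_lengths(df['col'], 'k')
dot_blocks = block_lengths(df['col'], r'\.')  # '.' must be escaped in a regex

result = pd.DataFrame({
    'first_k': k_blocks[0],       # match 0 = first block of 'k'
    'second_dot': dot_blocks[1],  # match 1 = second block of '.'
})
print(result)
```

Because of `fill_value=0`, a row that has no second block of '.' would show 0 in `second_dot` rather than NaN.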

Alternatively:

import re
df['sizes'] = [[len(m.group()) for m in re.finditer(r'k+', s)]
               for s in df['col']]

Output:

col         sizes
0   ......kkkkkkk.........kkkkkkkkk.............k...    [7, 9, 12]
1   ......kkk..kkkk............kkkkkkkkk............  [3, 4, 9, 8]
