pandas: extract consecutive rows with similar values in a column when the run is longer than a specific patch size

nzkunb0c  posted on 2023-10-14  in Other

I want to extract runs of consecutive rows in which a given value repeats 5 or more times in a row.
For example:

A         B       C
10        john    1
12        paul    1
23        kishan  1
12        teja    1
12        zebo    1
324       vauh    -1
3434      krish   -1
232       poo     -1
4535      zoo     1
4343      doo     1
342       foo     -1
123       soo     1
121       koo     -1
34        loo     -1
343454    moo     -1
565343    noo     -1
2323234   voo     -1
3434      coo     1
545       xoo     1
6565      zoo     1
232321    qoo     1
34454     woo     1
546556    eoo     1
65665     roo     -1
5343      too     -1
3232      yoo     1
1212      uoo     1
23355667  ioo     1
787878    joo     -1

I am looking for the result below, where each run of more than 4 consecutive 1s in column 'C' is labeled as a separate group.
Output:

A       B       C  group
10      john    1  1
12      paul    1  1
23      kishan  1  1
12      teja    1  1
12      zebo    1  1
3434    coo     1  2
545     xoo     1  2
6565    zoo     1  2
232321  qoo     1  2
34454   woo     1  2
546556  eoo     1  2

tktrz96b1#

Use a boolean mask and factorize:

import pandas as pd

# df is the DataFrame from the question
# identify 1s
m = df['C'].eq(1)
# group consecutive values
g = m.ne(m.shift()).cumsum()
# identify stretches of 5+ 1s
m2 = m & df.groupby(g)['C'].transform('size').ge(5)

out = (df.loc[m2]
         .assign(group=pd.factorize(g[m2])[0]+1)
       )

Output:

         A       B  C  group
0       10    john  1      1
1       12    paul  1      1
2       23  kishan  1      1
3       12    teja  1      1
4       12    zebo  1      1
17    3434     coo  1      2
18     545     xoo  1      2
19    6565     zoo  1      2
20  232321     qoo  1      2
21   34454     woo  1      2
22  546556     eoo  1      2
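
To try the approach above end to end, here is a minimal self-contained sketch. It rebuilds only column C from the question's sample data (leaving out A and B is a simplification made here, not part of the answer) and then applies the same mask and factorize steps.

```python
import pandas as pd

# Column C from the question's 29 sample rows (A and B omitted for brevity).
c = [1]*5 + [-1]*3 + [1]*2 + [-1] + [1] + [-1]*5 + [1]*6 + [-1]*2 + [1]*3 + [-1]
df = pd.DataFrame({'C': c})

# True where C == 1
m = df['C'].eq(1)
# A new run id starts whenever the mask value changes
g = m.ne(m.shift()).cumsum()
# Keep rows that are 1 AND belong to a run of at least 5 rows
m2 = m & df.groupby(g)['C'].transform('size').ge(5)

# factorize renumbers the surviving run ids as 1, 2, ...
out = df.loc[m2].assign(group=pd.factorize(g[m2])[0] + 1)
print(out)
```

Note that the run of five -1 values (rows 12 to 16) is also 5 long, which is why the size check is combined with the mask m instead of being used alone.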

d6kp6zgx2#

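You can take the diff of column C, start a new group wherever the diff is non-zero by taking the cumsum, and then transform each group to its size, so that you keep only the groups of size 5 or more: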

df[df['C'].eq(1) & df.groupby(df['C'].diff().ne(0).cumsum()).transform('size').gt(4)]

         A       B  C
0       10    john  1
1       12    paul  1
2       23  kishan  1
3       12    teja  1
4       12    zebo  1
17    3434     coo  1
18     545     xoo  1
19    6565     zoo  1
20  232321     qoo  1
21   34454     woo  1
22  546556     eoo  1
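
To make the intermediate steps of the diff and cumsum trick visible, the following sketch runs them on a short made-up Series (the values are illustrative, not taken from the question).

```python
import pandas as pd

# Small illustrative series: a run of two 1s, three -1s, then a single 1.
s = pd.Series([1, 1, -1, -1, -1, 1])

# diff() is non-zero (or NaN on the first row) exactly where a new run starts,
# so the cumulative sum of that flag gives each run its own id.
run_id = s.diff().ne(0).cumsum()

# Broadcast the length of each run back onto its rows.
run_size = s.groupby(run_id).transform('size')

print(run_id.tolist())    # [1, 1, 2, 2, 2, 3]
print(run_size.tolist())  # [2, 2, 3, 3, 3, 1]
```

diff() needs numeric values. For a text column, comparing each value with its shift(), as in the first answer, is one alternative that should behave the same way.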

If you want a group column (the labels are the raw run ids, so they are not consecutive):

# create the groups by calculating the diff and getting the cumsum
df['group'] = df['C'].diff().ne(0).cumsum()
# boolean indexing to keep values where C == 1 AND the size of each group is greater than 4
df[df['C'].eq(1) & df.groupby('group').transform('size').gt(4)]

         A       B  C  group
0       10    john  1      1
1       12    paul  1      1
2       23  kishan  1      1
3       12    teja  1      1
4       12    zebo  1      1
17    3434     coo  1      7
18     545     xoo  1      7
19    6565     zoo  1      7
20  232321     qoo  1      7
21   34454     woo  1      7
22  546556     eoo  1      7
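
If consecutive labels (1, 2, ...) are preferred over the raw run ids (1 and 7 here), one possible sketch is to wrap the steps in a small helper and renumber with pd.factorize. The function name extract_runs and its parameters are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def extract_runs(df, col, value, min_run):
    """Hypothetical helper: keep rows in runs of `value` at least `min_run` long."""
    is_value = df[col].eq(value)
    # New run id whenever the mask flips between True and False
    run_id = is_value.ne(is_value.shift()).cumsum()
    # Length of the run each row belongs to
    run_size = run_id.groupby(run_id).transform('size')
    keep = is_value & run_size.ge(min_run)
    # Renumber the surviving runs as 1, 2, ... in order of appearance
    group = pd.factorize(run_id[keep])[0] + 1
    return df.loc[keep].assign(group=group)

# Column C from the question's sample data (A and B omitted for brevity).
c = [1]*5 + [-1]*3 + [1]*2 + [-1] + [1] + [-1]*5 + [1]*6 + [-1]*2 + [1]*3 + [-1]
df = pd.DataFrame({'C': c})

out = extract_runs(df, 'C', 1, 5)
print(out['group'].unique().tolist())  # [1, 2]
```

Unlike the snippet above, this version does not add a column to the original df; it returns a filtered copy with the group column attached.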
