pandas: How to merge multiple DataFrame columns using a regular expression?

fnvucqvd  posted on 2023-02-07  in Other

I have a df that looks like this:

import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

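What I want to do here is merge these columns, following two rules:
1. If a column name, after removing _C{0~9}, _C{0~9}{0~9} or _C{0~9}{0~9}{0~9}, equals another column name treated the same way, the two columns can be merged.
Take number_C1_E1, number_C2_E2 and number_C3_E1 as an example: number_C1_E1 and number_C3_E1 can be merged, because both become number_E1 once _C{0~9} is removed.
2. The merged column should drop the None values.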
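
To make the two rules concrete, here is a minimal plain-Python sketch (no pandas) of how they could be read. The helper names `merge_key` and `first_not_none` are hypothetical, and the pattern `_C\d+` is an assumption that accepts any number of digits rather than only one to three.

```python
import re
from collections import defaultdict

columns = [
    "number_C1_E1", "fruit_C11_E1", "name_C111_E1",
    "number_C2_E2", "fruit_C22_E2", "name_C222_E2",
    "number_C3_E1", "fruit_C33_E1", "name_C333_E1",
]

def merge_key(column):
    """Rule 1: drop the `_C<digits>` part, e.g. number_C3_E1 -> number_E1."""
    return re.sub(r"_C\d+", "", column)

# Columns that share a key are candidates for merging.
groups = defaultdict(list)
for col in columns:
    groups[merge_key(col)].append(col)

def first_not_none(values):
    """Rule 2: keep the first value that is not None (None if all are)."""
    return next((v for v in values if v is not None), None)

# The "number" cells of row 3 in the sample data.
row = {"number_C1_E1": None, "number_C2_E2": None, "number_C3_E1": "4"}
merged = {
    key: first_not_none(row[c] for c in cols)
    for key, cols in groups.items()
    if key.startswith("number")
}
```

Here `groups["number_E1"]` holds `number_C1_E1` and `number_C3_E1`, while `number_C2_E2` stays alone under `number_E2`.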
The desired result is:

  number_C1_1_E1 fruit_C11_1_E1 name_C111_1_E1 number_C2_1_E2 fruit_C22_1_E2 name_C222_1_E2
0              1          apple            tom           None           None           None
1              2         banana          jerry           None           None           None
2              3      blueberry        anthony              3      blueberry        anthony
3              4     strawberry          terry           None           None           None
4              5     watermelon           paul           None           None           None
5              6          peach         edward           None           None           None
6              7         orange         reggie           None           None           None
7              8          lemon       nicholas           None           None           None

Does anyone have a good solution?

juzqafwq #1

Use the same approach as in your previous question, but also compute a renamer for the columns:

group = df.columns.str.replace(r'_C\d+', '', regex=True)

names = df.columns.to_series().groupby(group).first()

out = (df.groupby(group, axis=1, sort=False).first()
         .rename(columns=names)
       )

Alternative:

group = df.columns.str.replace(r'_C\d+', '', regex=True)

out = (df.groupby(group, axis=1, sort=False).first()
         .set_axis(df.columns[~group.duplicated()], axis=1)
       )

Output:

  number_C1_E1 fruit_C11_E1 name_C111_E1 number_C2_E2 fruit_C22_E2 name_C222_E2
0            1        apple          tom         None         None         None
1            2       banana        jerry         None         None         None
2            3    blueberry      anthony            3    blueberry      anthony
3            4   strawberry        terry         None         None         None
4            5   watermelon         paul         None         None         None
5            6        peach       edward         None         None         None
6            7       orange       reggie         None         None         None
7            8        lemon     nicholas         None         None         None
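
Both snippets above rely on `groupby(..., axis=1)`, which, as far as I know, is deprecated in recent pandas releases (worth checking against your installed version). A minimal sketch of the same idea that transposes the frame and groups its rows instead; this is a variation, not the answer's original code, and it assumes `first()` skips missing values as it does by default:

```python
import pandas as pd

df = pd.DataFrame(
    {'number_C1_E1': ['1', '2', None, None, '5', '6', '7', '8'],
     'fruit_C11_E1': ['apple', 'banana', None, None, 'watermelon', 'peach', 'orange', 'lemon'],
     'name_C111_E1': ['tom', 'jerry', None, None, 'paul', 'edward', 'reggie', 'nicholas'],
     'number_C2_E2': [None, None, '3', None, None, None, None, None],
     'fruit_C22_E2': [None, None, 'blueberry', None, None, None, None, None],
     'name_C222_E2': [None, None, 'anthony', None, None, None, None, None],
     'number_C3_E1': [None, None, '3', '4', None, None, None, None],
     'fruit_C33_E1': [None, None, 'blueberry', 'strawberry', None, None, None, None],
     'name_C333_E1': [None, None, 'anthony', 'terry', None, None, None, None],
     }
)

# Grouping key: the column name with the `_C<digits>` part removed.
group = df.columns.str.replace(r'_C\d+', '', regex=True)

# First original column name seen for each key, in order of appearance.
names = df.columns[~group.duplicated()]

# Transpose so the columns become rows, take the first non-missing
# value per key, then transpose back.
out = df.T.groupby(group.to_numpy(), sort=False).first().T
out.columns = names
```

Transposing makes every column object-typed, which is harmless for this all-string sample but may matter for frames with mixed dtypes.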
umuewwlo #2

This builds on @mozway's idea and tries to avoid the double groupby:

box = df.columns.str.split(r"C\d+", expand=True)
box = df.groupby(box, axis=1)
columns = [grp.columns[0] for _, grp in box]
frame = box.first()
frame.columns = columns
frame
  fruit_C11_E1 fruit_C22_E2 name_C111_E1 name_C222_E2 number_C1_E1 number_C2_E2
0        apple         None          tom         None            1         None
1       banana         None        jerry         None            2         None
2    blueberry    blueberry      anthony      anthony            3            3
3   strawberry         None        terry         None            4         None
4   watermelon         None         paul         None            5         None
5        peach         None       edward         None            6         None
6       orange         None       reggie         None            7         None
7        lemon         None     nicholas         None            8         None
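
Note that the columns in this output come back in alphabetical order, unlike the first answer, which passes `sort=False`. A small plain-Python sketch of why, assuming the split pieces act as the grouping key and that keys are sorted by default; the variable names are illustrative only:

```python
import re

columns = [
    "number_C1_E1", "fruit_C11_E1", "name_C111_E1",
    "number_C2_E2", "fruit_C22_E2", "name_C222_E2",
    "number_C3_E1", "fruit_C33_E1", "name_C333_E1",
]

# Splitting on `C<digits>` leaves the pieces around it, e.g.
# "number_C1_E1" -> ("number_", "_E1"); that tuple is the group key.
first_seen = {}
for col in columns:
    key = tuple(re.split(r"C\d+", col))
    first_seen.setdefault(key, col)  # remember the first column per key

# Sorted keys reproduce the column order shown in the output above.
sorted_names = [first_seen[key] for key in sorted(first_seen)]

# Insertion order corresponds to what sort=False would keep.
original_order = list(first_seen.values())
```

If the original column order matters, adding `sort=False` to the `groupby` call, as in the first answer, should preserve it.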
