pandas 函数来识别重复的Python列名并添加特定的后缀

mutmk8jj  于 2023-06-28  发布在  Python
关注(0)|答案(2)|浏览(130)

我有几个 Dataframe 与某些重复的列名(他们来自Excel文件)。我的数据看起来有点像这样。

original_df= pd.DataFrame({
    'ID': [True, False, True],
    'Revenue (USDm)': [1000, 2000, 1500],
    'Location': ['London', 'New York', 'Paris'],
    'Year': [2021, 2022, 2023],
    'Sold Products': [10, 20, 30],
    'Leased Products': [5, 10, 15],
    'Investments': [7, 12, 8],
    'Sold Products.1': [15, 25, 35],
    'Leased Products.1': [8, 12, 16],
    'Investments.1': [6, 9, 11],
    'Sold Products.2': [5, 10, 15],
    'Leased Products.2': [2, 5, 8],
    'Investments.2': [3, 7, 4],
    'QC Completed?': [True, True, False],
})

当我读取df时,pandas会自动将.1.2后缀添加到重复的列名中。我尝试编写一个函数来识别重复项,并从我提供的列表中添加一组新的后缀,同时删除适用的.1.2
新的后缀列表是suffixes = ['Vehicles','Electronics','Real Estate']
输出应该如下所示:

desired_output = pd.DataFrame({
    'ID': [True, False, True],
    'Revenue (USDm)': [1000, 2000, 1500],
    'Location': ['London', 'New York', 'Paris'],
    'Year': [2021, 2022, 2023],
    'Sold Products - Vehicles': [10, 20, 30],
    'Leased Products - Vehicles': [5, 10, 15],
    'Investments - Vehicles': [7, 12, 8],
    'Sold Products - Electronics': [15, 25, 35],
    'Leased Products - Electronics': [8, 12, 16],
    'Investments - Electronics': [6, 9, 11],
    'Sold Products - Real Estate': [5, 10, 15],
    'Leased Products - Real Estate': [2, 5, 8],
    'Investments - Real Estate': [3, 7, 4],
    'QC Completed?': [True, True, False],
})

没有重复的列名称应保持不变,但重复的列按顺序添加后缀;如果它们也有.1.2后缀,则会被删除。
我的功能如下:

def change_colnames(df, suffixes):
    new_columns = []
    seen_columns = {}

    for column in df.columns:
        match = re.match(r'^(.*?)(?:\.\d+)?$', column)  # Match the base column name and optional suffix
        base_column = match.group(1) if match else column  # Get the base column name or keep the original column name

        if base_column in seen_columns:
            idx = seen_columns[base_column]  # Get the index of the base column
            new_column = f"{base_column} {suffixes[idx]}"  # Append the new suffix
            seen_columns[base_column] += 1  # Increment the index for the next occurrence
        else:
            new_column = base_column
            seen_columns[base_column] = 0  # Add the base column with index 0

        new_columns.append(new_column)

    df.columns = new_columns
    return df

不幸的是,第一组重复列(没有.1.2后缀的列)保持不变。我得到的输出是这样的:

wrong_output = pd.DataFrame({
    'ID': [True, False, True],
    'Revenue (USDm)': [1000, 2000, 1500],
    'Location': ['London', 'New York', 'Paris'],
    'Year': [2021, 2022, 2023],
    'Sold Products': [10, 20, 30],
    'Leased Products': [5, 10, 15],
    'Investments': [7, 12, 8],
    'Sold Products - Vehicles': [15, 25, 35],
    'Leased Products - Vehicles': [8, 12, 16],
    'Investments - Vehicles': [6, 9, 11],
    'Sold Products - Electronics': [5, 10, 15],
    'Leased Products - Electronics': [2, 5, 8],
    'Investments - Electronics': [3, 7, 4],
    'QC Completed?': [True, True, False],
})

有办法解决吗?

8i9zcol2

8i9zcol21#

通过enumerate创建字典,并通过GroupBy.cumcountMap计数器中的重复值:

suffixes = ['Vehicles','Electronics','Real Estate']
d = dict(enumerate(suffixes))

s = original_df.columns.to_series()

new = s.str.replace(r'\.\d+$','', regex=True)

mapped = (new.groupby(new).cumcount()
             .where(new.duplicated(keep=False)).map(d)
             .radd(' - ').fillna(''))

original_df.columns =  new + mapped
print (original_df)
      ID  Revenue (USDm)  Location  Year  Sold Products - Vehicles  \
0   True            1000    London  2021                        10   
1  False            2000  New York  2022                        20   
2   True            1500     Paris  2023                        30   

   Leased Products - Vehicles  Investments - Vehicles  \
0                           5                       7   
1                          10                      12   
2                          15                       8   

   Sold Products - Electronics  Leased Products - Electronics  \
0                           15                              8   
1                           25                             12   
2                           35                             16   

   Investments - Electronics  Sold Products - Real Estate  \
0                          6                            5   
1                          9                           10   
2                         11                           15   

   Leased Products - Real Estate  Investments - Real Estate  QC Completed?  
0                              2                          3           True  
1                              5                          7           True  
2                              8                          4          False
xxb16uws

xxb16uws2#

你可以使用正则表达式和str.replace,这里有一个自定义函数以提高灵活性:

import re

# identify duplicated columns
dup_cols = original_df.filter(regex=r'\.\d+$').columns

# get the base names of the duplicates
base = dup_cols.str.replace(r'\.\d+$', '').unique()
# ['Sold Products', 'Leased Products', 'Investments']

# craft a pattern
pattern = fr"^({'|'.join(map(re.escape, base))})(\.\d+)?$"
# '^(Sold\\ Products|Leased\\ Products|Investments)(\\.\\d+)?$'
suffixes = ['Vehicles','Electronics','Real Estate']
dic = dict(enumerate(suffixes))

def f(m):
    suffix = m.group(2)
    if suffix:
        suffix = dic.get(int(suffix[1:]), '')
    else:
        suffix = dic[0]
    return m.group(1) + ' - ' + suffix

# replace based on pattern
original_df.columns = original_df.columns.str.replace(pattern, f, regex=True)

输出:

ID  Revenue (USDm)  Location  Year  Sold Products - Vehicles  Leased Products - Vehicles  Investments - Vehicles  Sold Products - Electronics  Leased Products - Electronics  \
0   True            1000    London  2021                        10                           5                       7                           15                              8   
1  False            2000  New York  2022                        20                          10                      12                           25                             12   
2   True            1500     Paris  2023                        30                          15                       8                           35                             16   

   Investments - Electronics  Sold Products - Real Estate  Leased Products - Real Estate  Investments - Real Estate  QC Completed?  
0                          6                            5                              2                          3           True  
1                          9                           10                              5                          7           True  
2                         11                           15                              8                          4          False

相关问题