我有几个 Dataframe 与某些重复的列名(他们来自Excel文件)。我的数据看起来有点像这样。
original_df= pd.DataFrame({
'ID': [True, False, True],
'Revenue (USDm)': [1000, 2000, 1500],
'Location': ['London', 'New York', 'Paris'],
'Year': [2021, 2022, 2023],
'Sold Products': [10, 20, 30],
'Leased Products': [5, 10, 15],
'Investments': [7, 12, 8],
'Sold Products.1': [15, 25, 35],
'Leased Products.1': [8, 12, 16],
'Investments.1': [6, 9, 11],
'Sold Products.2': [5, 10, 15],
'Leased Products.2': [2, 5, 8],
'Investments.2': [3, 7, 4],
'QC Completed?': [True, True, False],
})
当我读取df时,pandas会自动将.1
和.2
后缀添加到重复的列名中。我尝试编写一个函数来识别重复项,并从我提供的列表中添加一组新的后缀,同时删除适用的.1
和.2
。
新的后缀列表是suffixes = ['Vehicles','Electronics','Real Estate']
输出应该如下所示:
desired_output = pd.DataFrame({
'ID': [True, False, True],
'Revenue (USDm)': [1000, 2000, 1500],
'Location': ['London', 'New York', 'Paris'],
'Year': [2021, 2022, 2023],
'Sold Products - Vehicles': [10, 20, 30],
'Leased Products - Vehicles': [5, 10, 15],
'Investments - Vehicles': [7, 12, 8],
'Sold Products - Electronics': [15, 25, 35],
'Leased Products - Electronics': [8, 12, 16],
'Investments - Electronics': [6, 9, 11],
'Sold Products - Real Estate': [5, 10, 15],
'Leased Products - Real Estate': [2, 5, 8],
'Investments - Real Estate': [3, 7, 4],
'QC Completed?': [True, True, False],
})
没有重复的列名称应保持不变,但重复的列按顺序添加后缀;如果它们也有.1
和.2
后缀,则会被删除。
我的功能如下:
def change_colnames(df, suffixes):
new_columns = []
seen_columns = {}
for column in df.columns:
match = re.match(r'^(.*?)(?:\.\d+)?$', column) # Match the base column name and optional suffix
base_column = match.group(1) if match else column # Get the base column name or keep the original column name
if base_column in seen_columns:
idx = seen_columns[base_column] # Get the index of the base column
new_column = f"{base_column} {suffixes[idx]}" # Append the new suffix
seen_columns[base_column] += 1 # Increment the index for the next occurrence
else:
new_column = base_column
seen_columns[base_column] = 0 # Add the base column with index 0
new_columns.append(new_column)
df.columns = new_columns
return df
不幸的是,第一组重复列(没有.1
和.2
后缀的列)保持不变。我得到的输出是这样的:
wrong_output = pd.DataFrame({
'ID': [True, False, True],
'Revenue (USDm)': [1000, 2000, 1500],
'Location': ['London', 'New York', 'Paris'],
'Year': [2021, 2022, 2023],
'Sold Products': [10, 20, 30],
'Leased Products': [5, 10, 15],
'Investments': [7, 12, 8],
'Sold Products - Vehicles': [15, 25, 35],
'Leased Products - Vehicles': [8, 12, 16],
'Investments - Vehicles': [6, 9, 11],
'Sold Products - Electronics': [5, 10, 15],
'Leased Products - Electronics': [2, 5, 8],
'Investments - Electronics': [3, 7, 4],
'QC Completed?': [True, True, False],
})
有办法解决吗?
2条答案
按热度按时间8i9zcol21#
通过
enumerate
创建字典,并通过GroupBy.cumcount
Map计数器中的重复值:xxb16uws2#
你可以使用正则表达式和
str.replace
,这里有一个自定义函数以提高灵活性:输出: