如何将重复行透视到新列Pandas

mnemlml8  于 2023-03-21  发布在  其他
关注(0)|答案(2)|浏览(121)

我有一个数据集,包含国家对的进口和出口。每个国家对都有两行,对应给定的年份。数据如下所示:

import pandas as pd
d = {
    "c1_id": [1,1,1,1,2,2,2,2],
    "c2_id":[2,2,2,2,1,1,1,1],
    "c1_name":["Austria","Austria","Austria","Austria","Denmark","Denmark","Denmark","Denmark"], 
    "c2_name":["Denmark","Denmark","Denmark","Denmark","Austria","Austria","Austria","Austria"],
    "year": [1970,1971,1972,1973,1970,1971,1972,1973],
    "export":[10,12,14,17,19,20,45,70],
    "import":[17,19,49,12,45,34,23,60]
    }
df = pd.DataFrame(d)
c1_id  c2_id  c1_name  c2_name  year  export  import
0      1      2  Austria  Denmark  1970      10      17
1      1      2  Austria  Denmark  1971      12      19
2      1      2  Austria  Denmark  1972      14      49
3      1      2  Austria  Denmark  1973      17      12
4      2      1  Denmark  Austria  1970      19      45
5      2      1  Denmark  Austria  1971      20      34
6      2      1  Denmark  Austria  1972      45      23
7      2      1  Denmark  Austria  1973      70      60

最后,我想要一个数据框,其中一行是给定年份,列是export1、export2、import1、import2:
一个二个一个一个
有没有办法在Python中做到这一点?
我尝试使用pivot但是没有成功。我也尝试unstack但是没有成功。

68bkxrlz

68bkxrlz1#

看起来使用掩码,重命名列,并执行merge可能是一个很好的方法:

# identify rows with c1_id < c2_id
m = df['c1_id'].lt(df['c2_id'])

# change column names
names = {'c1_id': 'c2_id', 'c2_id': 'c1_id',
         'c1_name': 'c2_name', 'c2_name': 'c1_name',
         'export': 'export2', 'import': 'import2'
        }

# merge the swapped rows
out = df[m].merge(df[~m].rename(columns=names), how='outer')

输出:

c1_id  c2_id  c1_name  c2_name  year  export  import  export2  import2
0      1      2  Austria  Denmark  1970      10      17       19       45
1      1      2  Austria  Denmark  1971      12      19       20       34
2      1      2  Austria  Denmark  1972      14      49       45       23
3      1      2  Austria  Denmark  1973      17      12       70       60
7cwmlq89

7cwmlq892#

探索另一种(快速)方法:

import pandas as pd
import time

d = {
    "c1_id": [1,1,1,1,2,2,2,2],
    "c2_id":[2,2,2,2,1,1,1,1],
    "c1_name":["Austria","Austria","Austria","Austria","Denmark","Denmark","Denmark","Denmark"], 
    "c2_name":["Denmark","Denmark","Denmark","Denmark","Austria","Austria","Austria","Austria"],
    "year": [1970,1971,1972,1973,1970,1971,1972,1973],
    "export":[10,12,14,17,19,20,45,70],
    "import":[17,19,49,12,45,34,23,60]
    }

df = pd.DataFrame(d)

start = time.time()

df = df.pivot(columns='c1_id')
# Count Nan found in even columns and shit-up by this amount
df.loc[:, ('c2_id', 2)::2] = \
     df.loc[:, ('c2_id', 2)::2].shift(-df.loc[:, ('c2_id', 2)].isna().sum())
# Remove duplicated columns and Nan rows
df = df.T.drop_duplicates().T.dropna()
# Merge columns levels into a unique level (and renaming)     
df.columns = [''.join(map(str, col)) for col in df.columns.to_flat_index()]

# Renaming taking into account pivot logic ie first col is c**2**_id and second is c**1**_id :
df.rename(columns = {'c2_id1': 'c1_id', 'c2_id2': 'c2_id', 'c1_name1': 'c1_name', 
                     'c1_name2': 'c2_name', 'year1':'year', 'import1': 'import',
                     'export1':'export'})
end = time.time()

print("speed : ", end-start, " seconds \n")
print(df)
speed :  0.04899238586425781  seconds 

  c2_id1 c2_id2 c1_name1 c1_name2 year1 export1 export2 import1 import2
0      2      1  Austria  Denmark  1970      10      19      17      45
1      2      1  Austria  Denmark  1971      12      20      19      34
2      2      1  Austria  Denmark  1972      14      45      49      23
3      2      1  Austria  Denmark  1973      17      70      12      60

相关问题