pandas Python: a dataset has repeated IDs with different column values; how do I select only one row per group?

Posted by ogq8wdun on 2023-02-07 in Python

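I am trying to build a counter that advances to the next Region row each time it moves on to the next ID, so that every ID ends up with exactly one result.
I have tried several approaches, but none of them seem to work, and I am out of ideas.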
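Dataset
| ID | Region |
| --- | --- |
| 1 | North |
| 1 | South |
| 1 | East |
| 1 | West |
| 2 | North |
| 2 | South |
| 2 | East |
| 2 | West |
| 3 | North |
| 3 | South |
| 3 | East |
| 3 | West |
| 4 | North |
| 4 | South |
| 4 | East |
| 4 | West |
| 5 | Northwest |
| 5 | South West |
| 6 | Northwest |
| 6 | South West |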
Expected output
| ID | Region |
| --- | --- |
| 1 | North |
| 2 | South |
| 3 | East |
| 4 | West |
| 5 | Northwest |
| 6 | South West |

zf9nrax1 #1

You can factorize both columns and keep the rows where the two ranks are equal:

out = df.loc[pd.factorize(df['Region'])[0] == pd.factorize(df['ID'])[0]]

Output:

    ID      Region
0    1       North
5    2       South
10   3        East
15   4        West
16   5   Northwest
19   6  South West
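To make the factorize idea easier to follow, here is a minimal self-contained sketch. The DataFrame is rebuilt from the question's table, and the names `id_codes` and `region_codes` are illustrative only. It assumes that the n-th distinct ID actually has the n-th distinct Region among its rows, which holds for the sample data but is not guaranteed in general.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed column names: ID, Region)
df = pd.DataFrame({
    'ID': [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 2 + [6] * 2,
    'Region': ['North', 'South', 'East', 'West'] * 4
              + ['Northwest', 'South West'] * 2,
})

# pd.factorize returns (codes, uniques); codes number the values
# in order of first appearance: 0, 1, 2, ...
id_codes = pd.factorize(df['ID'])[0]
region_codes = pd.factorize(df['Region'])[0]

# Keep the rows where the n-th distinct ID meets the n-th distinct Region
out = df.loc[id_codes == region_codes]
print(out)
```

If an ID lacks the Region with the matching code, that ID simply drops out of the result, so it may be worth checking that `out['ID']` still covers every ID.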

Another idea: build an intermediate rectangular matrix and take its diagonal.

import numpy as np
import pandas as pd

df2 = (df.pivot(index='ID', columns='Region', values='Region')
         .reindex(index=df['ID'].unique(), columns=df['Region'].unique())
      )

out = pd.DataFrame({'ID': df2.index, 'Region': np.diag(df2)})

Output:

   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West

Intermediate rectangular matrix:

Region  North  South  East  West  Northwest  South West
ID                                                     
1       North  South  East  West        NaN         NaN
2       North  South  East  West        NaN         NaN
3       North  South  East  West        NaN         NaN
4       North  South  East  West        NaN         NaN
5         NaN    NaN   NaN   NaN  Northwest  South West
6         NaN    NaN   NaN   NaN  Northwest  South West

8fq7wneg #2

I think the `.drop_duplicates` method might solve your problem. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
You should use the `subset` parameter, as described in the documentation linked above.
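As a rough illustration of that suggestion, the sketch below calls `drop_duplicates` with `subset='ID'` on data rebuilt from the question's table (the name `first_per_id` is made up for this example). Note that it keeps the first row of every ID, so it yields one row per ID but not the rotating Region selection shown in the expected output.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed column names: ID, Region)
df = pd.DataFrame({
    'ID': [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 2 + [6] * 2,
    'Region': ['North', 'South', 'East', 'West'] * 4
              + ['Northwest', 'South West'] * 2,
})

# Keep only the first row seen for each ID (keep='first' is the default)
first_per_id = df.drop_duplicates(subset='ID')

print(first_per_id['Region'].tolist())
# ['North', 'North', 'North', 'North', 'Northwest', 'Northwest']
```

So on its own this probably does not answer the question as asked; it would need to be combined with some way of picking a different row for each ID.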

jhdbpxl9 #3

You can collect the unique region names into a set and then, for each group (grouped by ID), take the next region that is still available:

def get_next_region(r_group, regions):
    for reg in r_group:
        if reg in regions:
            regions.remove(reg)
            break
    return reg

regions = set(df.Region.unique())  # unique regions
reg_df = df.groupby('ID')['Region'].apply(get_next_region, regions=regions.copy())\
    .reset_index(name="Region")

print(reg_df)
   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West

yvt65v4c #4

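Here is one possible solution. It does produce the desired output for the example you gave, but it will not necessarily generalize to every combination of ID values or Regions. It may be useful if you are looking for a pandas approach that does not require writing a loop.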

import pandas as pd
import math
# Create pandas DataFrame
df = pd.DataFrame({'ID':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6],
                   'Region':['North','South','East','West',
                            'North','South','East','West',
                            'North','South','East','West',
                            'North','South','East','West',
                            'Northwest','South West','Northwest','South West'
                            ]
                  })

# A method of indexing the regions per ID from 1 to n regions
df = df.reset_index().rename(columns={'index':'Region Idx'})
offset = df['Region Idx']*df['ID'].diff()
offset.loc[offset==0] = math.nan
offset = offset.ffill().fillna(0)
df['Region Idx'] = df['Region Idx'] - offset + 1

# Assuming your IDs are integers in ascending order 
# and you don't have any special cases with the number of regions per ID,
# this can be used for the rolling region selection per ID
df['Max Region'] = df[['ID','Region Idx']].groupby('ID').transform('max')
df['Selected Region Idx'] = (df['ID']-1) % df['Max Region'] + 1

# Final result
result = df.loc[df['Region Idx']==df['Selected Region Idx'],['ID','Region']]
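The same per-ID rotation can also be sketched more compactly with `groupby` helpers. This is a variant of the idea above rather than the answer's own code: the names `pos`, `group_size` and `group_no` are illustrative, and it assumes that sorting the IDs gives the order in which the selection should rotate and that the rows within each ID are already in the desired order.

```python
import pandas as pd

# Sample data rebuilt from the question's table (assumed column names: ID, Region)
df = pd.DataFrame({
    'ID': [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 2 + [6] * 2,
    'Region': ['North', 'South', 'East', 'West'] * 4
              + ['Northwest', 'South West'] * 2,
})

grouped = df.groupby('ID')
pos = grouped.cumcount()                           # 0-based row position within each ID
group_size = grouped['Region'].transform('size')   # number of rows for that ID
group_no = grouped.ngroup()                        # 0-based number of each ID (sorted order)

# For the n-th ID keep the row at position n modulo the group's size
result = df.loc[pos == group_no % group_size, ['ID', 'Region']]
print(result)
```

Taking the modulo by each group's own size is what lets IDs 5 and 6, which have only two rows, wrap around to their first and second Region.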

hs1rzwqc #5

One possible solution:

rows = []
for i, id_ in enumerate(df['ID'].unique()):
    # All regions listed for this ID, in their original order
    regions = df.loc[df['ID'] == id_, 'Region']
    # Wrap around using the group's own size rather than a hard-coded 4
    rows.append({'ID': id_, 'Region': regions.iloc[i % len(regions)]})

res_df = pd.DataFrame(rows)
print(res_df)
   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West

x759pob2 #6

Another idea, using groupby.ngroup and drop_duplicates:

id_idx = df.groupby('ID').ngroup().drop_duplicates().index
region_idx = df.groupby('Region').ngroup().drop_duplicates().index
out = pd.DataFrame()
out['ID'], out['Region'] = df.loc[id_idx, 'ID'].values, df.loc[region_idx, 'Region'].values

print(out)

   ID      Region
0   1       North
1   2       South
2   3        East
3   4        West
4   5   Northwest
5   6  South West
