pandas: How to match data points one-to-one based on some conditions

Posted by bttbmeg0 on 2022-12-02 in Other


Suppose I have a DataFrame like the following:

import pandas as pd

df = pd.DataFrame(columns=['ID', 'job', 'eligible', "date"])
df['ID'] = ['1', '2', '3', '4', '5', '6', '7', '8']
df['job'] = ['waitress', 'doctor', 'benevolent', 'nurse', 'hairstylist', 'banker', 'waitress', 'waitress']
df['eligible'] = ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']
df['date'] = ['1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2015', '1.1.2015']

# Dates are written day.month.year, so pass the format explicitly
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

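I want to pair rows that have the same job and the same eligibility but different years (2015 vs. 2016). The matching must be strictly one-to-one, even though a given row may have several candidate matches or none at all. If there are several candidates, the match should be chosen at random.
So I would like to get a result like this: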

df_paired = pd.DataFrame(columns=['ID', 'job', 'eligible', "paired_ID"])
df_paired['ID'] = ['1']
df_paired['paired_ID'] = ['8']
df_paired['job'] = ['waitress']
df_paired['eligible'] = ['No']

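I have tried many approaches, but the main difficulty is keeping the matching one-to-one, so that every match is unique even though a single observation may have several candidate matches...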

pandas

Source: https://stackoverflow.com/questions/74607527/how-to-match-one-to-one-data-point-based-on-some-conditions

1 answer


unguejic1#

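This solution uses groupby to find the rows that share the same job and eligible values. Within each of those groups it then identifies the subgroups that share the same year. For each row, a different year is picked at random, and a random index from that year's subgroup selects the ID that is assigned to paired_ID.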

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['ID', 'job', 'eligible', "date"])
df['ID'] = ['1', '2', '3', '4', '5', '6', '7', '8']
df['job'] = ['waitress', 'doctor', 'benevolent', 'nurse', 'hairstylist', 'banker', 'waitress', 'waitress']
df['eligible'] = ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']
df['date'] = ['1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2015', '1.1.2015']

# Dates are written day.month.year, so pass the format explicitly
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

# Add an empty column for the matched ID
df['paired_ID'] = None

# Rows can only match within the same (job, eligible) group
for _, matches in df.groupby(['job', 'eligible']):
    if len(matches) > 1:
        year_groups = matches.groupby(by=matches['date'].dt.year)
        if len(year_groups) > 1:
            years = tuple(year_groups.groups.keys())
            for y in years:
                # Pick a different year at random to draw the partner from
                pair_y = np.random.choice([x for x in years if x != y])
                for index in year_groups.groups[y]:
                    # groups holds index labels, so look rows up with .loc
                    paired_index = np.random.choice(year_groups.groups[pair_y])
                    # Assign through .loc to avoid chained assignment
                    df.loc[index, 'paired_ID'] = int(df.loc[paired_index, 'ID'])
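
The loop above draws a partner for each row independently, so the same ID can be assigned to more than one row. Below is a minimal sketch of a stricter one-to-one variant, not the answer author's method. It assumes only the two years from the question (2015 and 2016) are being paired. The function name pair_one_to_one and its seed parameter are hypothetical, introduced here for illustration. Within each (job, eligible) group, both year subgroups are shuffled and zipped together, so each row is used at most once and any surplus rows stay unpaired.

```python
import numpy as np
import pandas as pd


def pair_one_to_one(df, seed=None):
    """Pair each 2015 row with at most one 2016 row that has the same
    job and eligible values. Unmatched rows keep paired_ID = None.
    (Hypothetical helper; the two years are assumed from the question.)"""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["paired_ID"] = None
    for _, grp in out.groupby(["job", "eligible"]):
        years = grp["date"].dt.year
        # Shuffle the index labels of each year so pairs are random
        left = rng.permutation(grp.index[(years == 2015).to_numpy()].to_numpy())
        right = rng.permutation(grp.index[(years == 2016).to_numpy()].to_numpy())
        # zip stops at the shorter side, so every row is used at most once
        for i, j in zip(left, right):
            out.loc[i, "paired_ID"] = out.loc[j, "ID"]
            out.loc[j, "paired_ID"] = out.loc[i, "ID"]
    return out


# Sample data from the question
df = pd.DataFrame({
    "ID": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "job": ["waitress", "doctor", "benevolent", "nurse",
            "hairstylist", "banker", "waitress", "waitress"],
    "eligible": ["No", "Yes", "No", "Yes", "No", "No", "No", "No"],
    "date": ["1.1.2016", "31.12.2015", "1.1.2016", "31.12.2015",
             "1.1.2016", "31.12.2015", "1.1.2015", "1.1.2015"],
})
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

result = pair_one_to_one(df, seed=0)
print(result.dropna(subset=["paired_ID"]))
```

With the sample data, only the waitress/No group spans both years, so ID 1 is paired with either ID 7 or ID 8 and the remaining waitress row stays unpaired. Here the pairing is recorded on both rows; keeping only one direction would yield the single-row result shown in the question.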

Posted 2022-12-02
