pandas: merge two DataFrames when df2's start and end dates fall within df1's start and end date range (Python)

vktxenjb  posted on 2023-01-11 in Python

I have two DataFrames, df1 and df2.
df1 =

id      start        end
 a  1/12/2022 18/12/2022
 a 19/12/2022 25/12/2022
 a 26/12/2022 31/12/2022
 b 01/12/2022 20/12/2022
 b 21/12/2022 31/12/2022
 c 01/12/2022 31/12/2022
 d 01/12/2022 15/12/2022
 d 16/12/2022 31/12/2022

and the second DataFrame:
df2 =

id    start_2      end_2  number
 a 15/12/2022 15/12/2022       1
 b 17/12/2022 19/12/2022       3
 b 25/12/2022 27/12/2022       2
 c 12/12/2022 12/12/2022       1
 d 03/12/2022 04/12/2022       2
 d 25/12/2022 25/12/2022       1

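I want to left-join the two DataFrames on id (df1 on the left, df2 on the right) and place the "number" column in the df1 row whose date range (start to end) contains the df2 dates. For example, df2 has number 1 for id "a" on 15/12/2022, so it should appear in the first "a" row (1/12/2022 to 18/12/2022) and not in the other slots. The other slots should be zero, as shown below.
Result df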

id      start        end  number
 a  1/12/2022 18/12/2022       1
 a 19/12/2022 25/12/2022       0
 a 26/12/2022 31/12/2022       0
 b 01/12/2022 20/12/2022       3
 b 21/12/2022 31/12/2022       2
 c 01/12/2022 31/12/2022       1
 d 01/12/2022 15/12/2022       2
 d 16/12/2022 31/12/2022       1

Note that if two numbers fall into the same df1 slot, they should be summed (groupby sum).
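For anyone who wants to reproduce this, the two frames above can be rebuilt roughly as follows. This is only a sketch of the sample data: the dates are kept as day-first strings exactly as they appear in the tables, and nothing beyond the column names shown there is assumed.

```python
import pandas as pd

# Sample data copied from the tables in the question (dates are day-first strings)
df1 = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "c", "d", "d"],
    "start": ["1/12/2022", "19/12/2022", "26/12/2022", "01/12/2022",
              "21/12/2022", "01/12/2022", "01/12/2022", "16/12/2022"],
    "end": ["18/12/2022", "25/12/2022", "31/12/2022", "20/12/2022",
            "31/12/2022", "31/12/2022", "15/12/2022", "31/12/2022"],
})

df2 = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d", "d"],
    "start_2": ["15/12/2022", "17/12/2022", "25/12/2022",
                "12/12/2022", "03/12/2022", "25/12/2022"],
    "end_2": ["15/12/2022", "19/12/2022", "27/12/2022",
              "12/12/2022", "04/12/2022", "25/12/2022"],
    "number": [1, 3, 2, 1, 2, 1],
})

print(df1)
print(df2)
```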

nhn9ugyo  #1

Here is a workaround: after merging, build the start and end conditions, then use .loc and groupby.

import pandas as pd

# Parse the day-first date strings into datetimes
df1["start"] = pd.to_datetime(df1["start"], dayfirst=True)
df1["end"] = pd.to_datetime(df1["end"], dayfirst=True)
df2["start_2"] = pd.to_datetime(df2["start_2"], dayfirst=True)
df2["end_2"] = pd.to_datetime(df2["end_2"], dayfirst=True)

# Left join on id: each df1 slot is paired with every df2 row of the same id
merged_df = pd.merge(df1, df2, on="id", how="left")
merged_df["number_adj"] = 0

# A df2 row counts toward a slot if its start or its end falls inside the slot
start_condition = (merged_df["start_2"] >= merged_df["start"]) & (merged_df["start_2"] <= merged_df["end"])
end_condition = (merged_df["end_2"] >= merged_df["start"]) & (merged_df["end_2"] <= merged_df["end"])

# Copy number only where a condition holds; all other pairs stay at 0
merged_df.loc[start_condition | end_condition, "number_adj"] = merged_df["number"]

# Sum only number_adj per slot (summing the datetime columns fails in recent pandas)
merged_df = merged_df.groupby(["id", "start", "end"])["number_adj"].sum().reset_index()
merged_df.rename(columns={"number_adj": "number"}, inplace=True)

print(merged_df)

Output:

  id      start        end  number
0  a 2022-12-01 2022-12-18       1
1  a 2022-12-19 2022-12-25       0
2  a 2022-12-26 2022-12-31       0
3  b 2022-12-01 2022-12-20       3
4  b 2022-12-21 2022-12-31       2
5  c 2022-12-01 2022-12-31       1
6  d 2022-12-01 2022-12-15       2
7  d 2022-12-16 2022-12-31       1
z3yyvxxp  #2

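You can use concat and groupby together with the size() method.

import pandas as pd

# Stack both frames, then count the rows for each (start, end) pair
df = pd.concat([df1, df2])
df.groupby(["start", "end"]).size()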
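To see what this approach returns, here is a small sketch on a two-row cut of the question's data (the reduced frames are just an illustration, not part of the original answer). Since df2 has start_2/end_2 rather than start/end, its rows carry NaN in the grouping columns after concat, and groupby drops NaN keys by default. As far as I can tell, the result therefore counts df1 rows per (start, end) pair and does not match df2 intervals against df1 slots.

```python
import pandas as pd

# Reduced version of the question's frames (illustrative subset)
df1 = pd.DataFrame({
    "id": ["a", "a"],
    "start": ["1/12/2022", "19/12/2022"],
    "end": ["18/12/2022", "31/12/2022"],
})
df2 = pd.DataFrame({
    "id": ["a"],
    "start_2": ["15/12/2022"],
    "end_2": ["15/12/2022"],
    "number": [1],
})

# df2 has no start/end columns, so its rows get NaN there after concat
stacked = pd.concat([df1, df2])

# groupby drops NaN keys by default, so only the df1 rows are counted
sizes = stacked.groupby(["start", "end"]).size()
print(sizes)
```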
rryofs0p  #3

You can merge on id and then filter the rows:

import pandas as pd

# Convert to datetime if necessary
df1['start'] = pd.to_datetime(df1['start'], dayfirst=True)
df1['end'] = pd.to_datetime(df1['end'], dayfirst=True)
df2['start_2'] = pd.to_datetime(df2['start_2'], dayfirst=True)
df2['end_2'] = pd.to_datetime(df2['end_2'], dayfirst=True)

# Merge on id; reset_index keeps df1's original index as an 'index' column
out = df1.reset_index().merge(df2, on='id', how='left')

# Check intervals (inclusive bounds, so dates on a slot boundary still match)
out['indicator'] = (out['start'] <= out['start_2']) & (out['end_2'] <= out['end'])

# Keep one row per df1 slot (the first match, if any) and set other slots to 0
out = out.loc[out.groupby('index')['indicator'].idxmax()]
out.loc[~out['indicator'], 'number'] = 0

# Get the final dataframe
out = out[df1.columns.tolist() + ['number']].set_index(df1.index)

Output:

>>> out
  id      start        end  number
0  a 2022-12-01 2022-12-18       1
1  a 2022-12-19 2022-12-25       0
2  a 2022-12-26 2022-12-31       0
3  b 2022-12-01 2022-12-20       3
4  b 2022-12-21 2022-12-31       2
5  c 2022-12-01 2022-12-31       1
6  d 2022-12-01 2022-12-15       2
7  d 2022-12-16 2022-12-31       1
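The question also asks for a sum when several df2 rows land in the same df1 slot, which keeping a single row per slot does not cover. A possible variant is sketched below. The function name slot_totals and the small sample frames are made up for illustration, and it assumes df1 has a unique index and that "inside a slot" means the whole df2 interval lies within the slot, bounds included.

```python
import pandas as pd

def slot_totals(df1, df2):
    """Sum df2['number'] into the df1 slot that fully contains each df2 interval.

    Hypothetical helper; assumes df1 has a unique index and datetime columns.
    """
    # Pair every df1 slot with every df2 row of the same id, keeping df1's index as 'slot'
    pairs = df1.rename_axis("slot").reset_index().merge(df2, on="id", how="left")

    # Inclusive containment test; pairs without a df2 match compare as False (NaT)
    inside = (pairs["start"] <= pairs["start_2"]) & (pairs["end_2"] <= pairs["end"])

    # Sum the matching numbers per slot, then fill slots without any match with 0
    totals = pairs.loc[inside].groupby("slot")["number"].sum()
    out = df1.copy()
    out["number"] = totals.reindex(df1.index, fill_value=0).astype(int).to_numpy()
    return out

# Illustrative data: two df2 rows fall into the first "a" slot, and id "e" has no df2 rows
slots = pd.DataFrame({
    "id": ["a", "a", "e"],
    "start": pd.to_datetime(["01/12/2022", "19/12/2022", "01/12/2022"], format="%d/%m/%Y"),
    "end": pd.to_datetime(["18/12/2022", "31/12/2022", "31/12/2022"], format="%d/%m/%Y"),
})
events = pd.DataFrame({
    "id": ["a", "a", "a"],
    "start_2": pd.to_datetime(["02/12/2022", "15/12/2022", "25/12/2022"], format="%d/%m/%Y"),
    "end_2": pd.to_datetime(["03/12/2022", "15/12/2022", "25/12/2022"], format="%d/%m/%Y"),
    "number": [2, 1, 4],
})

result = slot_totals(slots, events)
print(result)
```

With this sample data the number column comes out as 3, 4, 0: the first two events are summed into the first slot, the third goes to the second slot, and id "e" gets zero.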
