pandas 给定时间范围panda的 Dataframe 内的交叉连接

roejwanj  于 2022-12-16  发布在  其他
关注(0)|答案(2)|浏览(123)

我有一个 Dataframe 如下:

ID    GROUP   DATE_MIN               DATE_MAX
1     L1      02/12/2022 6:30AM      02/12/2022 6:35AM
2     L1      02/12/2022 6:33AM      02/12/2022 6:40AM
3     L1      02/12/2022 6:37AM      02/12/2022 6:40AM
4     L2      02/12/2022 7:30AM      02/12/2022 7:35AM
5     L2      02/12/2022 7:33AM      02/12/2022 7:35AM
6     L2      02/12/2022 7:34AM      02/12/2022 7:38AM

我想计算时间范围(DATE_MIN、DATE_MAX)之间每组(GROUP列)的行数,预期输出为

ID    GROUP   DATE_MIN               DATE_MAX                NumberOfRows
1     L1      02/12/2022 6:30AM      02/12/2022 6:35AM        2 <<because of ID 1 and 2>>
2     L1      02/12/2022 6:33AM      02/12/2022 6:40AM        3 <<because of ID 1, 2 and 3>>
3     L1      02/12/2022 6:37AM      02/12/2022 6:40AM        2 <<because of ID 3 and 2>>
4     L2      02/12/2022 7:30AM      02/12/2022 7:35AM        1 << because of 4 only>>
5     L2      02/12/2022 7:36AM      02/12/2022 7:40AM        2 <<because of 5 and 6>>
6     L2      02/12/2022 7:37AM      02/12/2022 7:40AM        2 <<because of 5 and 6>>
ao218c7q

ao218c7q1#

假设您的组不是很大(因为此解决方案为每个组构建一个方阵),您可以使用numpy

def count(g):
    # convert to datetime arrays
    d_min = pd.to_datetime(g['DATE_MIN']).to_numpy()
    d_max = pd.to_datetime(g['DATE_MAX']).to_numpy()
    # build the square comparisons to see if the intervals overlap
    # aggregate as sum
    return pd.Series((  (d_min[:,None]<=d_max)
                      & (d_max[:,None]>=d_min)
                     ).sum(axis=0),
                     index=g.index)

df['NumberOfRows'] = df.groupby('GROUP', group_keys=False).apply(count)
  • 注意:如果您最初将日期转换为datetime,然后从函数中删除转换,效率可能会更高。*

输出:

ID GROUP           DATE_MIN           DATE_MAX  NumberOfRows
0   1    L1  02/12/2022 6:30AM  02/12/2022 6:35AM             2
1   2    L1  02/12/2022 6:33AM  02/12/2022 6:40AM             3
2   3    L1  02/12/2022 6:37AM  02/12/2022 6:40AM             2
3   4    L2  02/12/2022 7:30AM  02/12/2022 7:35AM             1
4   5    L2  02/12/2022 7:36AM  02/12/2022 7:40AM             2
5   6    L2  02/12/2022 7:37AM  02/12/2022 7:40AM             2
bpzcxfmw

bpzcxfmw2#

df [“行数”] = df.分组依据(“分组”)[“日期_分钟”].转换(λ x:x.介于(df [“日期最小值”],df [“日期最大值”]).sum())打印(df)

相关问题