pandas / Polars - count unique values over a time period

rqcrx0a6 · asked 2023-06-20

I'm migrating a pipeline from pandas to polars. The data consists of arrival and departure times of trucks docking at a warehouse. At one step of the pipeline I need to count how many trucks are docked at any given time: for each row (one docked truck) I need to count the number of unique trucks ("ID") present within a time window spanning from 1 minute before its arrival to 1 minute after its departure. I haven't found an efficient way to do this that doesn't rely on applying a function row by row (pandas-style).

import polars as pl

data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

My code so far; the first part is in polars and the last part still uses pandas:

processed_data = (
    data.sort(by=pl.col("arrival_time"))
    .with_columns(
        # widen each truck's interval by 1 minute on both sides
        arrival_time_expanded=pl.col("arrival_time").dt.offset_by("-1m"),
        departure_time_expanded=pl.col("departure_time").dt.offset_by("1m"),
    )
    .to_pandas()
)
# row by row: count unique IDs whose expanded interval overlaps this row's raw interval
processed_data = processed_data.assign(
    docked_trucks=processed_data.apply(
        lambda row: processed_data[
            (processed_data.arrival_time_expanded <= row.departure_time)
            & (processed_data.departure_time_expanded >= row.arrival_time)
        ]["ID"].nunique(),
        axis=1,
    )
)

Result:

┌─────────────────────┬─────────────────────┬─────┬───────────────────────┬─────────────────────────┬───────────────┐
│ arrival_time        ┆ departure_time      ┆ ID  ┆ arrival_time_expanded ┆ departure_time_expanded ┆ docked_trucks │
│ ---                 ┆ ---                 ┆ --- ┆ ---                   ┆ ---                     ┆ ---           │
│ datetime[ns]        ┆ datetime[ns]        ┆ str ┆ datetime[ns]          ┆ datetime[ns]            ┆ i64           │
╞═════════════════════╪═════════════════════╪═════╪═══════════════════════╪═════════════════════════╪═══════════════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47   ┆ 2023-01-01 06:26:08     ┆ 1             │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42   ┆ 2023-01-01 06:29:02     ┆ 1             │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:20   ┆ 2023-01-01 06:36:01     ┆ 4             │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:06   ┆ 2023-01-01 06:34:48     ┆ 4             │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:09   ┆ 2023-01-01 06:37:01     ┆ 4             │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08   ┆ 2023-01-01 06:40:49     ┆ 4             │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40   ┆ 2023-01-01 06:39:34     ┆ 4             │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43   ┆ 2023-01-01 06:41:48     ┆ 3             │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48   ┆ 2023-01-01 06:47:10     ┆ 3             │
└─────────────────────┴─────────────────────┴─────┴───────────────────────┴─────────────────────────┴───────────────┘

ruarlubt1#

If we reshape the data into "long" format with .melt(), this looks like it could be an asof-join type of problem.
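
This answer uses its own small example frame, df, rather than the question's data. The original doesn't show how it was built, so here is a reconstruction inferred from the printed events table further down, for anyone who wants to run the snippets:

# Reconstructed sample data; not shown in the original answer
# (use str.strptime on older polars versions instead of str.to_datetime)
df = pl.DataFrame({
    "ID": ["A1", "A2", "A5", "A6", "B3"],
    "arrival_time": [
        "2023-01-21 10:00:09", "2023-01-21 10:01:03", "2023-01-21 10:20:26",
        "2023-01-21 10:30:08", "2023-01-21 10:31:00",
    ],
    "departure_time": [
        "2023-01-21 10:02:06", "2023-01-21 10:05:08", "2023-01-21 10:22:36",
        "2023-01-21 10:34:18", "2023-01-21 10:33:01",
    ],
}).with_columns(
    pl.col("arrival_time", "departure_time").str.to_datetime("%Y-%m-%d %H:%M:%S")
)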

# long format: one row per (ID, event time)
times = df.melt('ID', variable_name='arrival/departure', value_name='time')

# same shape, with each time shifted by 1 minute to form the window boundaries
windows = df.with_columns(
    pl.col('arrival_time').dt.offset_by('-1m'),
    pl.col('departure_time').dt.offset_by('1m')
).melt('ID', variable_name='arrival/departure', value_name='window')

# match every event time to the most recent window boundary of the same kind
(times.join_asof(windows, left_on='time', right_on='window', by='arrival/departure')
      .filter(pl.col('arrival/departure') == 'arrival_time')
      .groupby('window')
      .agg('ID', pl.count())
)
shape: (3, 3)
┌─────────────────────┬──────────────┬───────┐
│ window              ┆ ID           ┆ count │
│ ---                 ┆ ---          ┆ ---   │
│ datetime[μs]        ┆ list[str]    ┆ u32   │
╞═════════════════════╪══════════════╪═══════╡
│ 2023-01-21 10:30:00 ┆ ["A6", "B3"] ┆ 2     │
│ 2023-01-21 10:00:03 ┆ ["A1", "A2"] ┆ 2     │
│ 2023-01-21 10:19:26 ┆ ["A5"]       ┆ 1     │
└─────────────────────┴──────────────┴───────┘

**Update:** maybe this is a simpler approach.

# stack the raw event times and the widened window boundaries into one sorted event stream
events = pl.concat(
    [
        df.melt('ID', variable_name='event', value_name='time'),
        df.select('ID',
            window_open  = pl.col("arrival_time").dt.offset_by("-1m"),
            window_close = pl.col("departure_time").dt.offset_by("1m")
        ).melt('ID', variable_name='event', value_name='time')
    ]
).sort('time')
shape: (20, 3)
┌─────┬────────────────┬─────────────────────┐
│ ID  ┆ event          ┆ time                │
│ --- ┆ ---            ┆ ---                 │
│ str ┆ str            ┆ datetime[μs]        │
╞═════╪════════════════╪═════════════════════╡
│ A1  ┆ window_open    ┆ 2023-01-21 09:59:09 │
│ A2  ┆ window_open    ┆ 2023-01-21 10:00:03 │
│ A1  ┆ arrival_time   ┆ 2023-01-21 10:00:09 │
│ A2  ┆ arrival_time   ┆ 2023-01-21 10:01:03 │
│ A1  ┆ departure_time ┆ 2023-01-21 10:02:06 │
│ A1  ┆ window_close   ┆ 2023-01-21 10:03:06 │
│ A2  ┆ departure_time ┆ 2023-01-21 10:05:08 │
│ A2  ┆ window_close   ┆ 2023-01-21 10:06:08 │
│ A5  ┆ window_open    ┆ 2023-01-21 10:19:26 │
│ A5  ┆ arrival_time   ┆ 2023-01-21 10:20:26 │
│ A5  ┆ departure_time ┆ 2023-01-21 10:22:36 │
│ A5  ┆ window_close   ┆ 2023-01-21 10:23:36 │
│ A6  ┆ window_open    ┆ 2023-01-21 10:29:08 │
│ B3  ┆ window_open    ┆ 2023-01-21 10:30:00 │
│ A6  ┆ arrival_time   ┆ 2023-01-21 10:30:08 │
│ B3  ┆ arrival_time   ┆ 2023-01-21 10:31:00 │
│ B3  ┆ departure_time ┆ 2023-01-21 10:33:01 │
│ B3  ┆ window_close   ┆ 2023-01-21 10:34:01 │
│ A6  ┆ departure_time ┆ 2023-01-21 10:34:18 │
│ A6  ┆ window_close   ┆ 2023-01-21 10:35:18 │
└─────┴────────────────┴─────────────────────┘

Then create a window id with .cumsum and count the arrival events.

# window_id increments at every window_close; arrivals between closes share an id
(events.with_columns(window_id = (pl.col('event') == 'window_close').cumsum())
       .filter(pl.col('event') == 'arrival_time')
       .with_columns(pl.count().over('window_id'))
)
shape: (5, 5)
┌─────┬──────────────┬─────────────────────┬───────────┬───────┐
│ ID  ┆ event        ┆ time                ┆ window_id ┆ count │
│ --- ┆ ---          ┆ ---                 ┆ ---       ┆ ---   │
│ str ┆ str          ┆ datetime[μs]        ┆ u32       ┆ u32   │
╞═════╪══════════════╪═════════════════════╪═══════════╪═══════╡
│ A1  ┆ arrival_time ┆ 2023-01-21 10:00:09 ┆ 0         ┆ 2     │
│ A2  ┆ arrival_time ┆ 2023-01-21 10:01:03 ┆ 0         ┆ 2     │
│ A5  ┆ arrival_time ┆ 2023-01-21 10:20:26 ┆ 2         ┆ 1     │
│ A6  ┆ arrival_time ┆ 2023-01-21 10:30:08 ┆ 3         ┆ 2     │
│ B3  ┆ arrival_time ┆ 2023-01-21 10:31:00 ┆ 3         ┆ 2     │
└─────┴──────────────┴─────────────────────┴───────────┴───────┘
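
If the goal is the question's original shape, with the count attached to each row of df, the grouped counts can be joined back on ID and arrival time. A minimal sketch, assuming the events frame from above (docked and df_with_counts are illustrative names, not from the original answer):

docked = (
    events.with_columns(window_id=(pl.col('event') == 'window_close').cumsum())
    .filter(pl.col('event') == 'arrival_time')
    .with_columns(docked_trucks=pl.count().over('window_id'))
    .select('ID', pl.col('time').alias('arrival_time'), 'docked_trucks')
)
# attach the count to the original rows
df_with_counts = df.join(docked, on=['ID', 'arrival_time'], how='left')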

qojgxg4l2#

So I came up with two options, both of which may blow up on large amounts of data:

First solution:

First perform a cross join, then filter out the non-overlapping pairs, then do a groupby (processed_data here is the polars frame with the expanded columns from the question, before the .to_pandas() step). Unfortunately, because of the cross join, this query can explode on large amounts of data.

# cross join pairs every row with every row, then keep only the overlapping pairs
data = processed_data.join(data, how="cross").filter(
    (pl.col("arrival_time_expanded") < pl.col("departure_time_right"))
    & (pl.col("departure_time_expanded") > pl.col("arrival_time_right"))
)
# one output row per original row, counting its overlap partners
data.groupby(
    pl.col(
        "arrival_time",
        "departure_time",
        "ID",
        "arrival_time_expanded",
        "departure_time_expanded",
    )
).agg(pl.count())
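
Note that pl.count() counts matching rows, so a truck that visits twice inside one window would be counted twice. To match the pandas nunique from the question, where each truck counts once, the aggregation can count distinct IDs from the joined side instead (ID_right is the default cross-join suffix for the right frame's ID column). A small variant:

# count distinct trucks instead of matching rows
data.groupby(
    "arrival_time",
    "departure_time",
    "ID",
    "arrival_time_expanded",
    "departure_time_expanded",
).agg(pl.col("ID_right").n_unique().alias("docked_trucks"))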

Second solution:

This might perform better on big data, but I haven't tested it, and you lose a bit of precision. To reduce cardinality, we round the arrival and departure times to the nearest minute, then explode the table into one row per minute per truck in the warehouse. (Note that the rounding can shift the counts slightly: for example, the second row below comes out as 2 where the question's expected output had 1.)

time_precision = "1m"

# round to the minute, then widen each interval by 1 minute on both sides
processed_data = data.sort(by=pl.col("arrival_time")).with_columns(
    arrival_time_expanded=pl.col("arrival_time").dt.round(time_precision).dt.offset_by("-1m"),
    departure_time_expanded=pl.col("departure_time").dt.round(time_precision).dt.offset_by("1m"),
)
# explode to one row per truck per minute it is in the warehouse,
# deduplicating so each truck counts once per minute
processed_data = processed_data.with_columns(
    pl.date_range(
        pl.col("arrival_time_expanded"),
        pl.col("departure_time_expanded"),
        interval=time_precision,
        eager=False,
    ).alias("interval")
).explode("interval").unique(["ID", "interval"])

# trucks present per minute, then the max over each original row's minutes
processed_data.with_columns(pl.count().over(pl.col("interval"))).groupby(
    pl.col(
        "arrival_time",
        "departure_time",
        "ID",
        "arrival_time_expanded",
        "departure_time_expanded",
    )
).agg(pl.col("count").max()).sort("arrival_time")

shape: (9, 6)
┌─────────────────────┬─────────────────────┬─────┬───────────────────────┬─────────────────────────┬───────┐
│ arrival_time        ┆ departure_time      ┆ ID  ┆ arrival_time_expanded ┆ departure_time_expanded ┆ count │
│ ---                 ┆ ---                 ┆ --- ┆ ---                   ┆ ---                     ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]          ┆ datetime[μs]            ┆ u32   │
╞═════════════════════╪═════════════════════╪═════╪═══════════════════════╪═════════════════════════╪═══════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:23:00   ┆ 2023-01-01 06:26:00     ┆ 1     │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:26:00   ┆ 2023-01-01 06:29:00     ┆ 2     │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:00   ┆ 2023-01-01 06:36:00     ┆ 4     │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:00   ┆ 2023-01-01 06:35:00     ┆ 4     │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:00   ┆ 2023-01-01 06:37:00     ┆ 4     │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:00   ┆ 2023-01-01 06:41:00     ┆ 4     │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:36:00   ┆ 2023-01-01 06:40:00     ┆ 4     │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:37:00   ┆ 2023-01-01 06:42:00     ┆ 4     │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:39:00   ┆ 2023-01-01 06:47:00     ┆ 3     │
└─────────────────────┴─────────────────────┴─────┴───────────────────────┴─────────────────────────┴───────┘
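
The rounding grid and the 1-minute widening are independent knobs: a finer time_precision reduces the rounding error at the cost of more exploded rows. For example, only this variable changes and the rest of the pipeline stays the same:

time_precision = "10s"  # at most 5s of rounding error, ~6x more exploded rows than "1m"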
