How can I vectorize a for loop that runs over a pandas PeriodIndex, where I need to sort datetimes into the appropriate period?

q1qsirdb, posted on 2023-04-18 in Other

I have a DataFrame "timeseries" whose index is a DatetimeIndex, and a PeriodIndex "on":

import numpy as np
import pandas as pd

timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

I want to add a column to "timeseries" that holds, for each datetime, the period in "on" that contains it:

                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaN
2000-01-03 20:00:00+00:00     5.0         NaN
2000-01-04 13:00:00+00:00     3.0  2000-01-04

So far I have done this with a for loop:

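# Start with an object column so Period values can be stored next to NaN
timeseries["period"] = pd.Series(np.nan, index=timeseries.index, dtype=object)
for period in on:
    # Period bounds are tz-naive, so localize them to match the UTC index
    datetimes_in_period = timeseries.index[
        (timeseries.index >= period.start_time.tz_localize("UTC"))
        & (timeseries.index <= period.end_time.tz_localize("UTC"))
    ]
    # A single .loc call avoids chained assignment
    timeseries.loc[datetimes_in_period, "period"] = period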

For efficiency, I would like to avoid the Python loop. How can I vectorize this code?

Answer #1 (eit6fx6z)

You can still use pd.merge_asof, as @rorshan suggested, if you first build a DataFrame of the start/end intervals:

df_on = pd.DataFrame({
   "period":     on,
   "start_time": on.start_time.tz_localize("UTC"), 
   "end_time":   on.end_time.tz_localize("UTC"),
})

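# Match each timestamp to the latest period start at or before it
df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")

# Blank out the period when the timestamp falls outside [start_time, end_time]
inside = (df.index >= df["start_time"]) & (df.index <= df["end_time"])
df["period"] = df["period"].where(inside)

>>> df[["value1", "period"]]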
                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaT
2000-01-03 20:00:00+00:00     5.0         NaT
2000-01-04 13:00:00+00:00     3.0  2000-01-04
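
As a cross-check on the result above, here is a minimal sketch of a different technique from the one in this answer: convert each timestamp to its own daily period and keep it only if that period appears in `on`. It assumes every period in `on` has daily frequency and that the timestamps should be bucketed by their UTC calendar day. The shortened `idx` is illustrative data, not the full series from the question.

```python
import pandas as pd

# Illustrative subset of the question's timestamps (tz-aware, UTC)
idx = pd.DatetimeIndex(
    [
        "2000-01-01 12:00:00Z",
        "2000-01-02 18:00:00Z",
        "2000-01-03 14:00:00Z",
        "2000-01-04 13:00:00Z",
    ]
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# Drop the timezone (keeping the UTC wall-clock time), then bucket by day
day = idx.tz_localize(None).to_period("D")

# Keep the daily period only where it is one of the periods in `on`
period = pd.Series(day, index=idx).where(day.isin(on))
```

Timestamps whose day is missing from `on` (here 2000-01-03) end up as NaT. This shortcut only applies because the periods share one fixed frequency; the interval-based merge above does not rely on that.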

Answer #2 (bakd9h0s)

Note: this answer was originally written for a question that was later edited to change its intent.

My solution:

import pandas as pd

timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# merge_asof needs identical key dtypes, so give the period starts the same UTC timezone
on_ts = on.to_timestamp().tz_localize("UTC")

merge = (
    pd.merge_asof(
        timeseries.index.to_frame(),
        on_ts.to_frame(),
        left_index=True,
        right_index=True,
    )
    .drop("0_x", axis=1)
    .reset_index()
    .rename({"0_y": "date", "index": "period"}, axis=1)
)

# Collect any date in `on` that has no matching timestamp in `timeseries`
unmatched_periods = on_ts.difference(merge.date).to_frame()
unmatched_periods[0] = pd.NaT

# One row per matched date, holding the list of its timestamps
merge = merge.groupby("date").agg(func=lambda x: list(x))
unmatched_periods.columns = merge.columns
merge = pd.concat((merge, unmatched_periods))
merge

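I had never used PeriodIndex before and had to convert it to a DatetimeIndex with to_timestamp. From the documentation, PeriodIndex seems intended for creating dates/periods programmatically (for example, every X days between two dates), which does not quite seem to be how it is used here.
In any case, the core of the solution is merge_asof, which is like merge but does not require equal keys; it looks for the nearest key instead. By default it searches backward, which is what we want (the nearest date in on that is at or before the date in timeseries).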
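
To make the search direction concrete, here is a small sketch on hypothetical toy data (the frames `left` and `right` and their `label` column are not from the question). It compares the default backward search with `direction="forward"`.

```python
import pandas as pd

# Hypothetical toy data: three timestamps and three period start dates
left = pd.DataFrame(
    {"t": pd.to_datetime(["2000-01-01 12:00", "2000-01-03 14:00", "2000-01-04 13:00"])}
)
right = pd.DataFrame(
    {
        "start": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"]),
        "label": ["a", "b", "c"],
    }
)

# Default direction="backward": last `start` that is <= `t`
backward = pd.merge_asof(left, right, left_on="t", right_on="start")

# direction="forward": first `start` that is >= `t`
forward = pd.merge_asof(left, right, left_on="t", right_on="start", direction="forward")

backward_labels = backward["label"].tolist()
forward_labels = forward["label"].tolist()
```

With the backward search, the 2000-01-03 timestamp is paired with the 2000-01-02 start ("b") even though it does not fall on that day, so a separate containment check is still needed when gaps between periods matter. With the forward search, the last timestamp has no later start and gets a missing label.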
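We then use groupby and agg to build the groups.
We also need to pick up the dates in on that match nothing in timeseries (2000-01-05 in this example).
Note: you said you want to avoid loops for efficiency. In theory that is a good idea, but be aware that the result you are trying to produce (lists as values in a column) is itself fairly inefficient, and it is also quite demanding on top of a pandas groupby.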
