How can I vectorize a for loop running over a pandas PeriodIndex where I need to sort datetimes into the appropriate period?

Posted by q1qsirdb on 2023-04-18 in Other


I have a DataFrame "timeseries" with a datetime index, and a PeriodIndex "on":

import numpy as np
import pandas as pd

timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

I want to add a column to "timeseries" containing the period in "on" that each datetime falls into:

                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaN
2000-01-03 20:00:00+00:00     5.0         NaN
2000-01-04 13:00:00+00:00     3.0  2000-01-04

So far I have achieved this with a for loop:

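# object dtype so the column can hold Period values next to NaN
timeseries["period"] = pd.Series(np.nan, index=timeseries.index, dtype=object)
for period in on:
    datetimes_in_period = timeseries.index[
        (timeseries.index >= period.start_time.tz_localize("UTC"))
        & (timeseries.index <= period.end_time.tz_localize("UTC"))
    ]
    # one .loc call instead of chained indexing, so the write reaches the frame itself
    timeseries.loc[datetimes_in_period, "period"] = period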

To improve efficiency, I would like to avoid the Python loop. How can I vectorize this code?

pandas

Source: https://stackoverflow.com/questions/76033370/how-can-i-vectorize-a-for-loop-running-over-a-pandas-periodindex-where-i-need-to

2 answers


eit6fx6z1#

You can still use .merge_asof, as @rorshan suggested, if you first build a DataFrame holding the start and end of each interval:

df_on = pd.DataFrame({
   "period":     on,
   "start_time": on.start_time.tz_localize("UTC"), 
   "end_time":   on.end_time.tz_localize("UTC"),
})

df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")

# blank out period when not inside
df["period"] = df["period"].where((df.index >= df["start_time"]) & (df.index <= df["end_time"]))

>>> df[["value1", "period"]]
                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaT
2000-01-03 20:00:00+00:00     5.0         NaT
2000-01-04 13:00:00+00:00     3.0  2000-01-04
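
For fixed-frequency periods like the daily ones here, a shorter route may also work. The sketch below is not taken from either answer. It assumes that every period in on has the same frequency ("D") and that, as in the question's loop, period boundaries are meant in UTC. Each timestamp is converted to the day it belongs to, and the result is kept only where that day appears in on. The variable name day is made up for illustration.

```python
import pandas as pd

timeseries = pd.DataFrame(
    {"value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0]},
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# Assumption: timestamps are UTC, so dropping the timezone keeps the same
# wall-clock values; each timestamp is then mapped to its daily period.
day = timeseries.index.to_series().dt.tz_localize(None).dt.to_period("D")

# Keep the period only where it is one of the periods in `on`; the rest become NaT.
timeseries["period"] = day.where(day.isin(on))

print(timeseries)
```

Unlike merge_asof, this does not need sorted input, but it only applies when the periods are regular enough to be derived from the timestamp alone.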

Answered 2023-04-18

bakd9h0s2#

Note: this answer was originally written for a question that was later edited to change its intent.

My solution:

import pandas as pd

timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# localize the period start times to UTC so they match the tz-aware index of timeseries
on_ts = on.to_timestamp().tz_localize("UTC")

merge = (pd.merge_asof(timeseries.index.to_frame(),
                    on_ts.to_frame(),
                    right_index=True, left_index=True)
                    .drop('0_x', axis=1)
                    .reset_index()
                    .rename({'0_y':'date', 'index':'period'}, axis=1)
        )

# extract from `on` any date that has no matching date in timeseries
unmatched_periods = on_ts.difference(merge.date).to_frame()
unmatched_periods[0] = pd.NaT

merge = merge.groupby('date').agg(func=lambda x: list(x))
unmatched_periods.columns = merge.columns
merge = pd.concat((merge, unmatched_periods))
merge

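I had never used PeriodIndex before and had to convert it to a DatetimeIndex with to_timestamp. From the documentation, PeriodIndex seems intended for creating dates/periods programmatically (for example, every X days between two dates), which is not quite how it is used here.
In any case, the core of the solution is merge_asof, which is similar to merge but does not require equal keys; it looks for the nearest key instead. By default it searches backward, which is what we want (the closest date in on that is at or before the date in timeseries).
We then use groupby and agg to build the groups.
We also need to pick up the dates in on that have no match in timeseries (2000-01-05 in this example).
Note: you say you want to avoid loops for efficiency. In theory that is a good idea, but be aware that the result you are trying to produce (lists as the values of a column) is itself rather inefficient, and the pandas groupby on top of it is also fairly demanding.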

Answered 2023-04-18
