Getting the next element in a time series in Pandas

Posted by jexiocij on 2023-01-19 in Other


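I have a DataFrame like the one below, and I want to create a new column, next_domain.
For each row, it should hold the next domain visited by the same IP, ordered by timestamp. If the domain is the last one for that IP, the value is N/A. How can I do this in Pandas?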

Input:

    domain      ip      timestamp
0   Google      101     2020-04-01 23:01:41
1   Google      101     2020-04-01 23:01:59
2   Google      101     2020-04-02 12:01:41
3   Facebook    101     2020-04-02 13:11:33
4   Facebook    101     2020-04-02 13:11:35
5   Youtube     103     2020-04-21 13:01:41
6   Youtube     103     2020-04-21 13:11:46
7   Youtube     103     2020-04-22 01:01:01
8   Google      103     2020-04-22 02:11:23
9   Facebook    103     2020-04-23 14:11:13
10  Youtube     103     2020-04-23 14:11:55
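For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the variable name `df` is chosen to match the first answer below, and parsing `timestamp` with `pd.to_datetime` is an assumption about the column's dtype.

```python
import pandas as pd

# Rebuild the sample table from the question.
df = pd.DataFrame({
    "domain": ["Google", "Google", "Google", "Facebook", "Facebook",
               "Youtube", "Youtube", "Youtube", "Google", "Facebook", "Youtube"],
    "ip": [101] * 5 + [103] * 6,
    "timestamp": pd.to_datetime([
        "2020-04-01 23:01:41", "2020-04-01 23:01:59", "2020-04-02 12:01:41",
        "2020-04-02 13:11:33", "2020-04-02 13:11:35", "2020-04-21 13:01:41",
        "2020-04-21 13:11:46", "2020-04-22 01:01:01", "2020-04-22 02:11:23",
        "2020-04-23 14:11:13", "2020-04-23 14:11:55",
    ]),
})

# Sort so that "next" means "next in time within each IP".
df = df.sort_values(["ip", "timestamp"]).reset_index(drop=True)
print(df)
```

The sample data is already in this order, so the sort is a no-op here; it matters only if real data arrives unsorted.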

Expected output:

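In the table below, row 2 has switch = 1 because the same IP switches to Facebook immediately afterwards (according to the timestamps).
Row 7 is a switch because Youtube changes to Google for IP 103, row 8 is a switch because Google changes to Facebook for IP 103, and row 10 is not a switch because no domain follows Youtube.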

    domain      ip      timestamp              next_domain
0   Google      101     2020-04-01 23:01:41    Facebook
1   Google      101     2020-04-01 23:01:59    Facebook
2   Google      101     2020-04-02 12:01:41    Facebook
3   Facebook    101     2020-04-02 13:11:33    N/A
4   Facebook    101     2020-04-02 13:11:35    N/A
5   Youtube     103     2020-04-21 13:01:41    Google
6   Youtube     103     2020-04-21 13:11:46    Google
7   Youtube     103     2020-04-22 01:01:01    Google
8   Google      103     2020-04-22 02:11:23    Facebook
9   Facebook    103     2020-04-23 14:11:13    Youtube
10  Youtube     103     2020-04-23 14:11:55    N/A

pandas

Source: https://stackoverflow.com/questions/70788353/getting-next-element-in-time-series-pandas

2 answers


wmvff8tz1#

You can keep only the first domain of each stretch of identical domains, then bfill and shift within each IP group:

# Assumes df is already sorted by ip and timestamp.
s = df['domain']
df['next_domain'] = (s.where(s.ne(s.shift()))  # keep only first domain of each stretch
                      .groupby(df['ip'], group_keys=False)  # per IP; keep the original index
                      .apply(lambda g: g.bfill().shift(-1))  # bfill and shift up
                    )

Output:

      domain   ip            timestamp next_domain
0     Google  101  2020-04-01 23:01:41    Facebook
1     Google  101  2020-04-01 23:01:59    Facebook
2     Google  101  2020-04-02 12:01:41    Facebook
3   Facebook  101  2020-04-02 13:11:33         NaN
4   Facebook  101  2020-04-02 13:11:35         NaN
5    Youtube  103  2020-04-21 13:01:41      Google
6    Youtube  103  2020-04-21 13:11:46      Google
7    Youtube  103  2020-04-22 01:01:01      Google
8     Google  103  2020-04-22 02:11:23    Facebook
9   Facebook  103  2020-04-23 14:11:13     Youtube
10   Youtube  103  2020-04-23 14:11:55         NaN
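To see why this works, it may help to look at the intermediate values for a single IP. The sketch below replays the three steps on the domains of IP 103 from the question; the names `starts`, `filled` and `next_domain` are mine, not part of the answer.

```python
import pandas as pd

# Domains for IP 103 from the question, already in time order.
s = pd.Series(["Youtube", "Youtube", "Youtube", "Google", "Facebook", "Youtube"])

starts = s.where(s.ne(s.shift()))  # first row of each stretch, NaN elsewhere
filled = starts.bfill()            # each NaN takes the domain of the next stretch start
next_domain = filled.shift(-1)     # look one row ahead; the last row becomes NaN

print(pd.DataFrame({"domain": s, "starts": starts,
                    "filled": filled, "next_domain": next_domain}))
```

The last row of each group ends up as NaN rather than the literal string N/A; if the string is wanted, something like `next_domain.fillna("N/A")` should do it.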


bvhaajcl2#

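import pandas as pd

def function1(dd: pd.DataFrame) -> pd.DataFrame:
    # Number each stretch of consecutive identical domains within one IP.
    dd1 = dd.assign(col1=dd.domain.ne(dd.domain.shift()).cumsum())
    # Keep one row per stretch, ordered by time.
    dd2 = dd1.drop_duplicates(subset='col1', keep='last').sort_values('timestamp')
    # The domain of the following stretch is the next domain for the whole stretch.
    dd3 = dd2.assign(next_domain=dd2.domain.shift(-1))
    return dd1.join(dd3.set_index('col1')['next_domain'], on='col1').drop('col1', axis=1)

# Apply per IP and stitch the pieces back together; the original row index is preserved.
pd.concat([function1(g) for _, g in df.groupby('ip')])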

Output:

      domain   ip            timestamp next_domain
0     Google  101  2020-04-01 23:01:41    Facebook
1     Google  101  2020-04-01 23:01:59    Facebook
2     Google  101  2020-04-02 12:01:41    Facebook
3   Facebook  101  2020-04-02 13:11:33        None
4   Facebook  101  2020-04-02 13:11:35        None
5    Youtube  103  2020-04-21 13:01:41      Google
6    Youtube  103  2020-04-21 13:11:46      Google
7    Youtube  103  2020-04-22 01:01:01      Google
8     Google  103  2020-04-22 02:11:23    Facebook
9   Facebook  103  2020-04-23 14:11:13     Youtube
10   Youtube  103  2020-04-23 14:11:55        None
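The core of this approach is the stretch numbering built with `ne(...shift()).cumsum()`. Below is a small sketch of that idea on the domains of IP 103 from the question. The names `stretch_id` and `lookup` are hypothetical, and `Series.map` is swapped in for the `join` used above just to keep the example short.

```python
import pandas as pd

# Domains for IP 103 from the question, already in time order.
domain = pd.Series(["Youtube", "Youtube", "Youtube", "Google", "Facebook", "Youtube"])

# A new stretch starts wherever the domain differs from the previous row.
stretch_id = domain.ne(domain.shift()).cumsum()

# One row per stretch, then pair each stretch with the domain of the one after it.
per_stretch = (pd.DataFrame({"domain": domain, "stretch_id": stretch_id})
               .drop_duplicates("stretch_id", keep="last")
               .set_index("stretch_id")["domain"])
lookup = per_stretch.shift(-1)

# Broadcast the per-stretch answer back to every row.
next_domain = stretch_id.map(lookup)
print(pd.DataFrame({"domain": domain, "stretch_id": stretch_id,
                    "next_domain": next_domain}))
```

Here the stretch ids come out as 1, 1, 1, 2, 3, 4, so the three Youtube rows share one lookup entry and all receive Google.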

