pandas: counting unique hosts per day from timestamps

qfe3c7zg  posted on 2023-02-07  in Other

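I have two columns in a temporary pandas DataFrame, and I want to count the unique hosts per calendar day, so that requests made on the same day are counted together. The log file behind the DataFrame looks like this: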

10.216.113.172 - - [04/Sep/2009:02:57:16 -0700] "GET /images/filmpics/0000/0053/quietman2.jpeg HTTP/1.1" 200 1077924
10.211.47.159 - - [03/Sep/2009:22:19:49 -0700] "GET /quietman4.jpeg HTTP/1.1" 404 212
10.211.47.159 - - [22/Aug/2009:12:58:27 -0700] "GET /assets/img/closelabel.gif HTTP/1.1" 304 -
10.216.113.172 - - [14/Jan/2010:03:09:17 -0800] "GET /images/filmmediablock/229/Shinjuku5.jpg HTTP/1.1" 200 443005
10.211.47.159 - - [15/Oct/2009:21:21:58 -0700] "GET /assets/img/banner/ten-years-banner-grey.jpg HTTP/1.1" 304 -
10.216.113.172 - - [12/Aug/2009:05:57:55 -0700] "GET /about-us/people/ HTTP/1.1" 200 10773
10.211.47.159 - - [24/Aug/2009:13:16:26 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
10.211.47.159 - - [03/Sep/2009:21:30:27 -0700] "GET /images/newspics/0000/0017/Mike5_thumb.JPG HTTP/1.1" 304 -
10.211.47.159 - - [15/Oct/2009:20:30:43 -0700] "GET /images/filmpics/0000/0057/quietman4.jpeg HTTP/1.1" 304 -
10.211.47.159 - - [11/Aug/2009:20:34:44 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -

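My DataFrame temp holds these timestamps (e.g. 03/Sep/2009:22:19:49 -0700) and hosts (e.g. 10.211.47.159). The expected output is [2,1,1,2,1,1]. The -0700 is the offset that has to be subtracted, and if that moves the time into the previous day, the date must be pushed back by one day as well. Below is my code, but its output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. Can someone help?
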
temp = pandas.DataFrame()
temp['timestamp'] = Mainpanda['timestamp']
temp['host'] = Mainpanda['host']

temp['timestamp'] = pandas.to_datetime(temp['timestamp'],
                                       format='%d/%b/%Y:%H:%M:%S %z')
temp['timestamp'] = temp['timestamp'] - pandas.Timedelta(hours=7)

counts = temp.groupby('timestamp')['host'].nunique().reset_index()

counts = counts.sort_values(by='timestamp')

counts = counts['host'].tolist()

print(counts)
Answer #1, by wfauudbj

First parse the strings into datetimes, then convert all of them to UTC. Working in a single time zone is important for handling the data correctly:

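import pandas as pd

# Parse the Apache-style timestamps, keeping each row's own UTC offset
temp['timestamp'] = pd.to_datetime(temp['timestamp'], format='%d/%b/%Y:%H:%M:%S %z')
# Convert every timestamp to UTC so that all rows share one time zone
temp['ts_utc'] = pd.DatetimeIndex([dt.tz_convert('UTC') for dt in temp['timestamp']])

# Count the rows (visits) per host and per UTC calendar day
visits = (temp.groupby(['host', pd.Grouper(key='ts_utc', freq='D')]).size()
              .to_frame('visit').reset_index())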

Output:

>>> visits

             host                    ts_utc  visit
0   10.211.47.159 2009-08-12 00:00:00+00:00      1
1   10.211.47.159 2009-08-22 00:00:00+00:00      1
2   10.211.47.159 2009-08-24 00:00:00+00:00      1
3   10.211.47.159 2009-09-04 00:00:00+00:00      2
4   10.211.47.159 2009-10-16 00:00:00+00:00      2
5  10.216.113.172 2009-08-12 00:00:00+00:00      1
6  10.216.113.172 2009-09-04 00:00:00+00:00      1
7  10.216.113.172 2010-01-14 00:00:00+00:00      1
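
The table above counts visits per host and day. To get the list the question asks for (unique hosts per day), one possible variation is sketched below. It assumes that "day" means the UTC calendar day, and it rebuilds temp from the ten log lines in the question so that it runs on its own. The original code returned all 1s because it grouped by the full timestamp, and every timestamp in the sample is distinct; grouping by the day instead lets rows from the same day fall into one group. Passing utc=True to pd.to_datetime is a swapped-in shortcut for the tz_convert step, and it also copes with the mix of -0700 and -0800 offsets in one column.

```python
import pandas as pd

# Sample rows copied from the log excerpt in the question.
temp = pd.DataFrame({
    "timestamp": [
        "04/Sep/2009:02:57:16 -0700",
        "03/Sep/2009:22:19:49 -0700",
        "22/Aug/2009:12:58:27 -0700",
        "14/Jan/2010:03:09:17 -0800",
        "15/Oct/2009:21:21:58 -0700",
        "12/Aug/2009:05:57:55 -0700",
        "24/Aug/2009:13:16:26 -0700",
        "03/Sep/2009:21:30:27 -0700",
        "15/Oct/2009:20:30:43 -0700",
        "11/Aug/2009:20:34:44 -0700",
    ],
    "host": [
        "10.216.113.172", "10.211.47.159", "10.211.47.159", "10.216.113.172",
        "10.211.47.159", "10.216.113.172", "10.211.47.159", "10.211.47.159",
        "10.211.47.159", "10.211.47.159",
    ],
})

# utc=True parses each row's offset and converts the result to UTC in one step.
temp["ts_utc"] = pd.to_datetime(
    temp["timestamp"], format="%d/%b/%Y:%H:%M:%S %z", utc=True
)

# Group by the calendar day (not the full timestamp) and count distinct hosts.
unique_hosts = (
    temp.groupby(temp["ts_utc"].dt.floor("D"))["host"].nunique().sort_index()
)

counts = unique_hosts.tolist()
print(counts)  # → [2, 1, 1, 2, 1, 1]
```

If the days should instead follow some other time zone, the same grouping applies after converting ts_utc with dt.tz_convert to that zone; which zone is intended is not stated in the question.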
