My temporary pandas DataFrame has two columns, and I want to count the unique hosts per day based on the timestamp. The log file the DataFrame was built from looks like this:
10.216.113.172 - - [04/Sep/2009:02:57:16 -0700] "GET /images/filmpics/0000/0053/quietman2.jpeg HTTP/1.1" 200 1077924
10.211.47.159 - - [03/Sep/2009:22:19:49 -0700] "GET /quietman4.jpeg HTTP/1.1" 404 212
10.211.47.159 - - [22/Aug/2009:12:58:27 -0700] "GET /assets/img/closelabel.gif HTTP/1.1" 304 -
10.216.113.172 - - [14/Jan/2010:03:09:17 -0800] "GET /images/filmmediablock/229/Shinjuku5.jpg HTTP/1.1" 200 443005
10.211.47.159 - - [15/Oct/2009:21:21:58 -0700] "GET /assets/img/banner/ten-years-banner-grey.jpg HTTP/1.1" 304 -
10.216.113.172 - - [12/Aug/2009:05:57:55 -0700] "GET /about-us/people/ HTTP/1.1" 200 10773
10.211.47.159 - - [24/Aug/2009:13:16:26 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
10.211.47.159 - - [03/Sep/2009:21:30:27 -0700] "GET /images/newspics/0000/0017/Mike5_thumb.JPG HTTP/1.1" 304 -
10.211.47.159 - - [15/Oct/2009:20:30:43 -0700] "GET /images/filmpics/0000/0057/quietman4.jpeg HTTP/1.1" 304 -
10.211.47.159 - - [11/Aug/2009:20:34:44 -0700] "GET /assets/img/search-button.gif HTTP/1.1" 304 -
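My DataFrame temp holds these timestamps (e.g. 03/Sep/2009:22:19:49 -0700) and hosts (e.g. 10.211.47.159). The expected output is [2,1,1,2,1,1], but my output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. The -0700 is the offset that has to be subtracted, and if that pushes the time back past midnight, the date has to move back by one day as well. Could someone help? Here is my code: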
temp = pandas.DataFrame()
temp['timestamp'] = Mainpanda['timestamp']
temp['host'] = Mainpanda['host']
temp['timestamp'] = pandas.to_datetime(temp['timestamp'],
                                       format='%d/%b/%Y:%H:%M:%S %z')
temp['timestamp'] = temp['timestamp'] - pandas.Timedelta(hours=7)
counts = temp.groupby('timestamp')['host'].nunique().reset_index()
counts = counts.sort_values(by='timestamp')
counts = counts['host'].tolist()
print(counts)
1 Answer
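First convert the column to datetimes, then align all of them to UTC. Putting every timestamp in the same timezone matters for handling the data correctly, because the log mixes -0700 and -0800 offsets.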
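The answer's own code is not reproduced here, so the following is only a minimal sketch of the approach it describes. It assumes the timestamps are strings in the format shown in the question; `log_rows` is a hypothetical stand-in for `Mainpanda`, built from the ten sample log lines. Passing `utc=True` to `pandas.to_datetime` converts the mixed offsets to UTC, and grouping by the calendar date (rather than the full timestamp) gives one count per day.

```python
import pandas as pd

# Hypothetical stand-in for Mainpanda: (host, timestamp) pairs taken from
# the ten sample log lines in the question.
log_rows = [
    ("10.216.113.172", "04/Sep/2009:02:57:16 -0700"),
    ("10.211.47.159", "03/Sep/2009:22:19:49 -0700"),
    ("10.211.47.159", "22/Aug/2009:12:58:27 -0700"),
    ("10.216.113.172", "14/Jan/2010:03:09:17 -0800"),
    ("10.211.47.159", "15/Oct/2009:21:21:58 -0700"),
    ("10.216.113.172", "12/Aug/2009:05:57:55 -0700"),
    ("10.211.47.159", "24/Aug/2009:13:16:26 -0700"),
    ("10.211.47.159", "03/Sep/2009:21:30:27 -0700"),
    ("10.211.47.159", "15/Oct/2009:20:30:43 -0700"),
    ("10.211.47.159", "11/Aug/2009:20:34:44 -0700"),
]
temp = pd.DataFrame(log_rows, columns=["host", "timestamp"])

# Parse the strings; utc=True converts every offset (-0700, -0800) to UTC,
# so no manual Timedelta arithmetic is needed.
temp["timestamp"] = pd.to_datetime(
    temp["timestamp"], format="%d/%b/%Y:%H:%M:%S %z", utc=True
)

# Group by the UTC calendar date, not the full timestamp, and count
# distinct hosts per day. groupby sorts the dates in ascending order.
counts = temp.groupby(temp["timestamp"].dt.date)["host"].nunique().tolist()
print(counts)  # [2, 1, 1, 2, 1, 1]
```

The code in the question returned all ones because it grouped by the full timestamp: every timestamp in the sample is distinct, so each row became its own group with exactly one host.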
Output: