I have a CSV file with a few thousand rows. It contains data sent from different devices. The readings are supposed to arrive regularly (every 10 minutes), but sometimes they are delayed. I am trying to write a program that highlights every case where the gap between two consecutive readings is greater than 15 minutes.
I have working code, but to use it I first have to manually edit the CSV and change the eventTime column from a time format (e.g. 03:22:00) to a float expressed as a fraction of a day (e.g. 03:22:00 becomes 0.14027). Likewise, the 15-minute interval becomes 0.01042 (15 / (60 * 24)); the arithmetic is sketched after the code below.
```python
import pandas as pd

df = pd.read_csv('file.csv')
df2 = pd.DataFrame()
deviceID = df["deviceId"].unique().tolist()
threshold = 0.01042  # 15 minutes as a fraction of a day: 15 / (60 * 24)

for id_no in range(len(deviceID)):
    subset = df[df.deviceId == deviceID[id_no]]
    for row in range(len(subset) - 1):
        difference = subset.iloc[row, 1] - subset.iloc[row + 1, 1]
        if difference > threshold:
            df2 = df2.append(subset.iloc[row])
            df2 = df2.append(subset.iloc[row + 1])

df2.to_csv('file2.csv')
```
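For reference, here is where those constants come from; a minimal sketch of the manual Excel-style conversion (the helper name is an illustration, not part of the script above):

```python
# A time of day becomes a fraction of a day (hours / 24)
def time_to_day_fraction(hhmmss: str) -> float:
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return (h * 3600 + m * 60 + s) / 86400  # 86400 seconds in a day

print(time_to_day_fraction("03:22:00"))  # 0.140277... ~ 0.14027
print(time_to_day_fraction("00:15:00"))  # 0.010416... ~ 0.01042
```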
This works, and I can open the resulting CSV in Excel and manually convert the float values back to a time format, but that becomes impractical when I may have to process several hundred CSV files.
So I tried the following instead:
```python
import pandas as pd
from datetime import datetime

df = pd.read_csv('file.csv')
df2 = pd.DataFrame()
deviceID = df["deviceId"].unique().tolist()
df['eventTime'].apply(lambda x: datetime.strptime(x, "%H:%M:%S"))
threshold = datetime.strptime("00:15:00", '%H:%M:%S')

for id_no in range(len(deviceID)):
    subset = df[df.deviceId == deviceID[id_no]]
    for row in range(len(subset) - 1):
        difference = datetime.strptime(subset.iloc[row, 1], '%H:%M:%S') - datetime.strptime(subset.iloc[row + 1, 1], '%H:%M:%S')
        if difference > threshold:
            df2 = df2.append(subset.iloc[row])
            df2 = df2.append(subset.iloc[row + 1])

df2.to_csv('file2.csv')
```
But I get the following error:

```
    if difference > threshold:
TypeError: '>' not supported between instances of 'datetime.timedelta' and 'datetime.datetime'
```
The data looks like this:
| eventTime | deviceId |
| --------- | -------- |
| 15:30:00  | 11234889 |
| 15:45:00  | 11234889 |
| 16:00:00  | 11234889 |
and the same for a different device ID:
| eventTime | deviceId |
| --------- | -------- |
| 15:30:00  | 11234890 |
| 15:45:00  | 11234890 |
| 16:00:00  | 11234890 |
2 Answers

Answer #1
`threshold` is a datetime, and you are comparing it with a timedelta object (`difference`). What you meant was:
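A minimal, runnable sketch of that fix, using made-up timestamps for illustration; the key change from the question's second script is constructing the threshold with timedelta instead of strptime:

```python
from datetime import datetime, timedelta

# The threshold must be a timedelta so it can be compared with the
# subtraction result, which is also a timedelta.
threshold = timedelta(minutes=15)

# Illustrative values: a 20-minute gap between two readings
difference = (datetime.strptime("16:05:00", '%H:%M:%S')
              - datetime.strptime("15:45:00", '%H:%M:%S'))

print(difference > threshold)  # True
```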
Answer #2

Given a dataframe like the sample data above, this is a step-by-step way of getting to where you want. It is definitely not the most optimal, but it is clear enough to teach you how to avoid looping over dataframes, which is a major no-no. Try running and printing the dataframe after every step so you know what is happening.
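A sketch along those lines, assuming the eventTime and deviceId column names from the sample data; the diff and diff_seconds helper columns are names chosen here for illustration:

```python
import pandas as pd

df = pd.read_csv('file.csv')

# Step 1: parse the HH:MM:SS strings once, as timedeltas
df['eventTime'] = pd.to_timedelta(df['eventTime'])

# Step 2: gap to the previous reading of the same device (vectorised, no loop)
df['diff'] = df.groupby('deviceId')['eventTime'].diff()

# Step 3: express the gap in seconds for an easy comparison
df['diff_seconds'] = df['diff'].dt.total_seconds()

# Step 4: keep readings that arrived more than 15 minutes after the previous one
late = df[df['diff_seconds'] > 15 * 60]
late.to_csv('file2.csv', index=False)
```

Note that, unlike the loop version, this keeps only the later reading of each late pair; the earlier one could be pulled in with a shift on the same grouping if both rows are wanted.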
You get the flagged rows, with each gap shown as a timedelta (printed with a leading `0 days`). Cleaning up the `0 days` is up to you. You can also change `diff_seconds` to minutes, but that's easy enough.
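For example, a one-line sketch reusing the assumed diff column from above:

```python
# Report the gap in minutes instead of seconds
df['diff_minutes'] = df['diff'].dt.total_seconds() / 60
```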