pandas: subtracting times row by row from a CSV in Python

hts6caw3  posted on 2022-12-02  in Python

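I have a CSV file with a few thousand rows. It contains data sent from different devices. They are supposed to transmit frequently (every 10 minutes), but sometimes there are delays. I am trying to write a program that highlights every case where the delay between two readings is greater than 15 minutes.
I have written working code, but to use it I first have to edit the CSV by hand, changing the "eventTime" variable from a time format (e.g. 03:22:00) to a float expressed as a fraction of a day (e.g. 03:22:00 becomes 0.14027). Similarly, the 15-minute interval becomes 0.01042 (15/(60*24)).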

import pandas as pd

df = pd.read_csv('file.csv')
df2 = pd.DataFrame()
deviceID = df["deviceId"].unique().tolist()
threshold = 0.01042
for id_no in range(0, len(deviceID)):
    subset = df[df.deviceId == deviceID[id_no]]
    for row in range(len(subset)-1):
        difference = subset.iloc[row, 1] - subset.iloc[row+1, 1]
        if difference > threshold:
            df2 = df2.append(subset.iloc[row])
            df2 = df2.append(subset.iloc[row+1])

df2.to_csv('file2.csv')

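This works, and I can open the CSV in Excel and manually change the float values back to a time format, but that becomes impractical when I may have several hundred CSV files to process.
I tried the following: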

import pandas as pd
from datetime import datetime

df = pd.read_csv('file.csv')
df2 = pd.DataFrame()
deviceID = df["deviceId"].unique().tolist()
df['eventTime'].apply(lambda x: datetime.strptime(x, "%H:%M:%S"))
threshold = datetime.strptime("00:15:00", '%H:%M:%S')
for id_no in range(0, len(deviceID)):
    subset = df[df.deviceId == deviceID[id_no]]
    for row in range(len(subset)-1):
        difference = datetime.strptime(subset.iloc[row, 1],'%H:%M:%S') - datetime.strptime(subset.iloc[row+1, 1], '%H:%M:%S')
        if difference > threshold:
            df2 = df2.append(subset.iloc[row])
            df2 = df2.append(subset.iloc[row+1])

df2.to_csv('file2.csv')

But I get the following error:

if difference > threshold:
TypeError: '>' not supported between instances of 'datetime.timedelta' and 'datetime.datetime'

The data looks like this:

| eventTime|  deviceId|
| -------- | -------- |
| 15:30:00 |  11234889|
| 15:45:00 |  11234889|
| 16:00:00 |  11234889|

And with a different ID:

| eventTime|  deviceId|
| -------- | -------- |
| 15:30:00 |  11234890|
| 15:45:00 |  11234890|
| 16:00:00 |  11234890|

fjaof16o  #1

threshold is a datetime, and you are comparing it with a timedelta object (difference). You probably meant:

from datetime import timedelta
...
threshold = timedelta(minutes=15)
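
As a minimal sketch of that fix (the sample times below are made up for illustration, not taken from the asker's file), subtracting two parsed datetimes gives a timedelta, which can then be compared against a timedelta threshold:

```python
from datetime import datetime, timedelta

# Hypothetical sample readings for one device, in chronological order.
times = ["15:30:00", "15:45:00", "16:05:00"]

threshold = timedelta(minutes=15)

# Parse once, then compare each reading with the one before it.
parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
late_pairs = [
    (earlier.strftime("%H:%M:%S"), later.strftime("%H:%M:%S"))
    for earlier, later in zip(parsed, parsed[1:])
    if later - earlier > threshold
]

print(late_pairs)
```

Note that this sketch subtracts the earlier reading from the later one. If the rows are in ascending time order, the question's loop (current row minus next row) would produce negative differences that never exceed the threshold.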

3htmauhk  #2

Given this dataframe:

actual_ts    id
0  05:00:00  SPAM
1   5:15:00  SPAM
2   5:33:00  SPAM <-- Should highlight
3   5:45:00  SPAM
4   6:02:00  SPAM <-- Should highlight
5  11:15:00   FOO
6  11:32:00   FOO <-- Should highlight
7  11:45:00   FOO
8  12:08:00   FOO <-- Should highlight

This is a step-by-step way of getting to where you want. It is definitely not optimal, but it's clear enough to teach you how to avoid looping over dataframes, which is a major no-no. Try running it and printing the dataframe after every step so you know what's happening.

# Convert column to timedelta.
df["actual_ts"] = pd.to_timedelta(df["actual_ts"])

# Sort as a best practice if not computationally expensive.
df = df.sort_values(by=["id", "actual_ts"])

# Shift the actual_ts by one row per group.
df["lagged_ts"] = df.groupby(["id"])["actual_ts"].shift(1)

# Fill nulls with same time if you want to avoid NaNs and NaTs.
df["lagged_ts"] = df["lagged_ts"].fillna(df["actual_ts"])

# Calculate difference in seconds.
df["diff_seconds"] = (df["actual_ts"] - df["lagged_ts"]).dt.seconds

# Mark as True all events greater than 15 minutes.
df["highlight"] = df["diff_seconds"] > 900

# Keep all columns you need.
new_df = df[["actual_ts", "id", "diff_seconds", "highlight"]]

You get this:

        actual_ts    id  diff_seconds  highlight
5 0 days 11:15:00   FOO             0      False
6 0 days 11:32:00   FOO          1020       True
7 0 days 11:45:00   FOO           780      False
8 0 days 12:08:00   FOO          1380       True
0 0 days 05:00:00  SPAM             0      False
1 0 days 05:15:00  SPAM           900      False
2 0 days 05:33:00  SPAM          1080       True
3 0 days 05:45:00  SPAM           720      False
4 0 days 06:02:00  SPAM          1020       True
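
One caveat worth checking against your own data: `.dt.seconds` returns only the seconds component of each timedelta (0 to 86399), not the full duration. For sorted, same-day data like the above the two agree, but if a gap could exceed a day, `.dt.total_seconds()` is likely the safer choice. A small sketch with made-up values:

```python
import pandas as pd

# Two illustrative gaps: 17 minutes, and 1 day plus 5 minutes.
gaps = pd.Series(pd.to_timedelta(["0 days 00:17:00", "1 days 00:05:00"]))

# Seconds component only: the whole day is dropped from the second value.
component = gaps.dt.seconds.tolist()

# Full duration in seconds, returned as floats.
total = gaps.dt.total_seconds().tolist()

print(component)
print(total)
```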

Cleaning up the 0 days is up to you. You can also change diff_seconds to minutes but that's easy enough.
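
For what it's worth, here is one possible way to do both cleanups, as a sketch rather than the only approach. The inline CSV text is made-up sample data, and `format_hms` is a hypothetical helper introduced just for this example; it assumes all times fall within a single day.

```python
import io
import pandas as pd

# Made-up sample standing in for the CSV; column names follow this answer.
csv_text = """actual_ts,id
05:00:00,SPAM
05:15:00,SPAM
05:33:00,SPAM
11:15:00,FOO
11:32:00,FOO
"""
df = pd.read_csv(io.StringIO(csv_text))

df["actual_ts"] = pd.to_timedelta(df["actual_ts"])
df = df.sort_values(by=["id", "actual_ts"])

# Gap to the previous reading of the same id (NaT for each group's first row).
gap = df["actual_ts"] - df.groupby("id")["actual_ts"].shift(1)

# Express the gap in minutes; first rows of each group count as 0.
df["diff_minutes"] = gap.dt.total_seconds().fillna(0) / 60
df["highlight"] = df["diff_minutes"] > 15


def format_hms(td):
    """Hypothetical helper: render a Timedelta as HH:MM:SS without '0 days'."""
    total = int(td.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


df["actual_ts"] = df["actual_ts"].map(format_hms)

# Keep only the delayed readings.
flagged = df[df["highlight"]]
print(flagged)
```

Calling `flagged.to_csv('file2.csv', index=False)` afterwards would write only the delayed readings, with the times already back in their original text format.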
