pandas: How do I convert datetimes when working with a large dataset?

ttvkxqim  posted on 2022-12-31  in  Other
Follow (0) | Answers (3) | Views (100)

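I am using Colab and trying to split off the last 2 months of data as a test set, but I get this error (ValueError: Both dates must have the same UTC offset). I know the error occurs because the start date of the range is in BST and the end date is in GMT.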

latest_df = df.loc['Sat 01 Oct 2022 12:00:03 AM BST':'Thu 01 Dec 2022 10:02:02 AM GMT']

latest_df.head()
I tried converting the times manually in Excel, but converting every date takes too long because the dataset is large.
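To make the cause of the error concrete, here is a minimal sketch (not part of the original post) showing that the two bounds sit at different UTC offsets. It assumes both timestamps are UK wall-clock times in the `Europe/London` zone; the variable names are illustrative only.

```python
import pandas as pd

# Assumption: both bounds are UK local times (BST in summer, GMT in winter)
start = pd.Timestamp("2022-10-01 00:00:03", tz="Europe/London")
end = pd.Timestamp("2022-12-01 10:02:02", tz="Europe/London")

# utcoffset() returns the distance from UTC as a datetime.timedelta
start_offset = start.utcoffset()  # BST is one hour ahead of UTC
end_offset = end.utcoffset()      # GMT has no offset from UTC

print(start_offset, end_offset)
```

Because the offsets differ, bounds written as strings with "BST" and "GMT" cannot be compared directly; expressing both in one zone removes the mismatch.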


zpqajqem1#

You can use the pytz library to convert the dates to the same time zone. Here is an example:

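from datetime import datetime

import pytz

# Both bounds are UK wall-clock times, so a single zone covers BST and GMT
london = pytz.timezone('Europe/London')
fmt = '%a %d %b %Y %I:%M:%S %p'

# Parse each bound without its zone abbreviation, localize it to
# Europe/London, then convert to UTC so both share the same UTC offset
start_date = london.localize(datetime.strptime('Sat 01 Oct 2022 12:00:03 AM', fmt)).astimezone(pytz.utc)
end_date = london.localize(datetime.strptime('Thu 01 Dec 2022 10:02:02 AM', fmt)).astimezone(pytz.utc)

# Select the rows between the start and end dates
# (assumes df has a sorted, timezone-aware DatetimeIndex)
latest_df = df.loc[start_date:end_date]
latest_df.head()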

eoigrqb62#

Since I don't know the column names, I'll assume they are A to F. You can replace them with your own column names in the code:

import random
import pandas as pd
import numpy as np
import datetime
import pytz

# Create some sample data for testing
data = [
    'Sat 01 Oct 2022 12:00:03 AM BST',
    'Sat 01 Oct 2022 11:00:03 AM BST',
    'Sat 01 Oct 2022 10:00:03 AM BST',
    'Thu 01 Dec 2022 9:02:02 AM GMT',
    'Thu 01 Dec 2022 8:02:02 AM GMT',
    'Thu 01 Dec 2022 7:02:02 AM GMT'
]

df = pd.DataFrame(
    {
        "A": pd.Series(data),
        "B": pd.Series(np.random.randint(0,100,size=(6,))),
        "C": pd.Series(np.random.randint(0,100,size=(6,))),
        "D": pd.Series(np.random.randint(0,100,size=(6,))),
        "E": pd.Series(np.random.randint(0,100,size=(6,))),
        "F": pd.Series(np.random.randint(0,100,size=(6,)))
    })

# Create a new column holding the zone abbreviation (last 3 characters)
df["offset"] = df.A.apply(lambda x: x[-3:])

# Parse the rest of each string into a naive datetime; the trailing
# " BST"/" GMT" is stripped first because it is already kept in "offset"
df["A"] = df.A.apply(lambda x: pd.to_datetime(x[:-4]))

>>> df

Output:

                    A   B   C   D   E   F offset
0 2022-10-01 00:00:03  60  39  66  49  31    BST
1 2022-10-01 11:00:03  25  87  42  74  39    BST
2 2022-10-01 10:00:03  82  95  36  45  30    BST
3 2022-12-01 09:02:02  27  21  44  58  74    GMT
4 2022-12-01 08:02:02  33  38  23  97  57    GMT
5 2022-12-01 07:02:02  53  42  32  67  95    GMT

I wrote a custom function to convert the time zone and applied it to the DataFrame:

# Write a function to convert the timestamps to UTC/GMT
def convert_datetime_timezone(dt, tz1, tz2="UTC"):
    """
    dt: naive datetime (pandas Timestamp)
    tz1: initial time zone abbreviation ("BST" or "GMT")
    tz2: target time zone, default="UTC"
    """
    if tz1 == "BST":
        tz1 = pytz.timezone("Europe/London")
        tz2 = pytz.timezone(tz2)

        # Attach the London zone, convert to the target zone, then
        # round-trip through a string to drop the zone info again
        dt = tz1.localize(dt)
        dt = dt.astimezone(tz2)
        dt = dt.strftime("%Y-%m-%d %H:%M:%S")
        converted_dt = pd.to_datetime(dt)
        return converted_dt
    else:
        # GMT rows are returned unchanged, which is only correct
        # while the target zone is UTC
        return dt

# Apply the function and drop the offset column
df["A"] = df.apply(lambda x: convert_datetime_timezone(x["A"], x["offset"]), axis=1)
df.drop("offset", axis=1, inplace=True)

# Set the datetime column as the index and sort it, because label
# slicing with loc expects a sorted index
df.set_index("A", drop=True, inplace=True)
df.sort_index(inplace=True)
df.loc["2022-10-01 00:00:03":"2022-10-01 10:00:03", :]

Output:

                      B   C   D   E   F
A
2022-10-01 00:00:03  60  39  66  49  31
2022-10-01 10:00:03  82  95  36  45  30
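Since the question is about a large dataset, a vectorized conversion may be faster than a row-wise `apply`. The following is only a sketch under stated assumptions, not the answerer's method: it assumes every string ends in either " BST" or " GMT" and follows the format shown in the question, and the name `raw` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the same format as the question's timestamps
raw = pd.Series([
    "Sat 01 Oct 2022 12:00:03 AM BST",
    "Sat 01 Oct 2022 11:00:03 AM BST",
    "Thu 01 Dec 2022 10:02:02 AM GMT",
])

# Remember which rows were labelled BST (daylight saving time)
is_dst = (raw.str[-3:] == "BST").to_numpy()

# Strip the trailing abbreviation and parse the whole column at once
naive = pd.to_datetime(raw.str[:-4], format="%a %d %b %Y %I:%M:%S %p")

# Interpret the values as UK local time; the BST/GMT flag resolves the
# ambiguous hour at the autumn clock change. Then convert to UTC.
utc = naive.dt.tz_localize("Europe/London", ambiguous=is_dst).dt.tz_convert("UTC")

print(utc)
```

Keeping the column timezone-aware in UTC means every row shares one offset, so later range selections do not mix BST and GMT.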

gcuhipw93#

You can simply convert the time zone of start_date instead of converting the whole dataset.
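One possible reading of this suggestion is sketched below. It assumes the DataFrame already has a sorted, timezone-aware DatetimeIndex; the frame, its `value` column, and the date range used to build it are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical frame with a sorted, timezone-aware (UTC) DatetimeIndex
index = pd.date_range("2022-09-01", "2022-12-31", freq="D", tz="UTC")
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Express both bounds as UK local time, then convert them to UTC so
# they share the same UTC offset
start_date = pd.Timestamp("2022-10-01 00:00:03", tz="Europe/London").tz_convert("UTC")
end_date = pd.Timestamp("2022-12-01 10:02:02", tz="Europe/London").tz_convert("UTC")

# Only the two bounds were converted; the data itself is untouched
latest_df = df.loc[start_date:end_date]
print(latest_df.head())
```

Only two timestamps are converted here, so the cost does not grow with the size of the dataset.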
