How to split a DataFrame by rows into CSV files

pu82cl6c  posted on 2022-12-06  in Other

I have a dataset with 6.2M records. When I split it with a groupby, about 1.2M records go missing. Here is part of the dataset:

VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count |
1        | 2020-01-01 00:28:15  | 2020-01-01 00:33:03   |  1.0  
1        | 2020-01-01 00:35:39  | 2020-01-01 00:43:04   |  1.0  
..       |....................  | ...................   |  ...
1        | 2020-01-31 00:47:41  | 2020-01-31 00:53:52   |  1.0  
1        | 2020-01-31 00:55:23  | 2020-01-31 01:00:14   |  1.0  
2        | 2020-01-31 00:01:58  | 2020-01-31 00:04:16   |  1.0

I need to split it by day, based on the tpep_dropoff_datetime column. This is the code I use for that, but as mentioned above, it does not work correctly.

for date, g in df.groupby(pd.to_datetime(df['tpep_dropoff_datetime']).dt.normalize().astype(str)):
    g.to_csv(f'{date}.csv', index=False)

Any ideas on how to split the DataFrame?
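One way to narrow down where rows go missing is to compare the row count before and after grouping. The sketch below uses a small made-up sample (the real data is not shown in full) and relies on two documented pandas behaviours: `pd.to_datetime(..., errors="coerce")` turns unparseable values into NaT, and `groupby` skips rows whose key is missing by default. Whether this is what happens in the 6.2M-row dataset is an assumption to verify, not a diagnosis.

```python
import pandas as pd

# Hypothetical sample standing in for the real data; the last row has a bad date.
df = pd.DataFrame({
    "VendorID": [1, 1, 2, 2],
    "tpep_dropoff_datetime": [
        "2020-01-01 00:33:03",
        "2020-01-01 00:43:04",
        "2020-01-31 00:04:16",
        "not a date",
    ],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
dropoff = pd.to_datetime(df["tpep_dropoff_datetime"], errors="coerce")
n_unparsed = int(dropoff.isna().sum())

# By default, groupby leaves out rows whose key is missing (NaT here).
sizes = df.groupby(dropoff.dt.normalize()).size()
n_grouped = int(sizes.sum())
n_dropped = len(df) - n_grouped

print(f"rows in df: {len(df)}")
print(f"rows that landed in a group: {n_grouped}")
print(f"rows with an unparseable date: {n_unparsed}")
print(f"rows left out of every group: {n_dropped}")
```

If the last two numbers match on the real data, the missing rows have missing or malformed dropoff timestamps. If they are both zero, it is worth checking whether some dropoff dates fall outside the expected month and therefore end up in files that were not counted.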

liwlm1x9  #1

You can try this, although I suspect it is not the best approach (pandas probably has a better way to do it).

import pandas as pd
cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
    [[1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
     [1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
     [1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
     [1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0],
    ],
    columns=cols,
)
# In this example the dates are strings, so I convert them to datetime.
# This might not be necessary, depending on your data.
df["tpep_dropoff_datetime"] = pd.to_datetime(df['tpep_dropoff_datetime'], format="%Y-%m-%d %H:%M:%S")
# Create a new column named "my_date" that
# contains the date part of the column "tpep_dropoff_datetime"
df["my_date"] = df["tpep_dropoff_datetime"].dt.date

# Group all the rows by date and select each group's rows by their index labels
for date, indexes in df.groupby('my_date').groups.items():
    print(f"date: {date}")
    print(f"indexes: {indexes}")
    # Copying the rows I want according to the index
    aux_df = df.loc[indexes]
    print(aux_df)
    # Exporting to csv only the columns I want
    aux_df.to_csv(f"{date}.csv", columns=cols, index=False)

The output is written to one CSV file per date, and the console shows the following:

date: 2020-01-01
indexes: Int64Index([0, 1], dtype='int64')
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count     my_date
0         1  2020-01-01 00:28:15   2020-01-01 00:33:03              1.0  2020-01-01
1         1  2020-01-01 00:35:39   2020-01-01 00:43:04              1.0  2020-01-01
date: 2020-01-31
indexes: Int64Index([2, 3], dtype='int64')
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count     my_date
2         1  2020-01-31 00:47:41   2020-01-31 00:53:52              1.0  2020-01-31
3         1  2020-01-31 00:55:23   2020-01-31 01:00:14              1.0  2020-01-31

With this I can at least be sure the dates come out right, though it may not be the most efficient approach.
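A somewhat more direct variant, as a sketch: group by the calendar day computed on the fly instead of adding a helper column, iterate over the groups themselves, and tally the rows written so the total can be compared with `len(df)`. The sample data and the `out_dir` location are made up for illustration, and no claim is made here about how much faster this is on 6.2M rows.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    "VendorID": [1, 1, 1, 2],
    "tpep_dropoff_datetime": [
        "2020-01-01 00:33:03",
        "2020-01-01 00:43:04",
        "2020-01-31 00:53:52",
        "2020-01-31 00:04:16",
    ],
    "passenger_count": [1.0, 1.0, 1.0, 1.0],
})

# Group key: the calendar day of each dropoff, kept outside df
# so the exported files contain only the original columns.
day = pd.to_datetime(df["tpep_dropoff_datetime"]).dt.date

# out_dir is a hypothetical location; a temporary directory keeps the sketch self-contained.
out_dir = Path(tempfile.mkdtemp())

written = 0
for date, group in df.groupby(day):
    # date is a datetime.date, so the file name looks like 2020-01-01.csv
    group.to_csv(out_dir / f"{date}.csv", index=False)
    written += len(group)

files = sorted(p.name for p in out_dir.glob("*.csv"))
print(files)
print(f"rows written: {written} of {len(df)}")
```

If `written` is smaller than `len(df)` on the real data, some rows had a missing group key and were skipped by `groupby`.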
