csv How to split a DataFrame by row

Posted by pu82cl6c on 2022-12-06 in Other


I have a dataset with 6.2M records. When I split it by group, about 1.2M records are lost. Here is part of the dataset:

VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count |
1        | 2020-01-01 00:28:15  | 2020-01-01 00:33:03   |  1.0  
1        | 2020-01-01 00:35:39  | 2020-01-01 00:43:04   |  1.0  
..       |....................  | ...................   |  ...
1        | 2020-01-31 00:47:41  | 2020-01-31 00:53:52   |  1.0  
1        | 2020-01-31 00:55:23  | 2020-01-31 01:00:14   |  1.0  
2        | 2020-01-31 00:01:58  | 2020-01-31 00:04:16   |  1.0

I need to split it by day, based on the tpep_dropoff_datetime column. This is the code I use to do that, but as mentioned above, it does not work correctly.

for date, g in df.groupby(pd.to_datetime(df['tpep_dropoff_datetime']).dt.normalize().astype(str)):
    g.to_csv(f'{date}.csv', index=False)

Any ideas on how to split the DataFrame?
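The question does not say where the missing records go, so the following is only a diagnostic sketch, not a confirmed explanation. It compares the total row count with the number of rows that end up in a group, and counts rows whose drop-off timestamp is missing or cannot be parsed. The sample values (including the `None` and `"not a date"` entries) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: two valid timestamps plus one missing and one malformed value.
df = pd.DataFrame({
    "VendorID": [1, 1, 2, 2],
    "tpep_dropoff_datetime": [
        "2020-01-01 00:33:03",
        "2020-01-31 00:53:52",
        None,
        "not a date",
    ],
})

# errors="coerce" turns anything unparseable into NaT instead of raising.
dropoff = pd.to_datetime(df["tpep_dropoff_datetime"], errors="coerce")
day = dropoff.dt.normalize()

# Rows whose drop-off day could not be determined.
n_missing = int(day.isna().sum())

# groupby drops NaT keys by default, so the group sizes only cover dated rows.
n_grouped = int(df.groupby(day).size().sum())

print(f"total={len(df)} grouped={n_grouped} missing={n_missing}")
```

If `grouped + missing` equals the total, every row is accounted for, and any remaining gap has to come from somewhere else (for example, how the output files are counted afterwards).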

csv

Source: https://stackoverflow.com/questions/74348157/how-to-split-dataframe-column-by-row

1 Answer


liwlm1x91#

You can try this, although I believe it may not be the best approach (pandas probably has a better way to do it).

import pandas as pd
cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
    [[1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
     [1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
     [1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
     [1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0],
    ],
    columns=cols,
)
# The dates in this example are strings, so convert them to datetime.
# This may not be necessary, depending on your data.
df["tpep_dropoff_datetime"] = pd.to_datetime(df['tpep_dropoff_datetime'], format="%Y-%m-%d %H:%M:%S")
# Create a new column named "my_date" that
# holds the date part of the "tpep_dropoff_datetime" column
df["my_date"] = df["tpep_dropoff_datetime"].dt.date

# Group all the rows by date, then select each group's rows by their index
for date, indexes in df.groupby('my_date').groups.items():
    print(f"date: {date}")
    print(f"indexes: {indexes}")
    # Copying the rows I want according to the index
    aux_df = df.loc[indexes]
    print(aux_df)
    # Exporting to csv only the columns I want
    aux_df.to_csv(f"{date}.csv", columns=cols, index=False)

The output is written to files, and the console shows the following:

date: 2020-01-01
indexes: Int64Index([0, 1], dtype='int64')
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count     my_date
0         1  2020-01-01 00:28:15   2020-01-01 00:33:03              1.0  2020-01-01
1         1  2020-01-01 00:35:39   2020-01-01 00:43:04              1.0  2020-01-01
date: 2020-01-31
indexes: Int64Index([2, 3], dtype='int64')
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count     my_date
2         1  2020-01-31 00:47:41   2020-01-31 00:53:52              1.0  2020-01-31
3         1  2020-01-31 00:55:23   2020-01-31 01:00:14              1.0  2020-01-31

With this, at least I can be sure the dates come out right, though it may not be the most efficient approach.
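A slightly more direct variant is sketched below; it has not been benchmarked against the code above. It iterates over the groups themselves instead of looking rows up by index, and keeps the grouping key in a separate Series so no helper column needs to be excluded when writing. The `out_dir` temporary directory and the `written` dictionary are assumptions added for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

cols = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
df = pd.DataFrame(
    [[1, "2020-01-01 00:28:15", "2020-01-01 00:33:03", 1.0],
     [1, "2020-01-01 00:35:39", "2020-01-01 00:43:04", 1.0],
     [1, "2020-01-31 00:47:41", "2020-01-31 00:53:52", 1.0],
     [1, "2020-01-31 00:55:23", "2020-01-31 01:00:14", 1.0]],
    columns=cols,
)

# Build the grouping key as a separate Series so df itself is left unchanged.
day = pd.to_datetime(df["tpep_dropoff_datetime"]).dt.date

# Assumption: files go to a temporary directory here; use your own path instead.
out_dir = Path(tempfile.mkdtemp())

# Record how many rows each file receives so the total can be checked afterwards.
written = {}
for date, group in df.groupby(day):
    group.to_csv(out_dir / f"{date}.csv", index=False)
    written[str(date)] = len(group)

print(written)
```

Comparing `sum(written.values())` with `len(df)` afterwards gives a quick check that every row landed in some file.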

2022-12-06
