Does Pandas read_csv drop duplicate header rows?

Posted by 68bkxrlz on 2023-06-20 in Other


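I have multiple CSV files in the cloud that I have to download as bytes. The files all share the same format, so I expect the same structure in each of them.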

itemid            timestamp                   y              y_lower
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
itemid            timestamp                   y              y_lower
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
itemid            timestamp                   y              y_lower
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635
43406  2023-05-29T16:00:00   27.61612174350883   4.7486855702091635

dataset_bytes_array, dataset_metadata = download_object_directory_bytes(
    dataset_storage.bucket_name, prefix=f'{dataset_storage.object_path}/datasets',
)
dataset_bytes_data = b''.join(dataset_bytes_array)

After building the final byte array, I create a Pandas DataFrame like this:

dataset_df = pd.read_csv(
    BytesIO(dataset_bytes_data), on_bad_lines='warn', keep_default_na=False, dtype=object,
)

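I thought on_bad_lines would help me skip the repeated header rows, but that does not seem to happen. Is there a generic way to remove the duplicate header rows?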

pandas

Source: https://stackoverflow.com/questions/76433883/pandas-read-csv-dropping-duplicate-header-rows

1 Answer


cwdobuhd1#

First find the index values of the repeated header rows, then drop them.

import pandas as pd

df = pd.DataFrame({'itemid': ['itemid', 1, 2, 3, 'itemid', 4, 5, 6], 'timestamp': ['timestamp', 1, 2, 3, 'timestamp', 4, 5, 6]})
print(df)

    itemid  timestamp
-----------------------
0   itemid  timestamp
1        1          1
2        2          2
3        3          3
4   itemid  timestamp
5        4          4
6        5          5
7        6          6

# Index labels of the rows whose 'itemid' cell repeats the header text
header_loc = df[df['itemid'] == 'itemid'].index
df.drop(header_loc, inplace=True)
print(df)

    itemid  timestamp
-----------------------
1        1          1
2        2          2
3        3          3
5        4          4
6        5          5
7        6          6
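
The same idea can be made independent of any single column by comparing every row against the column labels. The sketch below is an illustration and not part of the original answer: the sample bytes and the names raw, is_header, and cleaned are made up for the example. It also shows why on_bad_lines does not help here: that option targets lines with too many fields, whereas a repeated header has the expected number of fields and is parsed as an ordinary data row.

```python
from io import BytesIO

import pandas as pd

# Hypothetical sample: two CSV payloads joined together, so the header repeats.
raw = (
    b"itemid,timestamp,y\n"
    b"43406,2023-05-29T16:00:00,27.6\n"
    b"itemid,timestamp,y\n"
    b"43406,2023-05-29T17:00:00,28.1\n"
)

# The repeated header is not a "bad line", so it ends up in the data.
df = pd.read_csv(BytesIO(raw), on_bad_lines="warn", keep_default_na=False, dtype=object)

# A row is a repeated header when every cell equals its own column label.
# This relies on dtype=object, so that all cells are still strings.
is_header = (df.to_numpy() == df.columns.to_numpy()).all(axis=1)

cleaned = df.loc[~is_header].reset_index(drop=True)
print(cleaned)
```

Because the comparison uses all columns, a genuine data row that happens to contain the text of one column label in a single cell is kept.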

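Update 1: without hard-coding the header text to compare against (this keeps only the rows whose itemid parses as a number):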

df[pd.to_numeric(df['itemid'],errors='coerce').notnull()]
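
Another option is to avoid the repeated headers altogether by parsing each downloaded payload on its own and concatenating the resulting frames, instead of joining the bytes first. This is only a sketch under an assumption the question implies but does not state outright: that each element of dataset_bytes_array is one complete CSV file with its own header. The helper name read_csv_chunks and the sample payloads are hypothetical.

```python
from io import BytesIO

import pandas as pd


def read_csv_chunks(chunks):
    """Parse each bytes payload as its own CSV and stack the results."""
    frames = [
        pd.read_csv(BytesIO(chunk), keep_default_na=False, dtype=object)
        for chunk in chunks
        if chunk.strip()  # skip empty payloads, which read_csv cannot parse
    ]
    # Note: pd.concat raises ValueError if no frames are left.
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-in for dataset_bytes_array from the question.
chunks = [
    b"itemid,timestamp,y\n43406,2023-05-29T16:00:00,27.6\n",
    b"itemid,timestamp,y\n43406,2023-05-29T17:00:00,28.1\n",
]

combined = read_csv_chunks(chunks)
print(combined)
```

With this approach each header is consumed by its own read_csv call, so no filtering step is needed afterwards.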

