python-3.x Datatypes are not preserved when a pandas DataFrame is partitioned and saved as a parquet file using pyarrow

Posted by nqwrtyyt on 2023-02-06 in Python


When a pandas DataFrame is partitioned and saved as a parquet file using pyarrow, the datatypes are not preserved.

**Case 1: Saving a partitioned dataset. Datatypes are NOT preserved.**

# Save a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
# Drop the pandas index at conversion time; from_pandas accepts preserve_index
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path, partition_cols=partition_cols)

# Load the partitioned parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

**Output:**

Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object

**Case 2: Unpartitioned dataset. Datatypes are preserved.**

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
# Drop the pandas index at conversion time; from_pandas accepts preserve_index
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path)

# Load the unpartitioned parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

**Output:**

Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object

python-3.x

Source: https://stackoverflow.com/questions/57308349/datatypes-are-not-preserved-when-a-pandas-dataframe-partitioned-and-saved-as-par

2 Answers


nwlqm0z1 #1

There is no obvious way to do this. See the JIRA issue below.
https://issues.apache.org/jira/browse/ARROW-6114

2023-02-06

v7pvogib #2

You can try this:

import pyarrow as pa
import pyarrow.parquet as pq

# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)

# Write a single Parquet file with the default compression
# (pass compression='BROTLI' to write_table if Brotli is wanted)
pq.write_table(table, 'file_name.parquet')

2023-02-06
