Pandas: write a DataFrame to Parquet format with append

Posted by qv7cva1a on 2022-11-16 in Apache


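I am trying to write a pandas DataFrame to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with the new data. What am I missing?
The write syntax is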

df.to_parquet(path, mode='append')

The read syntax is

pd.read_parquet(path)


Source: https://stackoverflow.com/questions/47191675/pandas-write-dataframe-to-parquet-format-with-append

6 Answers


dy2hfwbg1#

To append, do the following:

import pandas as pd 
import pyarrow.parquet as pq
import pyarrow as pa

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)

# Write direct to your parquet file
pq.write_to_dataset(table , root_path=output)

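This will automatically append to the table: write_to_dataset treats root_path as a directory and adds a new file under it on each call.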


xyhw6mcr2#

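I used the AWS Wrangler library. It works like a charm.

Here is the reference documentation:

https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream, used the kinesis-python library to consume the messages, and wrote them to S3. I have not included the JSON processing logic, since this post is about the problem of not being able to append data to S3.

Here is the sample code I used: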

!pip install awswrangler
import awswrangler as wr
import pandas as pd

# a..g are placeholders for values parsed from each message (processing logic omitted)
event_data = pd.DataFrame(
    {'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
)
# print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",  # add new files instead of overwriting existing ones
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True,  # Optional
    )
    print("write successful")
except Exception as err:  # renamed so it does not shadow the value e above
    print(str(err))

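Any clarification would be helpful. In several posts I have read, the data is read and then written back again in full. But as the data grows larger, this slows the process down. It is inefficient.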


qacovj5a3#

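There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, modify it, and write it back, overwriting the original.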


mkshixfv4#

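It looks like it is possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries do not implement it.
The following is from the pandas docs: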

DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)

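We have to pass in both engine and **kwargs.

- engine: the parquet library to use ('fastparquet' here).
- **kwargs: additional arguments passed to the parquet library.
- **kwargs: what we need to pass here is append=True (from fastparquet).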

import pandas as pd
import os.path

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2,], 'col2': [3, 4]})
if not os.path.isfile(file_path):
  df.to_parquet(file_path, engine='fastparquet')
else:
  df.to_parquet(file_path, engine='fastparquet', append=True)

If append is set to True and the file does not exist, you will get the following error:

AttributeError: 'ParquetFile' object has no attribute 'fmd'

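After running the above script 3 times, the parquet file contains the data from all three runs.

If I inspect the metadata, I can see that this produced 3 row groups.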

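Note:

Appending can be inefficient if you write too many small row groups. The typically recommended row group size is closer to 100,000 or 1,000,000 rows. This has a few advantages over very small row groups. Compression works better, since compression operates only within a row group. There is also less overhead from storing statistics, since each row group stores its own statistics.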
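One way to avoid many tiny row groups is to collect small DataFrames in memory and append only once enough rows have accumulated. The BatchBuffer class below is a hypothetical helper, not part of pandas or fastparquet, and the threshold of 5 rows is for illustration only; in practice it would be closer to the sizes mentioned above.

```python
import pandas as pd


class BatchBuffer:
    """Collect small DataFrames and release them as one larger batch."""

    def __init__(self, min_rows):
        self.min_rows = min_rows
        self._frames = []
        self._rows = 0

    def add(self, frame):
        """Store frame; return a combined batch once min_rows is reached, else None."""
        self._frames.append(frame)
        self._rows += len(frame)
        if self._rows < self.min_rows:
            return None
        return self.flush()

    def flush(self):
        """Return everything buffered so far as one DataFrame and reset the buffer."""
        if not self._frames:
            return None
        batch = pd.concat(self._frames, ignore_index=True)
        self._frames, self._rows = [], 0
        return batch


buffer = BatchBuffer(min_rows=5)  # tiny threshold for illustration only
small = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
released = [buffer.add(small) for _ in range(3)]
# Only the third add() releases a batch; that batch is what you would
# append to the parquet file as a single row group.
```

Remember to call flush() at shutdown so that rows still in the buffer are not lost.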


91zkwejq5#

Use the fastparquet write function:

from fastparquet import write

write(file_name, df, append=True)

As far as I understand, the file must already exist.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write


omhiaaxx6#

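Pandas to_parquet() can handle both single files and directories containing multiple files. If the file already exists, pandas silently overwrites it. To append to a parquet object, just add a new file to the same parquet directory.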

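import datetime
import os

import pandas as pd

# path (target directory) and df (the DataFrame to append) are defined elsewhere
os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)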
