How can I save doc/docx/docm files to a directory or an S3 bucket using PySpark?

dkqlctbz · posted 2022-12-04 in Apache

I am trying to save a DataFrame as a document, but it returns the following error:

java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html

My code is as follows:

# f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)

Note that I can save files in CSV, text, and JSON formats, but is there a way to save a docx file using PySpark?
My question is: does Spark support saving data as doc/docx?
If not, is there any way to store the file, for example by writing a file stream object to a specific folder or S3 bucket?
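For comparison, a supported format writes without error (a minimal sketch; the output path is a placeholder):

# CSV is a built-in Spark data source, so this write succeeds.
f_data.write.format("csv").option("header", "true").save("dbfs:/FileStore/test/f_data_csv")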


83qze16e #1

In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. as a pandas DataFrame) and work from there.
Long answer: a document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale and does not support the DOCX format out of the box.

If you want to write DOCX files programmatically, you can (a sketch combining all three steps follows the list):

  1. Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
  2. Use a Python package such as python-docx to create the DOCX document and save it into a stream. See the question: Writing a Python Pandas DataFrame to Word document
  3. Upload the stream to an S3 object using, for example, boto3: Can you upload to S3 using a stream rather than a local file?
    Note: if your data has more than one hundred rows, ask the recipients how they intend to use it. Use DOCX only for reporting, not as a file-transfer format.
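Putting the three steps together, a minimal sketch (assuming python-docx and boto3 are installed; the bucket name and key are placeholders):

import io

import boto3
from docx import Document

# 1. Collect the Spark DataFrame into pandas on the driver.
pd_f_data = f_data.toPandas()

# 2. Build a DOCX table with python-docx and save it to an in-memory stream.
doc = Document()
table = doc.add_table(rows=1, cols=len(pd_f_data.columns))
for cell, name in zip(table.rows[0].cells, pd_f_data.columns):
    cell.text = str(name)
for _, row in pd_f_data.iterrows():
    for cell, value in zip(table.add_row().cells, row):
        cell.text = str(value)

stream = io.BytesIO()
doc.save(stream)
stream.seek(0)

# 3. Upload the stream to S3 (bucket and key below are hypothetical).
boto3.client("s3").upload_fileobj(stream, "my-bucket", "reports/f_data.docx")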
