I am trying to save a dataframe to a Word document, but it returns the following error:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html
My code is as follows:
#f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)
Note that I can save files in CSV, text, and JSON formats, but is there any way to save a DOCX file using PySpark?
My question is: is saving data as doc/docx supported at all?
If not, is there any way to store the file, e.g. by writing a file stream object to a specific folder or S3 bucket?
1 Answer
In short: no, Spark does not support the DOCX format out of the box. You can still collect the data to the driver node (e.g. as a pandas DataFrame) and work from there.
Long answer: a document format like DOCX is meant for presenting information in small tables with style metadata, while Spark focuses on processing large amounts of data at scale, so it does not ship a DOCX data source.
If you want to write DOCX files programmatically, first bring the data to the driver as a pandas DataFrame (toDF() would only return another Spark DataFrame; toPandas() is what collects to the driver):
pd_f_data = f_data.toPandas()
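From there you can use a library such as python-docx to write the rows into a Word table. A minimal sketch, assuming python-docx is installed (pip install python-docx) and that pd_f_data is the pandas DataFrame from the line above; the output path mirrors the dbfs:/FileStore/test path from the question and relies on Databricks mounting DBFS at /dbfs on the driver:

from docx import Document

doc = Document()

# One header row, then one row per record.
table = doc.add_table(rows=1, cols=len(pd_f_data.columns))
for cell, col_name in zip(table.rows[0].cells, pd_f_data.columns):
    cell.text = str(col_name)

for _, row in pd_f_data.iterrows():
    cells = table.add_row().cells
    for cell, value in zip(cells, row):
        cell.text = str(value)

# On Databricks, dbfs:/FileStore/... is visible as /dbfs/FileStore/... on the driver.
doc.save("/dbfs/FileStore/test/test.docx")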
Note: if your data has more than a hundred rows or so, ask the recipients how they intend to use it; DOCX is suitable for reporting, not as a file transfer format.
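As for the S3 part of the question: once the document exists on the driver's local disk, you can upload it with any ordinary S3 client. A sketch using boto3, with a hypothetical bucket name and key (not from the original post) and assuming AWS credentials are already configured:

import boto3

# Hypothetical bucket and key; replace with your own.
s3 = boto3.client("s3")
s3.upload_file("/dbfs/FileStore/test/test.docx", "my-bucket", "reports/test.docx")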