I get the following error when trying to save a Parquet file to a local directory with PySpark. I tried both Spark 1.6 and 2.2, and both give the same error.
It prints the schema correctly, but fails when writing the file.
base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
df = sqlContext.read.option("basePath", base_path).parquet(reg_path)
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"
df2 = df.coalesce(5)
df2.printSchema()
df2.write.mode('append').parquet(out_path)
org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
1 Answer
In my own case, I hit this error while writing a custom Parquet parser for Apache Tika. It turned out that if the file is being used by another process, the ParquetReader cannot access uncompressed_page_size, which produces this error. Verify that no other process is holding the file open.
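One way to check this (not part of the original answer, just a minimal sketch) is to list which processes currently have files under the Parquet directory open, using the third-party psutil package; the directory path below is taken from the question, without the "file:" prefix.

# Minimal sketch: find processes holding files under the Parquet directory open.
# Assumes psutil is installed (pip install psutil); path is from the question.
import psutil

parquet_dir = "/Users/xyz/Documents/Temp/parquet"

for proc in psutil.process_iter(["pid", "name"]):
    try:
        for f in proc.open_files():
            if f.path.startswith(parquet_dir):
                print(f"PID {proc.info['pid']} ({proc.info['name']}) has {f.path} open")
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        # Some processes cannot be inspected without elevated privileges; skip them.
        continue

On macOS or Linux you could get the same information from the shell with something like lsof +D /Users/xyz/Documents/Temp/parquet, if lsof is available.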