I am trying to load a Parquet file into a Spark DataFrame:
val df = spark.read.parquet(path)
and I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.
While going through the code, I noticed that Spark's VectorizedParquetRecordReader.java has a check in initializeInternal():
Type t = requestedSchema.getFields().get(i);
if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
throw new UnsupportedOperationException("Complex types not supported.");
}
So I believe it is failing at this check. Can anyone suggest a way around this?
My Parquet data looks like this:
Key1 = value1
Key2 = value1
Key3 = value1
Key4:
.list:
..element:
...key5:
....list:
.....element:
......certificateSerialNumber = dfsdfdsf45345
......issuerName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......subjectName = CN=Microsoft Windows, OU=MOPR, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sfdasf42dsfsdfsdfsd
......validFrom = 2009-12-07 21:57:44.000000
......validTo = 2011-03-07 21:57:44.000000
....list:
.....element:
......certificateSerialNumber = dsafdsafsdf435345
......issuerName = CN=Microsoft Root Certificate Authority, DC=microsoft, DC=com
......subjectName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sdfsdfdsf43543
......validFrom = 2005-09-15 21:55:41.000000
......validTo = 2016-03-15 22:05:41.000000
I suspect Key4 may be triggering this problem because of its nested tree. The input data was JSON, so Parquet may not handle JSON's level of nesting.
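A rough sketch of the Spark SQL schema implied by the listing above (field names are taken from the data; the leaf types are assumptions, since the real types are not shown). The point is that Key4 resolves to a non-primitive group, which is exactly what the isPrimitive() check rejects:

import org.apache.spark.sql.types._

// certificate fields as listed under key5 (all leaf types assumed to be strings here)
val certificate = StructType(Seq(
  StructField("certificateSerialNumber", StringType),
  StructField("issuerName", StringType),
  StructField("subjectName", StringType),
  StructField("thumbprintAlgorithm", StringType),
  StructField("thumbprintContent", StringType),
  StructField("validFrom", StringType),
  StructField("validTo", StringType)
))

// key5 is a list of certificates, and Key4 is a list of elements wrapping key5
val key4Element = StructType(Seq(StructField("key5", ArrayType(certificate))))

val expectedSchema = StructType(Seq(
  StructField("Key1", StringType),
  StructField("Key2", StringType),
  StructField("Key3", StringType),
  StructField("Key4", ArrayType(key4Element))  // non-primitive, so the vectorized reader's isPrimitive() check fails
))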
I found a bug report, https://issues.apache.org/jira/browse/hive-13744,
but it describes a complex-type problem in Hive. I am not sure whether it also covers the Parquet problem here.
Update 1: Exploring Parquet further, here is what I found:
spark.write created 5 Parquet files, and 2 of them are empty. In those files the schema of the column that should be ArrayType comes out as string type, and when I try to read all the files together I see the exception above.
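One way to confirm which part files carry the wrong schema is to read them one by one. A minimal sketch, assuming the same path as above; the part-file names are hypothetical placeholders:

// inspect each part file's schema separately to spot the empty/mismatched ones
// (the file names below stand in for the 5 files spark.write produced)
val partFiles = Seq(
  "part-00000.parquet",
  "part-00001.parquet",
  "part-00002.parquet",
  "part-00003.parquet",
  "part-00004.parquet"
)
partFiles.foreach { f =>
  println(s"=== $f ===")
  spark.read.parquet(s"$path/$f").printSchema()  // the empty files should show the column as string instead of array
}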
1 Answer
Take 1
SPARK-12854 "Vectorize Parquet reader" states that "ColumnarBatch supports structs and arrays" (see GitHub pull request 10820), starting with Spark 2.0.0.
SPARK-13518 "Enable vectorized parquet reader by default", also starting with Spark 2.0.0, handles the property
spark.sql.parquet.enableVectorizedReader
(see GitHub commit e809074). My 2 cents: disable the "vectorized" optimization and see what happens.
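A minimal sketch of that suggestion, assuming the same Spark session and path as in the question:

// turn off the vectorized Parquet reader for the current session and retry the read
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet(path)
df.printSchema()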
Take 2
Since the problem has been narrowed down to a few empty files that do not show the same schema as the "real" files, my 3 cents: experiment with
spark.sql.parquet.mergeSchema
to see whether the schema from the actual files takes precedence after merging. Apart from that, you could try to eliminate the empty files at write time through some kind of repartitioning, e.g.
coalesce(1)
(OK, 1 is a bit facetious, but you get the point).
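A minimal sketch combining both suggestions, assuming the same path as in the question; the output path is a placeholder:

// read side: ask the Parquet source to merge the schemas of all part files
val merged = spark.read.option("mergeSchema", "true").parquet(path)

// write side: collapse to a single partition before writing so no empty part files are produced
merged.coalesce(1).write.parquet("/tmp/certificates_clean")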