Unable to query Parquet data with nested fields in Presto DB

y3bcpkx1  posted on 2021-06-26  in  Hive

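I have data, some of which includes nested columns (arrays of objects), saved as Parquet from Spark 2.2.
I am now trying to access this data externally with Presto, and whenever I try to access any of the nested columns I get the following exception: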

com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://name-node/parquet_path/part-00023-8d4f14b1-a3f1-4055-b931-04838701048d-c000.snappy.parquet (offset=0, length=108289): parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:220)
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:115)
    at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:157)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:93)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:239)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:56)
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:90)
    at com.facebook.presto.hive.parquet.ParquetPageSource.<init>(ParquetPageSource.java:109)

Interestingly, I can query the other, non-nested columns without any problem.
The table is created as follows:

CREATE TABLE hive.tests.table_name (
not_nested_field_1 BIGINT,
not_nested_field_2 BIGINT,
not_nested_field_3 BOOLEAN,
not_nested_field_4 DOUBLE,
not_nested_field_5 ARRAY(VARCHAR),
nested_field_6 ARRAY(ROW(
    nested_level0_field1 BOOLEAN,
    nested_level0_field2 BIGINT,
    nested_level0_field3 BIGINT,
    nested_level0_field4 ARRAY(ROW(
        nested_level1_field1 BOOLEAN,
        nested_level1_field2 BIGINT,
        nested_level1_field3 VARCHAR,
        nested_level1_field4 ARRAY(ROW(
            nested_level2_field1 VARCHAR,
            nested_level2_field2 VARCHAR,
            nested_level2_field3 ARRAY(ROW(
                nested_level3_field1 VARCHAR,
                nested_level3_field2 VARCHAR)))),
        nested_level1_field5 ARRAY(ROW(
            nested_level2_field4 BIGINT,
            nested_level2_field5 BIGINT,
            nested_level2_field6 ARRAY(ROW(
                nested_level3_field3 VARCHAR,
                nested_level3_field4 VARCHAR)))))))))
WITH (
  format = 'PARQUET',
  external_location = 'hdfs://name-node/parquet_path/'
);

I am using Presto version 0.208, and the external table is created against a local Hive metastore.
Any help would be much appreciated :)

bwntbbo3  #1

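This was solved by setting the property hive.parquet.use-column-names=true in catalog/hive.properties. By default, Presto accesses Parquet data by column index, so this property has to be set explicitly for Presto to look up Parquet columns by name, as they are defined in the CREATE TABLE statement.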
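For reference, a minimal sketch of what the relevant part of the catalog file might look like. Only the hive.parquet.use-column-names line comes from the answer; the etc/ prefix in the path and the note about restarting are assumptions about a typical Presto installation, and any existing settings in the file (connector name, metastore URI, and so on) stay as they are.

```properties
# etc/catalog/hive.properties  (the etc/ prefix is an assumption; use your own catalog directory)
# ... keep the existing connector and metastore settings unchanged ...

# Match Parquet columns to table columns by name instead of by position.
hive.parquet.use-column-names=true
```

Catalog properties are read when Presto starts, so the coordinator and workers would most likely need a restart before the change takes effect. With name-based lookup, the column names in the CREATE TABLE statement must match the field names stored in the Parquet files written by Spark.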
