我有一个带有可选列的Parquet文件,其模式如下:
{
"type": "record",
"name": "schema",
"fields": [
{
"name": "kpi",
"type": [
"null",
"string"
]
},
{
"name": "min",
"type": [
"null",
"int"
]
}]
}
我用图书馆看的 "org.apache.parquet" % "parquet-hadoop" % "1.11.0"
代码如下:
val file = new Path(s.getAbsolutePath)
val inputFile = HadoopInputFile.fromPath(file, new Configuration())
val r = new ParquetFileReader(inputFile, ParquetReadOptions.builder().build())
val schema = r.getFooter.getFileMetaData.getSchema
val pages: PageReadStore = r.readNextRowGroup
val colReadStore = new ColumnReadStoreImpl(pages, new GroupRecordConverter(schema).getRootConverter, schema, "blah")
val descriptorList = schema.getColumns
val res = for {colDescriptor <- descriptorList}
yield {
val colReader = colReadStore.getColumnReader(colDescriptor)
val totalValuesInColumnChunk = pages.getPageReader(colDescriptor).getTotalValueCount
val res = Range(0, totalValuesInColumnChunk.toInt).map(x => {
val value = colReader.getDescriptor.getPrimitiveType.getPrimitiveTypeName match {
case PrimitiveType.PrimitiveTypeName.DOUBLE => colReader.getDouble
case PrimitiveType.PrimitiveTypeName.INT32 => colReader.getInteger
case PrimitiveType.PrimitiveTypeName.BOOLEAN => colReader.getBoolean
case PrimitiveType.PrimitiveTypeName.BINARY => colReader.getBinary
}
colReader.consume()
value
})
res
}
我面临的问题是,第二列是可选的,带有一些空值。当我尝试读取第一行的单元格值时,它返回 2000
而不是null,下一个 consume
抛出错误 Reading past RLE/BitPacking stream
.
这是我用作输入的Parquet文件:
我怎样才能把整列的空字符保留在原来的位置呢?
暂无答案!
目前还没有任何答案,快来回答吧!