null单元格

9njqaruj  于 2021-05-31  发布在  Hadoop
关注(0)|答案(0)|浏览(241)

我有一个带有可选列的Parquet文件,其模式如下:

{
  "type": "record",
  "name": "schema",
  "fields": [
    {
      "name": "kpi",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "min",
      "type": [
        "null",
        "int"
      ]
    }]
}

我用图书馆看的 "org.apache.parquet" % "parquet-hadoop" % "1.11.0" 代码如下:

val file = new Path(s.getAbsolutePath)
    val inputFile = HadoopInputFile.fromPath(file, new Configuration())

    val r = new ParquetFileReader(inputFile, ParquetReadOptions.builder().build())
    val schema = r.getFooter.getFileMetaData.getSchema
    val pages: PageReadStore = r.readNextRowGroup

    val colReadStore = new ColumnReadStoreImpl(pages, new GroupRecordConverter(schema).getRootConverter, schema, "blah")
    val descriptorList = schema.getColumns
    val res = for {colDescriptor <- descriptorList}
      yield {
        val colReader = colReadStore.getColumnReader(colDescriptor)
        val totalValuesInColumnChunk = pages.getPageReader(colDescriptor).getTotalValueCount
        val res = Range(0, totalValuesInColumnChunk.toInt).map(x => {
            val value = colReader.getDescriptor.getPrimitiveType.getPrimitiveTypeName match {
              case PrimitiveType.PrimitiveTypeName.DOUBLE => colReader.getDouble
              case PrimitiveType.PrimitiveTypeName.INT32 => colReader.getInteger
              case PrimitiveType.PrimitiveTypeName.BOOLEAN => colReader.getBoolean
              case PrimitiveType.PrimitiveTypeName.BINARY => colReader.getBinary
           }
           colReader.consume()
           value
        })
        res
      }

我面临的问题是,第二列是可选的,带有一些空值。当我尝试读取第一行的单元格值时,它返回 2000 而不是null,下一个 consume 抛出错误 Reading past RLE/BitPacking stream .
这是我用作输入的Parquet文件:

我怎样才能把整列的空字符保留在原来的位置呢?

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题