How do I write the TIMESTAMP logical type (INT96) to Parquet using ParquetWriter?

Posted by rbpvctlc on 2021-05-27 in Hadoop


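I have a tool that uses org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to Parquet data files.
Currently it only handles int32, double, and string. I need to support the Parquet timestamp logical type (annotated as int96), and I am lost on how to do that because I can't find a precise specification online.
It appears this timestamp encoding (int96) is rare and not well supported. I've found very few specification details online. This GitHub README states that:
Timestamps saved as an int96 are made up of the nanoseconds in the day (first 8 bytes) and the Julian day (last 4 bytes).
Specifically:
Which Parquet type do I use for the column in the MessageType schema? I assume I should use the primitive type PrimitiveTypeName.INT96, but I'm not sure whether there is a way to specify a logical type.
How do I write the data? That is, in what format do I write the timestamp to the group? For an INT96 timestamp, I assume I must write some binary type?
Here is a simplified version of my code that demonstrates what I am trying to do. Specifically, take a look at the "TODO" comments; these are the two points in the code that correspond to the questions above.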

List<Type> fields = new ArrayList<>();
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT32, "int32_col", null));
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.DOUBLE, "double_col", null));
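// Strings are stored as BINARY annotated with UTF8; PrimitiveTypeName has no STRING constant
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "string_col", OriginalType.UTF8));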

// TODO: 
//   Specify the TIMESTAMP type. 
//   How? INT96 primitive type? Is there a logical timestamp type I can use w/ MessageType schema?
fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT96, "timestamp_col", null)); 

MessageType schema = new MessageType("input", fields);

// initialize writer
Configuration configuration = new Configuration();
configuration.setQuietMode(true);
GroupWriteSupport.setSchema(schema, configuration);
ParquetWriter<Group> writer = new ParquetWriter<Group>(
  new Path("output.parquet"),
  new GroupWriteSupport(),
  CompressionCodecName.SNAPPY,
  ParquetWriter.DEFAULT_BLOCK_SIZE,
  ParquetWriter.DEFAULT_PAGE_SIZE,
  1048576,
  true,
  false,
  ParquetProperties.WriterVersion.PARQUET_1_0,
  configuration
);

// write CSV data
SimpleGroupFactory f = new SimpleGroupFactory(schema);
CSVParser parser = CSVParser.parse(new File(csv), StandardCharsets.UTF_8, CSVFormat.TDF.withQuote(null));
for (CSVRecord csvRecord : parser) {
  Group group = f.newGroup();
  int colIndex = 0;
  for (String record : csvRecord) {
    if (record == null || record.isEmpty() || record.equals("NULL")) {
      // leave the OPTIONAL field unset and move on to the next column
      colIndex++;
      continue;
    }

    record = record.trim();

    // colIndex matches the field order declared in the schema above
    switch (colIndex) {
      case 0: // int32
        group.add(colIndex, Integer.parseInt(record));
        break;
      case 1: // double
        group.add(colIndex, Double.parseDouble(record));
        break;
      case 2: // string
        group.add(colIndex, record);
        break;
      case 3:
        // TODO: convert CSV string value to TIMESTAMP type (how?)
        throw new NotImplementedException();
    }
    colIndex++;
  }
  writer.write(group);
}
writer.close();

Java hadoop apache-spark parquet

Source: https://stackoverflow.com/questions/54657496/how-to-write-timestamp-logical-type-int96-to-parquet-using-parquetwriter

2 Answers


h79rfbju1#

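I figured it out, using this code from Spark SQL as a reference.
The INT96 binary encoding is split into two parts: the first 8 bytes are the nanoseconds since midnight, and the last 4 bytes are the Julian day.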

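import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.temporal.JulianFields;
import java.util.Calendar;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;
import org.apache.parquet.io.api.Binary;

String value = "2019-02-13 13:35:05";

final long NANOS_PER_HOUR = TimeUnit.HOURS.toNanos(1);
final long NANOS_PER_MINUTE = TimeUnit.MINUTES.toNanos(1);
final long NANOS_PER_SECOND = TimeUnit.SECONDS.toNanos(1);

// Parse date
// Note: the parser uses the JVM default time zone while the calendar is UTC,
// so the string is read as local time and the fields below are in UTC.
SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
cal.setTime(parser.parse(value));

// Calculate Julian days and nanoseconds in the day
// Calendar.MONTH is zero-based, LocalDate months are one-based
LocalDate dt = LocalDate.of(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, cal.get(Calendar.DAY_OF_MONTH));
int julianDays = (int) JulianFields.JULIAN_DAY.getFrom(dt);
long nanos = (cal.get(Calendar.HOUR_OF_DAY) * NANOS_PER_HOUR)
        + (cal.get(Calendar.MINUTE) * NANOS_PER_MINUTE)
        + (cal.get(Calendar.SECOND) * NANOS_PER_SECOND);

// Write INT96 timestamp: 8 bytes of nanoseconds, then 4 bytes of Julian day, little-endian
byte[] timestampBuffer = new byte[12];
ByteBuffer buf = ByteBuffer.wrap(timestampBuffer);
buf.order(ByteOrder.LITTLE_ENDIAN).putLong(nanos).putInt(julianDays);

// This is the properly encoded INT96 timestamp
Binary tsValue = Binary.fromReusedByteArray(timestampBuffer);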
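
For comparison, the same byte layout can be produced with java.time alone. This is only a sketch, not the Spark SQL code: the class name Int96TimestampEncoder and its encode method are made up for illustration, and it assumes the input string is already in UTC, so no time zone conversion takes place.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.JulianFields;

// Hypothetical helper, not part of parquet or Spark.
final class Int96TimestampEncoder {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /**
     * Encodes a "yyyy-MM-dd HH:mm:ss" string into the 12-byte INT96 layout.
     * Assumption: the string is already expressed in UTC.
     */
    static byte[] encode(String value) {
        LocalDateTime ts = LocalDateTime.parse(value, FORMAT);
        long nanosOfDay = ts.toLocalTime().toNanoOfDay();
        int julianDay = (int) ts.toLocalDate().getLong(JulianFields.JULIAN_DAY);
        // 8 bytes of nanoseconds first, then 4 bytes of Julian day, little-endian
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("2019-02-13 13:35:05");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}
```

The returned array could then be wrapped in a Binary the same way as timestampBuffer above.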


hfyxw5xn2#

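INT96 timestamps use the INT96 physical type without any logical type, so don't annotate them with anything.
If you are interested in the structure of an INT96 timestamp, take a look here. If you would like to see sample code that converts to and from this format, take a look at this file from Hive.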
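
As a rough illustration of the "from" direction (this is not the Hive code), here is a sketch that turns the 12 bytes back into a date-time. The class name Int96TimestampDecoder is made up, and the result is a LocalDateTime with no time zone attached; if the writer normalized to UTC, the value should be read as UTC.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper, not part of parquet or Hive.
final class Int96TimestampDecoder {
    // Julian day number of 1970-01-01, the day LocalDate.ofEpochDay(0) refers to
    private static final long JULIAN_DAY_OF_EPOCH = 2440588L;

    /** Decodes a 12-byte little-endian INT96 timestamp (nanos of day + Julian day). */
    static LocalDateTime decode(byte[] int96) {
        if (int96.length != 12) {
            throw new IllegalArgumentException("INT96 value must be 12 bytes, got " + int96.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes
        int julianDay = buf.getInt();    // last 4 bytes
        LocalDate date = LocalDate.ofEpochDay(julianDay - JULIAN_DAY_OF_EPOCH);
        // plusNanos tolerates values outside a single day instead of throwing
        return date.atStartOfDay().plusNanos(nanosOfDay);
    }

    public static void main(String[] args) {
        byte[] bytes = ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(48_905_000_000_000L) // 13:35:05 as nanoseconds since midnight
                .putInt(2_458_528)            // Julian day of 2019-02-13
                .array();
        System.out.println(decode(bytes));
    }
}
```

Decoding the output of your own writer this way is a cheap check that the byte order and the field order are right before opening the file in Spark or Hive.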
