Using an evolving Avro schema for Impala/Hive storage

Posted by xriantvc on 2021-06-26 in Hive


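We have a JSON structure that we need to parse and use in Impala/Hive. Because the JSON structure is evolving, we thought we could use Avro.
We plan to parse the JSON and format it as Avro.
The Avro-formatted data can be used directly by Impala. Say we store it in the HDFS directory /user/hdfs/person_data/.
We will keep putting Avro-serialized data in that folder as we parse the input JSON documents one by one.
Say we have an Avro schema file for a person (hdfs://user/hdfs/avro/scheams/person.avsc) like this: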

{
 "type": "record",
 "namespace": "avro",
 "name": "PersonInfo",
 "fields": [
   { "name": "first", "type": "string" },
   { "name": "last", "type": "string" },
   { "name": "age", "type": "int" }
 ]
}
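
The parse-and-format step described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration of shaping one parsed JSON document to the schema's fields; it does not produce Avro binary output, which would be the job of an Avro library. The helper `to_record`, the type map `_PY_TYPES`, and the sample input (including its extra `nickname` key) are hypothetical and not part of the original question.

```python
import json

# Avro primitive type name -> Python type produced by json.loads.
_PY_TYPES = {"string": str, "int": int}

# The person schema from above, inlined so the sketch is self-contained.
PERSON_SCHEMA = json.loads("""
{
 "type": "record",
 "namespace": "avro",
 "name": "PersonInfo",
 "fields": [
   { "name": "first", "type": "string" },
   { "name": "last", "type": "string" },
   { "name": "age", "type": "int" }
 ]
}
""")


def to_record(json_text, schema):
    """Parse one JSON document into a dict holding exactly the schema's fields."""
    data = json.loads(json_text)
    record = {}
    for field in schema["fields"]:
        name = field["name"]
        if name not in data:
            raise KeyError(f"input is missing field {name!r}")
        value = data[name]
        expected = _PY_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"field {name!r} should be of Avro type {field['type']}")
        record[name] = value
    return record


# Hypothetical input; keys outside the schema (here "nickname") are dropped.
sample = '{"first": "Ann", "last": "Lee", "age": 30, "nickname": "Al"}'
print(to_record(sample, PERSON_SCHEMA))
```

Checking each document against the schema before serialization means a malformed input fails early, rather than surfacing later as an unreadable file in the table directory.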

For this schema, we will create the table in Hive as an external table:

-- EXTERNAL plus LOCATION point the table at the existing data directory
CREATE EXTERNAL TABLE kst
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/hdfs/person_data/'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://user/hdfs/avro/scheams/person.avsc');

Suppose that tomorrow we need to change this schema (hdfs://user/hdfs/avro/scheams/person.avsc) to:

{
 "type": "record",
 "namespace": "avro",
 "name": "PersonInfo",
 "fields": [
   { "name": "first", "type": "string" },
   { "name": "last", "type": "string" },
   { "name": "age", "type": "int" },
   { "name": "city", "type": "string" }
 ]
}

Can we keep putting newly serialized data in the same HDFS directory /user/hdfs/person_data/, and will Impala/Hive still work, returning the city column as NULL for the old records?

Hive mapreduce impala avro hadoop2

Source: https://stackoverflow.com/questions/36362171/using-evolving-avro-schema-for-impala-hive-storage

1 answer

pnwntuvh1#

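Yes, but every new column should specify a default value, so that files written with the old schema can still be read with the new one: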

{ "name": "newField", "type": "int", "default":999 }

or mark it as nullable, listing "null" first in the union and giving it a null default:

{ "name": "newField", "type": ["null", "int"], "default": null }
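
To illustrate why the default matters, here is a simplified, standard-library-only model of the rule Avro applies when the reader's schema has a field that the written data lacks: the field takes its declared default, and resolution fails if there is none. This is a sketch of the rule, not the Avro library or Hive/Impala's actual code path. The function `resolve_record` and the sample record are hypothetical, and the `city` field is shown as a nullable string with a null default, which goes beyond the schema in the question.

```python
import json

# Sentinel meaning "the schema declares no default for this field".
_NO_DEFAULT = object()


def resolve_record(writer_record, reader_schema):
    """Project a record written with an older schema onto a newer reader schema.

    Simplified model of Avro record resolution: a reader field missing from
    the written data takes its declared default; with no default, it fails.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]
            continue
        default = field.get("default", _NO_DEFAULT)
        if default is _NO_DEFAULT:
            raise ValueError(f"field {name!r} is missing and has no default")
        resolved[name] = default
    return resolved


# Hypothetical record written before "city" was added to the schema.
old_record = {"first": "Ann", "last": "Lee", "age": 30}

# Evolved schema, with "city" declared nullable and defaulting to null.
evolved_schema = json.loads("""
{
 "type": "record",
 "namespace": "avro",
 "name": "PersonInfo",
 "fields": [
   { "name": "first", "type": "string" },
   { "name": "last", "type": "string" },
   { "name": "age", "type": "int" },
   { "name": "city", "type": ["null", "string"], "default": null }
 ]
}
""")

# The old record gains city=None instead of failing to resolve.
print(resolve_record(old_record, evolved_schema))
```

Declaring `city` as a plain `"string"` with no default, as in the question's second schema, would make the same call raise for every old record in this model.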

Answered on 2021-06-26
