I am trying to speed up dynamic partitioning in Hive on a partitioned managed (internal) table. The table's schema is as follows:
hive> describe formatted saibhargav_history;
OK
# col_name data_type comment
appid string
appstatus string
apptype string
submittime bigint
starttime bigint
finishtime bigint
launchtime bigint
jobcounters map<string,string>
# Partition Information
# col_name data_type comment
finishyear string
finishmonth string
finishday string
finishhour string
# Detailed Table Information
Database: saibhvar
Owner: bhargav
CreateTime: Thu Sep 26 09:54:48 GMT 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://user/saibhargav/jobhistory
Table Type: MANAGED_TABLE
Table Parameters:
bucketing_version 2
orc.compress SNAPPY
transient_lastDdlTime 1569491688
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
line.delim \n
serialization.format 1
Time taken: 0.401 seconds, Fetched: 54 row(s)
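Every 4 hours I run a history-extraction service for submitted jobs (all ad hoc and scheduled Hive queries). The service populates this table, partitioning it by the finishtime of the jobs that ran in that time window (finishyear, finishmonth, finishday and finishhour). Suppose that in a later iteration a record belonging to the partition (finishyear=2020, finishmonth=04, finishday=28, finishhour=04) is dynamically added to the table: it overwrites the existing contents of that partition with the contents of this job.
The following query is used to insert into the managed table: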
insert into table `saibhvar.saibhargav_history` partition(`finishyear`,`finishmonth`,`finishday`,`finishhour`)
select `appId`,`appStatus`,`appType`,`submitTime`,`startTime`,`finishTime`, str_to_map(`jobCounters`,'\\006','\\005'),`finishYear`,`finishMonth`,`finishDay`,`finishHour` from `saibhvar.temp_table_1588226958`
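`saibhvar.temp_table_1588226958` is a temporary table into which the history-extraction service streams the data; it serves as the source for the dynamic-partition insert into the managed table.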
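For context, a dynamic-partition insert like this normally depends on a couple of session settings, and in Hive `INSERT INTO` is meant to append to a partition while `INSERT OVERWRITE` replaces it. The following is only a sketch of how the behaviour could be checked: the `SET` values are assumptions, not the service's actual configuration, and the partition values are the ones from the example above.

```sql
-- Assumed session settings for dynamic-partition inserts
-- (illustrative; not taken from the service's real configuration).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Run this before and after an extraction cycle to see whether rows in the
-- example partition are appended to or replaced.
SELECT count(*)
FROM saibhvar.saibhargav_history
WHERE finishyear = '2020'
  AND finishmonth = '04'
  AND finishday = '28'
  AND finishhour = '04';
```

If the count drops or resets after a run that used `INSERT INTO`, the overwrite is probably coming from somewhere other than the insert statement itself.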
I have read the documentation at https://cwiki.apache.org/confluence/display/hive/dynamicpartitions.
Any ideas on how to solve this and prevent the data in the partition from being overwritten?