Hive job taking too long to read data and insert into a partitioned, sorted, bucketed table

Posted by aemubtdh on 2021-07-13 in Hadoop

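We have a job that reads data from a Hive table with around 3 billion rows and inserts it into a sorted, bucketed table.
The files in both the source and destination tables are in Parquet format.
The job takes far too long to finish. We had to stop it after 3 days.
We recently migrated to a new cluster. The old cluster was 5.12 and the new one is 6.3.1. On the 5.12 cluster this job ran fine and finished within 6 hours. On the new cluster it takes far too long.
To fix the issue we tried the following:
Removed the cap on reducers: removed set hive.exec.reducers.max=200;
Set mapreduce.job.running.reduce.limit=100;
Merged files at the source so that we are not reading small files. Each file in the source table was increased to 1 GB.
Reduced the number of rows in the source table to reduce the data the mappers read.
Reduced the max split size to 64 MB to increase the number of mappers.
Inserted into a new table.
Inserted into a new table that is neither sorted nor bucketed.
The query we are trying to run: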

set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=900000;

set mapreduce.input.fileinputformat.split.maxsize=64000000;
set mapreduce.job.running.reduce.limit=100;

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

INSERT OVERWRITE TABLE dbname.features_archive_new PARTITION (feature, ingestmonth)
Select mpn,mfr,partnum,source,ingestdate,max(value) as value,feature,ingestmonth
from dbname.features_archive_tmp
where feature = 'price'
and ingestmonth like '20%'
group by mpn,mfr,partnum,source,ingestdate,feature,ingestmonth;

hadoop Hive cloudera

Source: https://stackoverflow.com/questions/66535930/hive-job-taking-too-long-to-read-data-and-insert-in-partitioned-sorted-bucketed

1 Answer

kqqjbcuj1#

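We found that Hive 2.x in Cloudera 6.3 uses vectorization, whereas Hive 1.x in the old Cloudera 5.12 did not.
Setting the property below resolved the issue for us. I have no explanation for this; vectorization should make queries faster, not slower.
hive.vectorized.execution.enabled=false;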
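
As a minimal sketch of how this could be applied for a single session instead of cluster-wide, the property can be set just before the query from the question. Running `set` with only the property name prints its current value, which helps confirm what the new cluster defaults to. Whether a session-level override is permitted on a given cluster is an assumption here; the answer does not say where the property was set.

```sql
-- Print the current value to see what the cluster defaults to.
set hive.vectorized.execution.enabled;

-- Disable vectorized execution for this session only
-- (the property reported above as resolving the slowdown).
set hive.vectorized.execution.enabled=false;

-- Next, run the session settings and the INSERT OVERWRITE statement
-- from the question unchanged, in this same session.
```

Keeping the override at session scope limits the change to this one job, so other queries on the cluster still run with vectorization enabled.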

Posted 2021-07-13
