Last few mappers with a skewed data set take a long time to run on a GROUP BY in Hive MapReduce

Posted by fbcarpbf on 2021-05-29 in Hadoop

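I am running a simple GROUP BY query, shown below, against a data set of 3.5 TB. I know my data is skewed: the "partno" column accounts for 95% of the data set, so the whole job takes 9 hours to complete, and the last few mappers take the longest.
Could you help me with this? I need to optimize the job so that it runs efficiently. Basically, I need help handling skewed data in both GROUP BY and joins.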

SELECT cntry, partno,
       percentile_approx(part_pr, 0.999) AS part_pr_cutoff
FROM sourceTable
GROUP BY cntry, partno;
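
To put a number on the skew described above, a row count per grouping key can help. The following is only a minimal sketch: it reuses the table and column names from the query above, and the `row_cnt` alias and the `LIMIT 20` cutoff are arbitrary choices made for illustration, not part of the original job.

```sql
-- Sketch: count rows per grouping key to see how concentrated the data is.
-- sourceTable, cntry and partno come from the query above;
-- row_cnt and LIMIT 20 are illustrative assumptions.
SELECT cntry,
       partno,
       COUNT(*) AS row_cnt
FROM sourceTable
GROUP BY cntry, partno
ORDER BY row_cnt DESC
LIMIT 20;
```

If a handful of keys dominate the output, those are the groups whose tasks are likely to run longest.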

Below are the Hive properties I set in my HQL file.

SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;
SET hive.groupby.skewindata=true;
SET hive.optimize.skewjoin.compiletime=true;
SET hive.optimize.skewjoin=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.exec.parallel=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.optimize.index.filter=true;
SET hive.optimize.ppd=true;
SET hive.mapjoin.smalltable.filesize=25000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
SET hive.optimize.groupby=true;
SET hive.groupby.orderby.position.alias=true;
SET hive.multigroupby.singlereducer=true;
SET hive.optimize.point.lookup=true;
SET hive.optimize.point.lookup.min=true;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=268435456;
SET hive.map.aggr=true;
SET hive.optimize.distinct.rewrite=true;
SET mapreduce.map.speculative=false;
SET hive.fetch.task.conversion=more;
SET hive.fetch.task.aggr=true;
SET hive.fetch.task.conversion.threshold=1024000000;

hadoop Hive mapreduce hiveql skew

Source: https://stackoverflow.com/questions/56353626/last-few-mappers-with-skew-data-set-taking-long-time-to-run-on-groupby-hive-map
