How to process skewed data using Hive?

Posted by pod7payv on 2021-06-02 in Hadoop


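I am running a join in Hive, but when the reducers reach 99% one of them gets stuck.
I then found that the table contains skewed data. For example, table A has 1 million rows while table B has only 10k. In table A, 80% of the values in the join column are identical and the rest are spread over other values, so the Hive reducer gets stuck on that one value.
My query is: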

INSERT INTO TABLE xyz SELECT m.name, m.country, m.user_type, m.category FROM A m JOIN category n ON (m.name = n.name) where country=2 GROUP BY m.name, m.country, m.user_type, m.category;

Please suggest possible solutions. How can I run a join on this kind of data?

hadoop Hive inner-join JoinTable

Source: https://stackoverflow.com/questions/36147699/how-to-process-skewed-data-using-hive

3 Answers


7cjasjjr1#

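As of Hive 0.10.0, tables can be created as skewed or altered to be skewed (in which case partitions created after the ALTER statement will be skewed). In addition, skewed tables can use the list bucketing feature by specifying the STORED AS DIRECTORIES option. For details, see the DDL documentation: Create Table, Skewed Tables, and Alter Table Skewed or Stored as Directories.
Please use this link as a reference.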
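
The DDL described above might look like the following minimal sketch. The table name a_skewed, the column types, and the skewed value 'some_hot_name' are assumptions for illustration, since the question does not show the schema or the actual hot value.

```sql
-- Hypothetical table mirroring the columns used in the question's query.
-- Column types and the skewed value 'some_hot_name' are assumptions.
CREATE TABLE a_skewed (
  name      STRING,
  country   INT,
  user_type STRING,
  category  STRING
)
SKEWED BY (name) ON ('some_hot_name')
STORED AS DIRECTORIES;  -- optional: turns on list bucketing for the skewed value

-- Alternatively, mark the existing table as skewed.
-- Only partitions created after this statement are affected.
ALTER TABLE A SKEWED BY (name) ON ('some_hot_name') STORED AS DIRECTORIES;
```

Data already in table A is not reorganized by the ALTER statement, so it would need to be reloaded to benefit.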


oknrviil2#

You can try a map join, as shown below:

set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize=25000000; -- This default value is 25MB, you can change it.
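
If automatic conversion does not kick in, an explicit hint is another option. The sketch below reuses the question's query and assumes the small table is `category` (alias `n`), as the question's description of the 10k-row table suggests.

```sql
-- Map join hints are ignored by default (hive.ignore.mapjoin.hint=true),
-- so disable that setting for the hint below to take effect.
set hive.ignore.mapjoin.hint=false;

-- Assumption: category (alias n) is the small table and fits in memory.
INSERT INTO TABLE xyz
SELECT /*+ MAPJOIN(n) */ m.name, m.country, m.user_type, m.category
FROM A m
JOIN category n ON (m.name = n.name)
WHERE country=2
GROUP BY m.name, m.country, m.user_type, m.category;
```

Because the small table is loaded into memory on every mapper, the join itself needs no reducer, so no single reducer receives all rows for the hot key.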


ohtdti5x3#

I found a solution to the problem above.
Set the following parameters in Hive before running the join.

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.autogather=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.auto.convert.join=false;

A few of these parameters need to be adjusted according to your data size and cluster size.
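
Of the settings listed in this answer, only the `hive.skewjoin.*` and `hive.optimize.skewjoin` ones target skew directly. The rest are general tuning (parallelism, vectorization, statistics, compression). A reduced sketch, applied to the question's query and using the same threshold value as the list, could be:

```sql
-- Enable runtime skew-join handling: rows for keys that exceed the threshold
-- are set aside and joined in a follow-up map join job.
set hive.optimize.skewjoin=true;
-- A key is treated as skewed once more than this many rows share it.
-- 100000 is the value from the answer; tune it to your data.
set hive.skewjoin.key=100000;

INSERT INTO TABLE xyz
SELECT m.name, m.country, m.user_type, m.category
FROM A m
JOIN category n ON (m.name = n.name)
WHERE country=2
GROUP BY m.name, m.country, m.user_type, m.category;
```

With 1 million rows in table A and 80% of them sharing one join value, the hot key has roughly 800,000 rows, well above this threshold.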

