优化多表连接的配置单元查询

idv4meu8  于 2021-06-03  发布在  Hadoop
关注(0)|答案(1)|浏览(360)
INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID, 
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID, 
s.STORE_ID, 
s.STORE_NAME, 
p.CREATE_DATE, 
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END, 
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END, 
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i  
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID 
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID 
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID 
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;

我有一个连接查询,在这里我连接巨大的表,我试图优化这个配置单元查询。这是一些关于table的事实
图像表有60m行,产品表有1b行,产品目录有1000行,商店有1m行,库存信息有100行,可定制的图像有200k行。
一个产品可以有一个或两个图像(image1和image2),产品级别信息只存储在产品表中。我尝试将带有product的连接移到底部,但我无法这样做,因为下面的所有其他连接都需要product表中的数据。
这是我到目前为止试过的,1。我给了hive一个提示,让它流式处理产品表,因为它是最大的一个2。我将表(在create table期间)压缩成256个bucket(在image\u id上),然后执行join-没有给我任何显著的性能提升3。将输入格式从textfile(gzip文件)更改为sequence file,这样它就可以被拆分,因此如果hive想要运行更多的Map程序,就可以运行更多的Map程序
下面是来自配置单元控制台的一些关键日志。我在aws中运行了这个配置单元查询。有人能帮我理解这里的主要瓶颈吗?此作业仅处理实际数据的一个子集。

Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job  -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%,  reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%,  reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%,  reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%,  reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%,  reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%,  reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%,  reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%,  reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec

MapReduce Jobs Launched: 

Job 0: Map: 59  Reduce: 18   Cumulative CPU: 3903.4 sec     HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48              Cumulative CPU: 2707.34 sec   HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29  Reduce: 10   Cumulative CPU: 3901.99 sec   HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42              Cumulative CPU: 2388.71 sec   HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42  Reduce: 12   Cumulative CPU: 3792.5 sec     HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45  Reduce: 13   Cumulative CPU: 4432.22 sec   HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42              Cumulative CPU: 6052.96 sec   HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK

在hive中,查询仍然需要5个多小时,而在rdbms中,查询只需要5个小时。我需要一些帮助来优化这个查询,以便它执行得更快。有趣的是,当我用4个大的核心示例运行任务时,与用3个大的核心示例运行相比,所花费的时间只提高了10分钟。但是当我用3个med内核运行这个任务时,花了1小时10分钟。
这让我想到了一个问题,“对于这样复杂的连接,Hive是正确的选择吗”?

mkshixfv

mkshixfv1#

我怀疑瓶颈只是在整理你的产品表,因为它似乎比其他大得多。我认为,对于超过一定大小的表,与hive的连接变得不可行,仅仅是因为它们需要排序。
有一些参数可以优化排序,比如io.sort.mb,您可以尝试设置这些参数,以便在内存中进行更多排序,而不是溢出到磁盘、重新读取和重新排序。看看溢出的记录的数量,看看这是否比你的输入要大得多。有多种方法可以优化排序。它还可以帮助您将查询分解为多个子查询,这样就不必一次排序那么多。
对于stock\u info和product\u cat表,您可以将它们保存在内存中,因为它们太小了(请查看brickhouse中的“distributed\u map”自定义项(https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/distributedmapudf.java )对于自定义图像,您可以使用bloom过滤器,如果有一些假阳性不是一个真正的大问题。
要完全删除连接,也许可以将映像信息存储在类似keystone db的hbase中以进行查找。brickhouse也有hbase的udf,比如hbase\u get和base\u cached\u get。

相关问题