可供选择的/比自连接配置单元查询更有效的方法

0pizxfdo  于 2021-06-04  发布在  Hadoop
关注(0)|答案(1)|浏览(306)

我有一个hadoop表,如下所述:

mappers
id (int)
mapper (String)
mapperid (int)
date (int)

一些示例行如下所示

1, MAP1, 123, 20140810
1, MAP2, 3421, 20140810
2, MAP1, 34211, 20140810
2, MAP3, 1143, 20140810
3, MAP4, 12, 20140810

我试图将这些结果压缩为与它们相关联的唯一id和mapperid。我希望我的查询将基于上述示例数据返回:

1, 123, 3421, null
2, 34211, null, 1143

下面是我的配置单元查询,在这里我基本上获取了我想要的所有数据,并在适当的情况下将这些数据组合在一起,不幸的是,这需要4个表查找。

select distinct 
full.id, 
mapper01.mapperid, 
mapper02.mapperid, 
mapper03.mapperid 
FROM mappers as full
LEFT JOIN (
    select id, mapperid FROM mappers
    WHERE mapper = "MAP1" AND
    date = 20140810 AND
    length(id) > 0
) AS mapper01 ON mapper01.id = full.id
LEFT JOIN (
    select id, mapperid FROM mappers
    WHERE mapper = "MAP2" AND
    date = 20140810 AND
    length(id) > 0
) AS mapper02 ON mapper02.id = full.id
LEFT JOIN (
    select id, mapperid FROM mappers
WHERE mapper = "MAP3" AND
    date = 20140810 AND
    length(id) > 0
) AS mapper03 ON mapper03.id = full.id
WHERE date = 20140810 AND
length(id) > 0 AND
(full.mapper = "MAP1" OR
full.mapper = "MAP2" OR
full.mapper = "MAP3"
);

我正在考虑使用完全外部连接而不是左连接,这样我只需要3个表查找(最外部的grab all data将是冗余的),并使用一些if逻辑从包含该信息的表中获取full.id。
不过,我想知道是否有比使用连接更好的方法。

uujelgoq

uujelgoq1#

我写了一篇关于如何使用砖房的博客(http://github.com/klout/brickhouse )“collect”自定义项以解决此问题http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/ ).
使用collect将数据分组到单个Map中,而不是执行自连接(在hadoop中,这种连接以多次扫描数据而告终)。然后使用“map\u index”自定义项从该Map访问所需的元素。您的查询如下:

SELECT ID, MAP_INDEX( id_map, 'MAP1' ) as MAP1_ID,
           MAP_INDEX( id_map, 'MAP2' ) as MAP2_ID,
           MAP_INDEX( id_map, 'MAP3' ) as MAP3_ID
 FROM 
   ( SELECT id, COLLECT( mapper, mapperid) as id_map
      FROM mappers
      WHERE  date = 20140810 
       AND  length(id) > 0
     GROUP BY id
   ) m ;

相关问题