我有一个hadoop表,如下所述:
mappers
id (int)
mapper (String)
mapperid (int)
date (int)
一些示例行如下所示
1, MAP1, 123, 20140810
1, MAP2, 3421, 20140810
2, MAP1, 34211, 20140810
2, MAP3, 1143, 20140810
3, MAP4, 12, 20140810
我试图将这些结果压缩为与它们相关联的唯一id和mapperid。我希望我的查询将基于上述示例数据返回:
1, 123, 3421, null
2, 34211, null, 1143
下面是我的配置单元查询,在这里我基本上获取了我想要的所有数据,并在适当的情况下将这些数据组合在一起,不幸的是,这需要4个表查找。
select distinct
full.id,
mapper01.mapperid,
mapper02.mapperid,
mapper03.mapperid
FROM mappers as full
LEFT JOIN (
select id, mapperid FROM mappers
WHERE mapper = "MAP1" AND
date = 20140810 AND
length(id) > 0
) AS mapper01 ON mapper01.id = full.id
LEFT JOIN (
select id, mapperid FROM mappers
WHERE mapper = "MAP2" AND
date = 20140810 AND
length(id) > 0
) AS mapper02 ON mapper02.id = full.id
LEFT JOIN (
select id, mapperid FROM mappers
WHERE mapper = "MAP3" AND
date = 20140810 AND
length(id) > 0
) AS mapper03 ON mapper03.id = full.id
WHERE date = 20140810 AND
length(id) > 0 AND
(full.mapper = "MAP1" OR
full.mapper = "MAP2" OR
full.mapper = "MAP3"
);
我正在考虑使用完全外部连接而不是左连接,这样我只需要3个表查找(最外部的grab all data将是冗余的),并使用一些if逻辑从包含该信息的表中获取full.id。
不过,我想知道是否有比使用连接更好的方法。
1条答案
按热度按时间uujelgoq1#
我写了一篇关于如何使用砖房的博客(http://github.com/klout/brickhouse )“collect”自定义项以解决此问题http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/ ).
使用collect将数据分组到单个Map中,而不是执行自连接(在hadoop中,这种连接以多次扫描数据而告终)。然后使用“map\u index”自定义项从该Map访问所需的元素。您的查询如下: