sql: Deduplicating IDs with Hive QL/Impala/Python

Posted by qmb5sa22 on 2021-06-26 in Hive


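I need help deduplicating a list of users (20 million+) across a set of different IDs.
Here is what it looks like:
- We have 3 kinds of user ID: id1, id2 and id3.
- At least two of them are always sent together: id1 with id2, or id2 with id3. id3 is never sent with id1.
- A user can have several id1, id2 or id3 values.
- So sometimes my table contains several rows with many different IDs, yet all of them may describe a single user.
For example: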

All of these IDs describe a single user.
I think I could add a fourth ID (GroupId) and use that to deduplicate. Something like this:

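The problem is: I know how to do this on SQL Server with cursor/open/fetch/next commands, but only Hive QL, Impala and Python are available in my environment.
Does anyone know the best way to do this?
Many thanks,
Hugo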

sql Hive impala python duplicates

Source: https://stackoverflow.com/questions/49402019/deduplicating-ids-with-hive-ql-impala-python

1 answer


6rvt4ljy1#

Based on your example, and assuming id2 is always present, you can aggregate the rows, grouping by id2:

select max(id1) id1,  id2, max(id3) id3 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 group by id2;

OK
A       1       Alpha
A       2       Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)
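
Since the question lists Python as available, the same collapse can be sketched in plain Python to make the step easier to inspect. This is only an illustration on the four sample rows: the function name `collapse_by_id2` is hypothetical, and, like the query, it assumes id2 is never null.

```python
def collapse_by_id2(rows):
    """Mimic `select max(id1), id2, max(id3) ... group by id2` on (id1, id2, id3) tuples."""
    merged = {}
    for id1, id2, id3 in rows:
        best1, best3 = merged.get(id2, (None, None))
        # SQL max() ignores NULLs, so None values are skipped here as well.
        if id1 is not None and (best1 is None or id1 > best1):
            best1 = id1
        if id3 is not None and (best3 is None or id3 > best3):
            best3 = id3
        merged[id2] = (best1, best3)
    # Emit rows in id2 order, in the same column order as the query.
    return [(v[0], k, v[1]) for k, v in sorted(merged.items())]


# Sample rows taken from the query above.
rows = [("A", 1, None), (None, 1, "Alpha"), (None, 2, "Beta"), ("A", 2, None)]
print(collapse_by_id2(rows))
```

The two resulting rows still share id1 = 'A', so grouping by id2 alone does not yet merge them into one user.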

Now I tried to implement the logic you described:

select --pass2: a row that shares id2 with the previous row inherits that row's GroupId
       id1, id2, id3,
       case when lag(id2) over (order by id2, GroupId) = id2
            then lag(GroupId) over (order by id2, GroupId)
            else GroupId
       end as GroupId2
  from
  (
   select --pass1: a row that shares id1 with the previous row inherits that row's NVL(id1, id3)
          id1, id2, id3,
          case when lag(id1) over (order by id1, NVL(id1, id3)) = id1
               then lag(NVL(id1, id3)) over (order by id1, NVL(id1, id3))
               else NVL(id1, id3)
          end as GroupId
     from
     ( --your dataset as in example
      select 'A'  as id1, 1 as id2,  null   as id3 union all
      select null as id1, 1 as id2, 'Alpha' as id3 union all
      select null as id1, 2 as id2, 'Beta'  as id3 union all
      select 'A'  as id1, 2 as id2,  null   as id3
     ) s
  ) s --pass1
;

OK
id1     id2     id3     groupid2
A       1       NULL    A
NULL    1       Alpha   A
A       2       NULL    A
NULL    2       Beta    A
Time taken: 106.944 seconds, Fetched: 4 row(s)
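
The two lag passes above resolve the four-row example. For longer chains of shared IDs, one alternative is to treat every ID as a node and merge the IDs that appear in the same row, which is a union-find (disjoint-set) approach rather than the window-function logic of the query. Below is a minimal sketch in Python, which the question lists as available. The function name `assign_group_ids` and the integer labels are hypothetical, and it assumes each row carries at least one non-null ID, as the question states. It has not been tested at the 20 million row scale, and whether the in-memory dictionary fits depends on the environment.

```python
def assign_group_ids(rows):
    """Return one integer label per row; rows linked by any shared ID,
    directly or through a chain of rows, receive the same label."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_keys = []
    for id1, id2, id3 in rows:
        # Tag each value with its column so an id1 'A' and an id3 'A' stay distinct.
        keys = [(col, val)
                for col, val in (("id1", id1), ("id2", id2), ("id3", id3))
                if val is not None]
        # Assumption: every row has at least one non-null ID.
        for other in keys[1:]:
            union(keys[0], other)
        first_keys.append(keys[0])

    # Number the groups in order of first appearance.
    labels = {}
    return [labels.setdefault(find(k), len(labels) + 1) for k in first_keys]


# The four sample rows plus one unrelated, made-up row for contrast.
rows = [("A", 1, None), (None, 1, "Alpha"), (None, 2, "Beta"),
        ("A", 2, None), ("B", 3, None)]
print(assign_group_ids(rows))
```

The returned labels play the role of the GroupId column from the question: the four sample rows share one label, and the unrelated row gets its own.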

Answered on 2021-06-26
