Adding a group id with HiveQL

Posted by yftpprvb on 2021-06-25 in Hive

I have a table in Hadoop with two columns that contain string data.
For example, given a table like this:

+---------+------+
|   v1    | v2   |
+---------+------+
|    1    | 2    |
|    1    | 3    |
|    2    | 3    |
|    4    | 5    |
|    6    | 7    |
+---------+------+

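Now I want to add a group id to every row. Rows that share a value get the same group id, regardless of which column the value appears in.
The expected output looks like this: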

+---------+------+------+
|   v1    | v2   | gid  |
+---------+------+------+
|    1    | 2    | 1    |
|    1    | 3    | 1    |
|    2    | 3    | 1    |
|    4    | 5    | 2    |
|    6    | 7    | 3    |
+---------+------+------+

How can I write a HiveQL query that produces this?

gv8xihay #1

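In general this is a graph-walking problem, as @gordonlinoff said. However, if the task can be simplified as in your example, with a limited number of columns and a defined ordering (I assume the row order is determined by your columns), then it is entirely solvable. See this demo, which uses your dataset and produces the required result (please read the comments in the code):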

with your_data as ( -- sample data from the question
  select stack(5,
    1, 2,
    1, 3,
    2, 3,
    4, 5,
    6, 7
  ) as (v1, v2)
)

select v1, v2,
       -- group id = running count of new_grp flags (count ignores NULLs)
       count(new_grp) over (order by v1, v2
                            rows between unbounded preceding and current row) as gid
from (
  select v1, v2,
         -- flag a new group when the row shares no value with the previous row
         case when prev_tuple is null
                or (not array_contains(prev_tuple, v1)
                    and not array_contains(prev_tuple, v2))
              then true
         end as new_grp
  from (
    -- previous row's values as an array, to simplify the check above
    select v1, v2, lag(tuple) over (order by v1, v2) as prev_tuple
    from (select v1, v2, array(v1, v2) as tuple from your_data) t
  ) p
) f;

Result:

v1      v2      gid
1       2       1
1       3       1
2       3       1
4       5       2
6       7       3
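
The query compares each row only with the immediately preceding row in (v1, v2) order, which is why it depends on the simplifying assumption stated in the answer. As a rough illustration of the general graph case, here is a minimal Python sketch that runs outside Hive. The function names `gid_prev_row` and `gid_union_find` are hypothetical: the first mimics the lag-based logic of the query, the second groups rows by connected component using union-find. It is a sketch under those assumptions, not a drop-in replacement for the HiveQL.

```python
from typing import Hashable, List, Tuple

Row = Tuple[Hashable, Hashable]


def gid_prev_row(rows: List[Row]) -> List[int]:
    """Mimic the lag-based query: a row starts a new group unless it
    shares a value with the previous row. Expects rows sorted by (v1, v2)."""
    gids, gid, prev = [], 0, None
    for v1, v2 in rows:
        if prev is None or (v1 not in prev and v2 not in prev):
            gid += 1  # corresponds to the new_grp flag being set
        gids.append(gid)
        prev = (v1, v2)
    return gids


def gid_union_find(rows: List[Row]) -> List[int]:
    """Assign one id per connected component of the value graph,
    where every row (v1, v2) is an edge between two values."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for v1, v2 in rows:
        parent[find(v1)] = find(v2)  # merge the two components

    ids = {}  # component root -> group id, numbered by first appearance
    return [ids.setdefault(find(v1), len(ids) + 1) for v1, _ in rows]


sample = [(1, 2), (1, 3), (2, 3), (4, 5), (6, 7)]
print(gid_prev_row(sample))    # [1, 1, 1, 2, 3]
print(gid_union_find(sample))  # [1, 1, 1, 2, 3]

tricky = [(1, 2), (1, 5), (3, 4), (3, 5)]
print(gid_prev_row(tricky))    # [1, 1, 2, 2]
print(gid_union_find(tricky))  # [1, 1, 1, 1]
```

On the question's data both approaches agree. On the second input, row (3, 5) is linked to row (1, 5) through the value 5, but the two rows are not adjacent in sort order, so the previous-row logic splits them into two groups while the component-based version keeps them together.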
