在sql(hive)中使用collect\u list函数聚合用户序列

w8biq8rn  于 2021-06-01  发布在  Hadoop
关注(0)|答案(2)|浏览(366)

我有一个数据集,它代表个人的使用情况,一种他们必须输入值的形式。
除第一步“名字”外,表格可以按任何顺序填写。
时间戳用于暗示表单的完成顺序。
例如,用户12345678于2017年10月25日04:58:08开始填写表格,并按顺序填写表格。
用户12345679在同一天的05:00:02开始填写表单,但只执行步骤2
用户12345680在05:05:06开始填写表单,但分心,没有超出步骤1,但在完成之前又返回了两次
用户12345681在06:31:12开始填写表单,并从步骤1开始,但随后随机填写表单。

|    date  |     timestamp      | user_id  |  visit_id   |      event    |  event_seq  |
|--------------------------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 |    1234     |   firstname   |       1     |
|2017-10-25| 2017-10-25 04:58:20| 12345678 |    1234     |   lastname    |       2     |
|2017-10-25| 2017-10-25 04:58:35| 12345678 |    1234     |      dob      |       3     |
|2017-10-25| 2017-10-25 04:58:40| 12345678 |    1234     |   postcode    |       4     |
|2017-10-25| 2017-10-25 04:58:40| 12345678 |    1234     |     email     |       5     |
|2017-10-25| 2017-10-25 05:00:02| 12345679 |    1235     |   firstname   |       1     |
|2017-10-25| 2017-10-25 05:00:10| 12345679 |    1235     |   lastname    |       2     |
|2017-10-25| 2017-10-25 05:05:06| 12345680 |    1236     |   firstname   |       1     |
|2017-10-25| 2017-10-25 05:30:24| 12345680 |    1236     |   firstname   |       1     |
|2017-10-25| 2017-10-25 06:17:24| 12345680 |    1236     |   firstname   |       1     |
|2017-10-25| 2017-10-25 06:20:30| 12345680 |    1236     |   lastname    |       2     |
|2017-10-25| 2017-10-25 06:20:45| 12345680 |    1236     |      dob      |       3     |
|2017-10-25| 2017-10-25 06:20:45| 12345680 |    1236     |   postcode    |       4     |
|2017-10-25| 2017-10-25 06:20:45| 12345680 |    1236     |     email     |       5     |
|2017-10-25| 2017-10-25 06:31:12| 12345681 |    1237     |   firstname   |       1     |
|2017-10-25| 2017-10-25 06:31:18| 12345681 |    1237     |     email     |       5     |
|2017-10-25| 2017-10-25 06:31:50| 12345681 |    1237     |   lastname    |       2     |
|2017-10-25| 2017-10-25 06:32:16| 12345681 |    1237     |   postcode    |       4     |
|2017-10-25| 2017-10-25 06:32:40| 12345681 |    1237     |      dob      |       3     |

我编写的代码如下所示,并引用了一个预先存在的表,在这种情况下,when用于在名为“events”的变量中为表单的每个步骤分配一个数字:

SELECT date
,time_stamp
,user_id
,visit_id
,collect_list(events) as event_seq
FROM my_table
GROUP BY date
,start_time
time_stamp
,user_id
,visit_id

正如人们所料,这似乎是将用户12345680的所有交互都列在一个字符串中;

|    date  |     timestamp      | user_id  |  visit_id   |  event_seq  |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 |    1234     |  1,2,3,4,5  |
|2017-10-25| 2017-10-25 05:00:02| 12345679 |    1235     |     1,2     |
|2017-10-25| 2017-10-25 05:05:06| 12345680 |    1236     |1,1,1,2,3,4,5| 
|2017-10-25| 2017-10-25 06:31:12| 12345681 |    1237     |  1,5,2,4,3, |

但是,我想看到的是序列中第一个事件取消标记的每一行,类似于下面的结果集,其中每个用户12345680的重新启动都发生在不同的行上。

|    date  |     timestamp      | user_id  |  visit_id   |  event_seq  |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 |    1234     |  1,2,3,4,5  |
|2017-10-25| 2017-10-25 05:00:02| 12345679 |    1235     |     1,2     |
|2017-10-25| 2017-10-25 05:05:06| 12345680 |    1236     |      1      |
|2017-10-25| 2017-10-25 05:30:24| 12345680 |    1236     |      1      |
|2017-10-25| 2017-10-25 06:17:24| 12345680 |    1236     |  1,2,3,4,5  |      
|2017-10-25| 2017-10-25 06:31:12| 12345681 |    1237     |  1,5,2,4,3, |

有没有人能给我提供一些指导,告诉我如何使用collect\u list来实现我想要的结果集?

56lgkhnf

56lgkhnf1#

使用lead window函数根据event\ seq列生成next\ event\ seq列。这将为您提供另一列,其中每行的下一个事件序列。现在在where子句中使用它来比较事件序列。只要事件序列小于下一个事件序列,就意味着它是序列的一部分,需要分组。

select date,time_stamp,user_id,visit_id,collect_list(events) as event_seq
from
   ( select *,lead(event_seq,1) over (order by date,timestamp,user_id,event_seq) as next_event_seq from my_table ) T
where event_seq < T.next_event_seq
group by date,time_stamp,user_id,visit_id
h79rfbju

h79rfbju2#

在sql中,表中似乎有一个名为start\u time的列。假设您有一个,请参阅下面的解决方案
table

CREATE EXTERNAL TABLE my_table(
  event_date DATE,
  event_start_timestamp TIMESTAMP,
  event_timestamp TIMESTAMP,
  user_id STRING,
  visit_id STRING,
  event STRING,
  event_seq STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE;

数据

2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:08| 12345678 |    1234     |   firstname   |       1     
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:20| 12345678 |    1234     |   lastname    |       2     
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:35| 12345678 |    1234     |      dob      |       3     
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 |    1234     |   postcode    |       4     
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 |    1234     |     email     |       5     
2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:02| 12345679 |    1235     |   firstname   |       1     
2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:10| 12345679 |    1235     |   lastname    |       2     
2017-10-25| 2017-10-25 05:05:06| 2017-10-25 05:05:06| 12345680 |    1236     |   firstname   |       1     
2017-10-25| 2017-10-25 05:30:24| 2017-10-25 05:30:24| 12345680 |    1236     |   firstname   |       1     
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:17:24| 12345680 |    1236     |   firstname   |       1     
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:30| 12345680 |    1236     |   lastname    |       2     
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 |    1236     |      dob      |       3     
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 |    1236     |   postcode    |       4     
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 |    1236     |     email     |       5     
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:12| 12345681 |    1237     |   firstname   |       1     
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:18| 12345681 |    1237     |     email     |       5     
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:50| 12345681 |    1237     |   lastname    |       2     
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:16| 12345681 |    1237     |   postcode    |       4     
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:40| 12345681 |    1237     |      dob      |       3

SQL查询

SELECT event_date,
       user_id, 
       visit_id, 
       event_start_timestamp, 
       collect_list(event_seq)
  FROM (SELECT event_date, 
               event_start_timestamp, 
               event_timestamp, 
               user_id, 
               visit_id, 
               event_seq 
          FROM my_table 
          SORT BY user_id, visit_id, event_start_timestamp, event_timestamp ASC) v
 GROUP BY event_date, user_id, visit_id, event_start_timestamp ;

输出

2017-10-25   12345678       1234        2017-10-25 04:58:08 ["       1     ","       2     ","       3     ","       4     ","       5     "]
2017-10-25   12345679       1235        2017-10-25 05:00:02 ["       1     ","       2     "]
2017-10-25   12345680       1236        2017-10-25 05:05:06 ["       1     "]
2017-10-25   12345680       1236        2017-10-25 05:30:24 ["       1     "]
2017-10-25   12345680       1236        2017-10-25 06:17:24 ["       1     ","       2     ","       3     ","       4     ","       5     "]
2017-10-25   12345681       1237        2017-10-25 06:31:12 ["       1     ","       5     ","       2     ","       4     ","       3     "]

让我们知道这是否有效!
请不要将列名用作日期、时间戳等,它们是保留字:)

相关问题