hive collect_list() to collect column values when a column has consecutive duplicates

iqjalb3h · posted 2021-05-17 in Spark

When a column has the same value on consecutive rows, I need to collect the corresponding field values into one list, and start a new list when that value shows up again after a break. I tried collect_list(), but it groups all equal values together regardless of whether they are consecutive (a sketch of that naive grouping is shown after the expected output below). The table looks like this:

| Timestamp | id | Grp | CD |
|-----------|----|-----|----|
| 05:59     | 1  | A   | W1 |
| 06:00     | 1  | A   | W2 |
| 7:00      | 1  | B   | W3 |
| 7:00      | 1  | A   | W4 |
| 7:01      | 1  | A   | W5 |
| 7:02      | 1  | A   | W6 |

The table is ordered by Timestamp.
I want the result to look like this:

| id | agg        |
|----|------------|
| 1  | [W1,W2]    |
| 1  | [W3]       |
| 1  | [W4,W5,W6] |
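
For context, this is roughly what the naive approach does; a minimal sketch of grouping by id and Grp only, assuming the table above is registered as a hypothetical temp view named t:

```scala
// Naive grouping by id and Grp: every row with the same Grp lands in one list,
// so the run boundaries (A,A | B | A,A,A) are lost. "t" is a hypothetical view name.
spark.sql("""
  SELECT id, Grp, collect_list(CD) AS agg
  FROM t
  GROUP BY id, Grp
""").show(false)
// Yields only two rows for id = 1: [W1, W2, W4, W5, W6] (in some order) for Grp = A
// and [W3] for Grp = B, instead of the three run-based lists above.
```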

7cjasjjr 1#

I tried a similar scenario for my team; please find it below.

```scala
// Sample data plus a temp view; lag/lead over the Timestamp order expose the previous and next Grp.
val df = Seq(("05:59","1","A"),("06:00","1","A"),("7:00","1","B"),("7:00","1","A"),("7:01","1","A"),("7:02","1","A")).toDF("Timestamp","id","Grp")
df.createOrReplaceTempView("df")
val df2 = spark.sql("select *, lag(grp) OVER w as prev_grp, lead(grp) OVER w as next_grp from df WINDOW w AS (ORDER BY Timestamp)")
df2.createOrReplaceTempView("df2")
```

spark.sql("""select id,collect_list(grp)  from (select *,
  SUM(CASE WHEN (grp=prev_grp and grp = next_grp) THEN 0  
  WHEN (grp=next_grp and grp != prev_grp) THEN 1 
  WHEN (grp=prev_grp and (grp != next_grp or next_grp is null) ) THEN 0 
  ELSE 1 END) OVER
  (ORDER BY Timestamp
   ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW)+1 as EVENT_SEQ
from df2
ORDER BY Timestamp) s group by EVENT_SEQ,id""").show(false)

Input:

```
df.show(false)
+---------+---+---+
|Timestamp|id |Grp|
+---------+---+---+
|05:59    |1  |A  |
|06:00    |1  |A  |
|7:00     |1  |B  |
|7:00     |1  |A  |
|7:01     |1  |A  |
|7:02     |1  |A  |
+---------+---+---+
```

Output:

```
+---+-----------------+
|id |collect_list(grp)|
+---+-----------------+
|1  |[A, A, A]        |
|1  |[B]              |
|1  |[A, A]           |
+---+-----------------+
```
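
Note that the answer collects Grp itself, while the question's expected output collects the CD column. The same run detection can be written with the DataFrame API and applied to CD; a minimal sketch, rebuilding the sample with the CD column that the answer's data omits (variable and column names here are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sample rebuilt with the question's CD column
// (assumes spark.implicits._ is in scope for toDF, as in the answer's snippet).
val dfCd = Seq(
  ("05:59","1","A","W1"), ("06:00","1","A","W2"), ("7:00","1","B","W3"),
  ("7:00","1","A","W4"), ("7:01","1","A","W5"), ("7:02","1","A","W6")
).toDF("Timestamp","id","Grp","CD")

val byTime = Window.orderBy("Timestamp")

// Flag the first row of each run: Grp differs from the previous row (or there is no previous row).
val withFlag = dfCd.withColumn("new_run",
  when(lag(col("Grp"), 1).over(byTime).isNull ||
       lag(col("Grp"), 1).over(byTime) =!= col("Grp"), 1).otherwise(0))

// A running sum of the flags turns them into a run id, the same idea as EVENT_SEQ above.
val withRuns = withFlag.withColumn("run_id",
  sum(col("new_run")).over(byTime.rowsBetween(Window.unboundedPreceding, Window.currentRow)))

withRuns
  .groupBy(col("id"), col("run_id"))
  .agg(collect_list(col("CD")).as("agg"))
  .orderBy("run_id")
  .show(false)
// Should give [W1, W2], [W3], [W4, W5, W6] for id = 1, matching the expected output in the question.
```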
