How to identify repeated occurrences of a string column in Hive?

Posted by vddsk6oq on 2021-06-27 in Hive


I have a view like this in Hive:

id        sequencenumber          appname
242539622              1          A
242539622              2          A
242539622              3          A
242539622              4          B
242539622              5          B
242539622              6          C
242539622              7          D
242539622              8          D
242539622              9          D
242539622             10          B
242539622             11          B
242539622             12          D
242539622             13          D
242539622             14          F

I would like to get the following view for each id:

id        sequencenumber          appname    appname_c
242539622              1          A             A
242539622              2          A             A
242539622              3          A             A
242539622              4          B             B_1
242539622              5          B             B_1
242539622              6          C             C
242539622              7          D             D_1
242539622              8          D             D_1
242539622              9          D             D_1
242539622             10          B             B_2
242539622             11          B             B_2
242539622             12          D             D_2
242539622             13          D             D_2
242539622             14          F             F

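Or anything close to this that identifies the recurrence of a given event within a sequence.
My end goal is to compute the time spent in each group of events (or in each state, if you think of it in terms of Markov modelling), taking into account whether there are any loops back to an earlier state. For example, the time spent in B_1 in the example above could then be compared with the time spent in B_2.
I have looked into window functions in Hive (link), but I don't think they can compare rows the way R/Python can.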

Hive pyspark hiveql sparkr pyspark-sql

Source: https://stackoverflow.com/questions/55329678/how-to-identify-repeated-occurrences-of-a-string-column-in-hive

2 answers


mbyulnm01#

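Here is a solution using Hive window functions. I tested it with your data; remove the your_table CTE and use your own table instead. The result is as expected.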

with your_table as (--remove this CTE, use your table instead
select stack(14,
'242539622', 1,'A',
'242539622', 2,'A',
'242539622', 3,'A',
'242539622', 4,'B',
'242539622', 5,'B',
'242539622', 6,'C',
'242539622', 7,'D',
'242539622', 8,'D',
'242539622', 9,'D',
'242539622',10,'B',
'242539622',11,'B',
'242539622',12,'D',
'242539622',13,'D',
'242539622',14,'F'
) as (id,sequencenumber,appname)
) --remove this CTE, use your table instead

select id,sequencenumber,appname, 
       case when sum(new_grp_flag) over(partition by id, group_name) = 1 then appname --only one run of consecutive rows exists for this app (like A)
            else        
            nvl(concat(group_name, '_', 
                       sum(new_grp_flag) over(partition by id, group_name order by sequencenumber) --rolling sum of new_group_flag
                       ),appname) 
        end appname_c       
from
(       

select id,sequencenumber,appname,
       case when appname=prev_appname or appname=next_appname then appname end group_name, --identify group of the same app
       case when appname<>prev_appname or prev_appname is null then 1 end new_grp_flag     --one 1 per each group
from       
(
select id,sequencenumber,appname,
       lag(appname)  over(partition by id order by sequencenumber) prev_appname, --need these columns
       lead(appname) over(partition by id order by sequencenumber) next_appname  --to identify groups of records w same app
from your_table --replace with your table
)s
)s
order by id,sequencenumber
;

Result:

OK
id        sequencenumber     appname    appname_c
242539622       1       A       A
242539622       2       A       A
242539622       3       A       A
242539622       4       B       B_1
242539622       5       B       B_1
242539622       6       C       C
242539622       7       D       D_1
242539622       8       D       D_1
242539622       9       D       D_1
242539622       10      B       B_2
242539622       11      B       B_2
242539622       12      D       D_2
242539622       13      D       D_2
242539622       14      F       F
Time taken: 232.319 seconds, Fetched: 14 row(s)
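
For readers who want to trace the logic outside Hive, here is a minimal plain-Python sketch of the same idea: compare each row with the previous one (the role of lag), start a new run whenever the app changes, and keep a running count of runs per (id, appname). The function name label_runs and the tuple layout are assumptions made for illustration. Unlike the query above, this sketch numbers every run of an app that occurs in more than one run, including runs that are a single row long.

```python
from collections import defaultdict

# Sample rows from the question: (id, sequencenumber, appname)
rows = [
    ("242539622", 1, "A"), ("242539622", 2, "A"), ("242539622", 3, "A"),
    ("242539622", 4, "B"), ("242539622", 5, "B"), ("242539622", 6, "C"),
    ("242539622", 7, "D"), ("242539622", 8, "D"), ("242539622", 9, "D"),
    ("242539622", 10, "B"), ("242539622", 11, "B"), ("242539622", 12, "D"),
    ("242539622", 13, "D"), ("242539622", 14, "F"),
]

def label_runs(rows):
    """Return (id, sequencenumber, appname, appname_c) tuples (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    runs_seen = defaultdict(int)  # (id, appname) -> number of runs started so far
    indexed = []
    prev = None  # (id, appname) of the previous row, playing the role of lag()
    for id_, seq, app in ordered:
        if prev != (id_, app):  # app changed, or a new id started: a new run begins
            runs_seen[(id_, app)] += 1
        indexed.append((id_, seq, app, runs_seen[(id_, app)]))
        prev = (id_, app)
    # After the loop, runs_seen holds the total number of runs per (id, appname).
    # Only apps with more than one run get a numeric suffix.
    return [
        (id_, seq, app, f"{app}_{n}" if runs_seen[(id_, app)] > 1 else app)
        for id_, seq, app, n in indexed
    ]

labels = [row[3] for row in label_runs(rows)]
print(labels)
```

On the sample data this produces the same appname_c column as the Hive result above.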


jxct1oxe2#

You need two window functions to achieve this result.
Using pyspark, and assuming df is your DataFrame:

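from pyspark.sql import functions as F, Window

# Order rows within each id so lag() sees the previous event
w_id = Window.partitionBy("id").orderBy("sequencenumber")
# Ordered window per (id, appname), used to number the runs of each app
w_app = Window.partitionBy("id", "appname").orderBy("sequencenumber")

df.withColumn(
    "fg",
    F.lag("appname").over(w_id)  # previous appname, null on the first row of an id
).withColumn(
    "fg",
    # 1 marks the first row of a new run, 0 marks a continuation
    F.when(
        F.col("fg") == F.col("appname"),
        0
    ).otherwise(1)
).withColumn(
    "fg",
    F.sum("fg").over(w_app)  # running sum = index of the current run of this appname
).show()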
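
Given a run index such as fg per (id, appname), the asker's end goal of measuring each run can be sketched as a group-by over (id, appname, fg). The question's data has no timestamp column, so this plain-Python sketch uses the first and last sequencenumber and the row count of each run as a stand-in for time spent; that substitution, the helper name summarize_runs, and the hard-coded fg values are assumptions made for illustration.

```python
from itertools import groupby

# (id, sequencenumber, appname, fg), where fg is the run index per (id, appname)
rows_with_fg = [
    ("242539622", 1, "A", 1), ("242539622", 2, "A", 1), ("242539622", 3, "A", 1),
    ("242539622", 4, "B", 1), ("242539622", 5, "B", 1), ("242539622", 6, "C", 1),
    ("242539622", 7, "D", 1), ("242539622", 8, "D", 1), ("242539622", 9, "D", 1),
    ("242539622", 10, "B", 2), ("242539622", 11, "B", 2), ("242539622", 12, "D", 2),
    ("242539622", 13, "D", 2), ("242539622", 14, "F", 1),
]

def summarize_runs(rows):
    """Return (id, appname, fg, first_seq, last_seq, n_rows) per run (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    summary = []
    # Rows of one run are adjacent once sorted by (id, sequencenumber),
    # so groupby on (id, appname, fg) yields exactly one group per run.
    for (id_, app, fg), grp in groupby(ordered, key=lambda r: (r[0], r[2], r[3])):
        seqs = [r[1] for r in grp]
        summary.append((id_, app, fg, seqs[0], seqs[-1], len(seqs)))
    return summary

summary = summarize_runs(rows_with_fg)
for run in summary:
    print(run)
```

With a real timestamp column, the same grouping would let you take the difference between the last and first timestamp of each run, so that B run 1 and B run 2 can be compared directly.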

