Hive query: generate an identifier for sequences of rows matching a condition

9bfwbjaz  posted on 2021-06-02 in Hadoop

Suppose I have the following Hive table as input; let's call it connections:

userid  | timestamp   
--------|-------------
1       | 1433258019  
1       | 1433258020
2       | 1433258080
2       | 1433258083
2       | 1433258088
2       | 1433258170
[...]   | [...]
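
For reference, a minimal sketch to reproduce this input locally (the column types are assumptions, since the original DDL is not shown):

CREATE TABLE connections (
  userid    INT,
  -- epoch seconds; depending on the Hive version, the column name timestamp may need backticks
  timestamp BIGINT
);

INSERT INTO TABLE connections VALUES
  (1, 1433258019), (1, 1433258020),
  (2, 1433258080), (2, 1433258083), (2, 1433258088), (2, 1433258170);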

With the following query:

SELECT
    userid,
    timestamp,
    timestamp - LAG(timestamp, 1, 0) OVER w AS timediff,
    CASE
      WHEN timestamp - LAG(timestamp, 1, 0) OVER w > 60
      THEN 'new_session'
      ELSE 'same_session'
    END AS session_state
FROM connections
WINDOW w AS (PARTITION BY userid ORDER BY timestamp ASC);

I am generating the following output:

userid  | timestamp   | timediff   | session_state
--------|-------------|------------|---------------
1       | 1433258019  | 1433258019 | new_session
1       | 1433258020  | 1          | same_session
2       | 1433258080  | 1433258080 | new_session
2       | 1433258083  | 3          | same_session
2       | 1433258088  | 5          | same_session
2       | 1433258170  | 82         | new_session
[...]   | [...]       | [...]      | [...]

How can I instead generate:

userid  | timestamp   | timediff   | sessionid
--------|-------------|------------|-----------------
1       | 1433258019  | 1433258019 | user1-session-1
1       | 1433258020  | 1          | user1-session-1
2       | 1433258080  | 1433258080 | user2-session-1
2       | 1433258083  | 3          | user2-session-1
2       | 1433258088  | 5          | user2-session-1
2       | 1433258170  | 82         | user2-session-2
[...]   | [...]       | [...]      | [...]

Is this possible using only HQL and "well-known" UDFs (I would rather not use a custom UDF or a reducer script)?


vu8f3i0k1#

This works:

SELECT 
  userid,
  timestamp,
  timediff,
  CONCAT(
    'user',
     userid,
     '-',
     'session-',
     CAST(timediff / 60 AS INT) + 1
  ) AS session_id
FROM (
    SELECT   
      userid,
      timestamp,
      timestamp - LAG(timestamp, 1, timestamp) OVER w AS timediff
    FROM connections
    WINDOW w AS (
      PARTITION BY userid
      ORDER BY timestamp ASC
    )
) a;

Output:

userid  timestamp   timediff    session_id
1       1433258019  0.0         user1-session-1
1       1433258020  1.0         user1-session-1
2       1433258080  0.0         user2-session-1
2       1433258083  3.0         user2-session-1
2       1433258088  5.0         user2-session-1
2       1433258170  82.0        user2-session-2
3       1433258270  0.0         user3-session-1

If you don't need timediff, you can try something like this:
select userid, timestamp, session_count + concat('user', userid, '-', 'session-', cast(lag(session_count - 1, 1, 0) over w1 as string)) as session_state
-- lag(session_count - 1, 1, 0) over w1 as session_count_new from (select userid, timestamp, timediff, cast(timediff / 60 as int) + 1 as session_count
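
If you only want the session id column in the result, a minimal sketch along the same lines, assuming it is enough to simply drop timediff from the outer select of the first query, could be:

SELECT
  userid,
  timestamp,
  CONCAT('user', userid, '-', 'session-', CAST(timediff / 60 AS INT) + 1) AS session_id
FROM (
  SELECT
    userid,
    timestamp,
    -- default to the row's own timestamp so the first row per user gets timediff = 0
    timestamp - LAG(timestamp, 1, timestamp) OVER w AS timediff
  FROM connections
  WINDOW w AS (
    PARTITION BY userid
    ORDER BY timestamp ASC
  )
) a;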


5lwkijsr2#

Use something like the following:

select concat_ws('-', name, city) from employee;

The first argument of concat_ws is the separator; name and city are column names of the employee table. Make sure they are of string type. You can find more here.
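
As a rough sketch of how concat_ws could be applied to this question (sessions_with_counter and s_sum are hypothetical names for a subquery that already carries the cumulative per-user session counter computed in the next answer):

SELECT
  userid,
  timestamp,
  -- concat_ws expects string arguments, hence the casts
  concat_ws('-', concat('user', cast(userid AS STRING)), 'session', cast(s_sum AS STRING)) AS sessionid
FROM sessions_with_counter;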


atmip9wb3#

Interesting question. Based on your comment on @madhu's answer, I added a row 2 1433258172 to your example. What you need is to increment the session number every time timediff > 60 is satisfied. The easiest way is to flag those rows and then take a cumulative sum over a window.
Query:

select userid
  , timestamp
  , concat('user', userid, '-session-', s_sum) sessionid
from (
  select *
    , sum( counter ) over (partition by userid
                           order by timestamp asc
                           rows between unbounded preceding and current row) s_sum
  from (
    select *
      , case when timediff > 60 then 1 else 0 end as counter
    from (
      select userid
        , timestamp
        , timestamp - lag(timestamp, 1, 0) over (partition by userid
                                                 order by timestamp asc) timediff
      from connections ) x ) y ) z

Output:

1   1433258019  user1-session-1
1   1433258020  user1-session-1
2   1433258080  user2-session-1
2   1433258083  user2-session-1
2   1433258088  user2-session-1
2   1433258170  user2-session-2
2   1433258172  user2-session-2
