如何在pig或hive中使用array\u agg()聚合函数

qfe3c7zg  于 2021-06-24  发布在  Pig
关注(0)|答案(2)|浏览(524)

我有以下数据:

================================================================
session_id                       screen_name  screen_launch_time
================================================================
990004916946605-1404157897784    screen1      1404157898275
990004916946605-1404157897784    screen2      1404157898337
990004947764274-1435162269418    screen1      1435162274044
990004947764274-1435162269418    screen3      1435162274081

我想使用 array_agg 函数以以下格式获取我的数据:

=========================================================
session_id                      screen_flow         count
=========================================================
990004916946605-1404157897784   screen1->screen2    1
990004947764274-1435162269418   screen1->screen3    1

有人试过写一本书吗 UDAF 或者 python 脚本来实现 array_agg 功能?
请分享你的想法。

cig3rfwq

cig3rfwq1#

只需按 session_id ,浓缩液 screen_name ,并对每组记录进行计数。如果你不想建造砖瓦罐,你可以使用 collect_list() 而不是 collect() (但我不推荐)。
查询:

add jar /path/to/jars/brickhouse-0.7.1.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";

select session_id, screen_flow
  , count(*) count
from (
  select session_id
    , concat_ws('->', collect(screen_name)) screen_flow
  from db.table
  group by session_id ) x
group by session_id, screen_flow

输出:

990004916946605-1404157897784   screen1->screen2    1
990004947764274-1435162269418   screen1->screen3    1
juzqafwq

juzqafwq2#

输入:-

990004916946605-1404157897784,screen1,1404157898275
990004916946605-1404157897784,screen2,1404157898337
990004947764274-1435162269418,screen1,1435162274044
990004947764274-1435162269418,screen3,1435162274081

下面是Pig式的答案。。

records = LOAD '/user/user/inputfiles/session_id.txt' USING PigStorage(',') AS (session_id:chararray,screen_name:chararray,screnn_launch_time:chararray);

rec_grped = GROUP records BY session_id;

rec_each =  FOREACH rec_grped 
                     {
                       rec_inner_each = FOREACH records GENERATE screen_name;

                       GENERATE group as session_id, REPLACE(BagToString(rec_inner_each),'_','-->') as screen_flow, 1 as cnt;
};

dump rec_each;

输出:-

990004916946605-1404157897784   screen1-->screen2    1
990004947764274-1435162269418   screen1-->screen3    1

相关问题