Using Pig to count "Other" for everything outside the top 5

ljsrvy3e  posted on 2021-06-21  in  Pig

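I have hundreds of groups being generated, and I want to avoid sifting through all of them and look only at the ones with the most results. To do that, I count them, sort them, and limit the output to the top 5.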

counts = foreach (group distinctVals by (description)) generate group, COUNT_STAR(distinctVals) as count;
ordered = order counts by count desc;
limited = limit ordered 5;
dump limited;

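However, I would also like a separate count of how many results did not make the top 5, reported as a single group simply called Other.
So my output should look like this: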

(John ,38436)
(Steve ,13654)
(Sarah ,9334)
(Rick ,3241)
(Morty ,784)
(Other ,3421)
dhxwm5r4  #1

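Use RANK. After sorting the data, apply RANK to the ordered relation to generate a rank for each row. This adds a new rank column as the first column. You can then FILTER on the rank column to split the dataset into two relations, limited and other. Once you have the other relation, GROUP ALL and SUM its third column, i.e. $2, the count column. Finally, UNION limited with the summed other relation.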

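counts = foreach (group distinctVals by (description)) generate group, COUNT_STAR(distinctVals) as count;
ordered = order counts by count desc;
-- RANK prepends a column named rank_ordered; description is now $1 and count is $2
ordered1 = rank ordered;

-- keep only (description, count) so both UNION inputs have the same columns
limited = FOREACH (FILTER ordered1 BY rank_ordered <= 5) GENERATE $1 AS description, $2 AS count;
other = FILTER ordered1 BY rank_ordered > 5;

other_grp = GROUP other ALL;
other_sum = FOREACH other_grp GENERATE 'Other' AS description, SUM(other.$2) AS count;

final = UNION limited, other_sum;
dump final;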
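
To sanity-check the logic on a small sample outside Pig, the same steps (count, sort, split at rank 5, sum the remainder) can be mirrored in plain Python. This is only an illustrative sketch, not part of the Pig script: the function name `top_n_with_other` and the sample data are made up for the example.

```python
from collections import Counter


def top_n_with_other(descriptions, n=5, other_label="Other"):
    """Count values, keep the n most frequent, and sum the rest under one label."""
    counts = Counter(descriptions)        # group by description + COUNT_STAR
    ordered = counts.most_common()        # order by count desc
    limited = ordered[:n]                 # rows with rank <= n
    rest = ordered[n:]                    # rows with rank > n
    if not rest:
        # Like GROUP ALL on an empty relation: no "Other" row is produced.
        return limited
    other_sum = sum(count for _, count in rest)   # GROUP ALL + SUM
    return limited + [(other_label, other_sum)]   # UNION


# Hypothetical sample: seven distinct descriptions, two of them outside the top 5.
sample = ["a"] * 6 + ["b"] * 5 + ["c"] * 4 + ["d"] * 3 + ["e"] * 2 + ["f", "g"]
result = top_n_with_other(sample)
print(result)
```

The counts in the result, including the Other row, should add up to the number of input rows, which is a quick way to confirm nothing was dropped.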
