Count on GROUP BY on multiple columns and get the original dataset

taor4pac  posted on 2021-06-21  in Pig
2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14  
4, chocolate syrup, Regular, Hersheys, 5   
5, chocolate syrup, No High Fructose, Hersheys, 8  
6, chocolate syrup, Regular, Ghirardeli, 6  
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

Script

data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN(group) AS (item, type), COUNT(data) AS total;
filter_data = FILTER data_cnt BY total > 1;

I now need the original rows that the filter applies to. The output I want is:

4, chocolate syrup, Regular, Hersheys, 5
6, chocolate syrup, Regular, Ghirardeli, 6

rdrgkggo  #1

filter_data will give you (chocolate syrup, Regular). Join filter_data with the original dataset on item and type to get the desired result.

data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN (group) AS(item, type), COUNT(data) as total; 
filter_data = FILTER data_cnt BY total > 1;
o_data = JOIN data BY (item,type),filter_data BY ($0,$1);
final_data = FOREACH o_data GENERATE $0..$4;
DUMP final_data;
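
As an alternative sketch, the join can probably be avoided by filtering the grouped relation directly and flattening the bag of original rows. This assumes the goal is the rows whose (item, type) pair occurs more than once, as in the desired output. The file name 'items.csv', the PigStorage(',') loader, and the field names id, brand and price are assumptions, since the question does not show the LOAD statement.

```pig
-- Assumed LOAD: the path, the delimiter and the names id/brand/price are hypothetical.
data = LOAD 'items.csv' USING PigStorage(',') AS (id, item, type, brand, price);

-- Each group carries a bag named `data` holding the original rows for one (item, type) pair.
data_grp = GROUP data BY (item, type);

-- Keep only the groups that contain more than one row.
dup_grp = FILTER data_grp BY COUNT(data) > 1;

-- Flatten the bag to get back the original five-column rows.
final_data = FOREACH dup_grp GENERATE FLATTEN(data);
DUMP final_data;
```

Since the original rows travel with each group, no second pass over data is needed. If some groups are very large, the JOIN version above may be the safer choice.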
