Reusing a Pig group in a nested foreach statement

Posted by ghhkc1vu on 2021-06-21 in Pig

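I'm trying to group records, compute the average of SCORE1 per group, keep only the bottom half of the scores (those below the group average), and then compute the average of SCORE2 over what remains. Obviously I could compute the summary statistics and join them back to the original dataset, but I would prefer to reuse the intermediate grouped values.
Example input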

ID,GROUPBY,SCORE1,SCORE2
1,A,58.8,67.3
2,A,85.2,76.3
3,B,49.1,90.7
4,B,78.3,99.8

Pig script

records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join grouped BY group, avgscore BY GROUPBY USING 'replicated';
results = foreach joined {
    scores = foreach records generate SCORE1,SCORE2;
    low = FILTER scores by SCORE1 < avgscore.AVGSCORE;
    GENERATE GROUPBY, AVG(low.SCORE2);
};
dump results;

Expected output

A    67.3
B    90.7

However, this fails with java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (A,72.0), 2nd :(B,63.7)

Answer #1, by erhoui1w

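On line 4 of your script you are actually joining two different data structures: grouped (one row per group, holding a bag of records) and avgscore (flat, one row per group). Join the flat records relation to avgscore instead.
You should do: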

joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
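For context, the "Scalar has more than one row" error appears to come from avgscore.AVGSCORE inside the nested block. Referring to a field of another relation with dot notation is a scalar projection, which Pig only accepts when that relation has exactly one row, and avgscore has one row per group. Once records is joined to avgscore, the average is an ordinary column of every joined row, so a nested foreach can compare against it directly. A minimal sketch, assuming example.csv has no header row and the column types declared below (the types and the aliases regrouped and LOW_AVG are assumptions, not part of the original script):

```pig
-- Assumed schema; the question loads these columns without types.
records = LOAD 'example.csv' USING PigStorage(',')
          AS (ID:int, GROUPBY:chararray, SCORE1:double, SCORE2:double);
grouped = GROUP records BY GROUPBY;
avgscore = FOREACH grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;

-- Join the flat records (not the grouped bags) to the per-group averages.
joined = JOIN records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';

-- Regroup the joined rows so a nested foreach can filter each group's bag.
regrouped = GROUP joined BY records::GROUPBY;
results = FOREACH regrouped {
    -- AVGSCORE is a column of each joined row here, not a scalar lookup.
    low = FILTER joined BY SCORE1 < AVGSCORE;
    GENERATE group AS GROUPBY, AVG(low.SCORE2) AS LOW_AVG;
};
DUMP results;
```

On the sample input this should keep ID 1 for group A (58.8 is below the group average of 72.0) and ID 3 for group B (49.1 is below 63.7), which matches the expected output in the question.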

Edit: I would rewrite it like this to avoid confusion (because after the join there are two GROUPBY columns):

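-- Load the raw records (untyped, as in the question)
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
-- One flat row per group: (GROUPBY, AVGSCORE)
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
-- Join the flat records (not the grouped bags) to the per-group averages
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
-- Keep a single GROUPBY column so later references are unambiguous
joined_reduced = foreach joined generate ID, records::GROUPBY as GROUPBY, AVGSCORE, SCORE1, SCORE2;
-- Keep only the rows whose SCORE1 is below their group's average
filter_joined = filter joined_reduced by (SCORE1 < AVGSCORE);
grouped2 = group filter_joined by GROUPBY;
result = foreach grouped2 generate flatten (group), AVG(filter_joined.SCORE2) as low_avg;

dump result;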
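If the goal is to avoid the join altogether, one option is to compute the average and flatten the group's bag in the same foreach, so that every record carries its own group average. This is a sketch under stated assumptions rather than a definitive solution: it assumes example.csv has no header row, the column types are guesses, and the aliases with_avg, low and by_group are illustrative names that do not appear in the original script.

```pig
-- Assumed schema; the question loads these columns without types.
records = LOAD 'example.csv' USING PigStorage(',')
          AS (ID:int, GROUPBY:chararray, SCORE1:double, SCORE2:double);
grouped = GROUP records BY GROUPBY;

-- Compute the per-group average and expand the bag back into rows in the
-- same statement; each output row is (GROUPBY, AVGSCORE, SCORE1, SCORE2).
with_avg = FOREACH grouped GENERATE
               group AS GROUPBY,
               AVG(records.SCORE1) AS AVGSCORE,
               FLATTEN(records.(SCORE1, SCORE2)) AS (SCORE1:double, SCORE2:double);

-- Keep the rows below their group's average, then average SCORE2 per group.
low = FILTER with_avg BY SCORE1 < AVGSCORE;
by_group = GROUP low BY GROUPBY;
result = FOREACH by_group GENERATE group AS GROUPBY, AVG(low.SCORE2) AS low_avg;
DUMP result;
```

This reuses the first grouping for the average, at the cost of a second GROUP after the filter. Whether it is cheaper than the replicated join depends on the data, so it is worth comparing both on a realistic input.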
