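My UDF output is as follows:
Sample records: ({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})
({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})
Schema of the UDF: name:chararray (a single column)
Now I want to read this bag of tuples and generate output like the following: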
Todd,240
Jon,422
The UDF output is stored in a temporary file and read back with a different schema, as shown below:
D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});
After that, I tried to compute the sum using a FOREACH with dot notation:
X = foreach D generate B.T.name,SUM(B.T.denom);
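This fails with the following error:
2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1128: Cannot find field T in name:chararray,denom:int Details at logfile: /home/training/pig_.log
Can you tell me how to do this? I am new to Apache Pig, so I don't know how it iterates over a bag of tuples and computes the sum.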
1 Answer
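You need to group the dataset by name before performing the `SUM`. First `FLATTEN` the bag so that `GROUP` can be applied, then `GROUP` the flattened records on `name`.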
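As a rough illustration of what the FLATTEN, GROUP and SUM steps do to the data, here is a plain-Python model (not Pig code). It uses only the two sample rows from the question, so its totals differ from the output of the full data set; the variable names `rows`, `flattened`, `grouped` and `final_sum` are chosen for this sketch to mirror the Pig aliases.

```python
from collections import defaultdict

# Two sample rows from the question; each row holds one bag of (name, denom) tuples.
rows = [
    [("Todd", d) for d in (1, 1, 1, 1, 5, 10, 20, 10, 10, 10, 10, 10, 10)],
    [("Jon", d) for d in (1, 1, 1, 1, 5, 10, 20, 10, 10, 10, 5, 20, 1)],
]

# Models FLATTEN(B): un-nest every bag so each tuple becomes its own record.
flattened = [record for bag in rows for record in bag]

# Models GROUP ... BY name: collect the records that share a name.
grouped = defaultdict(list)
for name, denom in flattened:
    grouped[name].append((name, denom))

# Models SUM(flattened.denom): add up the denom column inside each group.
final_sum = {
    name: sum(denom for _, denom in records)
    for name, records in grouped.items()
}

print(final_sum)  # {'Todd': 99, 'Jon': 95}
```

The point of the model is that the sum is taken per group of flattened records, not per nested bag, which is why the bag has to be un-nested before grouping.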
```
-- Assumed step (not shown in the original snippet): un-nest each bag so that
-- every (name, denom) tuple becomes its own record.
flattened = FOREACH D GENERATE FLATTEN(B);
grouped = GROUP flattened BY name;
dump grouped;
-- (Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
-- (Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})
final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);
dump final_sum;
-- (Jon,106)
-- (Todd,100)
```