Getting counts by iterating over a data bag, with a separate count for each distinct value of a field

Posted by kmb7vmvb on 2021-05-30 in Hadoop

Below is the data I have. The schema is: student name, question number, actual result (either false or Correct).

(b,q1,Correct)
(a,q1,false)
(b,q2,Correct)
(a,q2,false)
(b,q3,false)
(a,q3,Correct)
(b,q4,false)
(a,q4,false)
(b,q5,false)
(a,q5,false)

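What I want to do is count the total number of correct answers and the total number of incorrect answers for each student, i.e. for a and for b.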

bf1o4zei #1

Use this:

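data = LOAD '/abc.txt' USING PigStorage(',') AS (name:chararray, number:chararray, result:chararray);
-- Group on the (name, result) pair so every student/result combination gets its own bag.
B = GROUP data BY (name, result);
-- Flatten the composite group key back into two fields and count the records in each bag.
C = FOREACH B GENERATE FLATTEN(group) AS (name, result), COUNT(data) AS count;
DUMP C;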

The output is:

(a,false,4)
(a,Correct,1)
(b,false,3)
(b,Correct,2)
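
As a quick sanity check of the counts above without a Hadoop cluster, the same group-by-(name, result) logic can be sketched in plain Python. This is only an illustration of what the GROUP and COUNT steps compute, not a replacement for the Pig script; the `rows` list is simply the sample data from the question typed in by hand, with the q5 result for student b assumed to be `false`.

```python
from collections import Counter

# Sample rows from the question: (name, question number, result).
# Assumption: the q5 result for student b is 'false'.
rows = [
    ("b", "q1", "Correct"), ("a", "q1", "false"),
    ("b", "q2", "Correct"), ("a", "q2", "false"),
    ("b", "q3", "false"),   ("a", "q3", "Correct"),
    ("b", "q4", "false"),   ("a", "q4", "false"),
    ("b", "q5", "false"),   ("a", "q5", "false"),
]

# Counterpart of GROUP data BY (name, result) followed by COUNT(data):
# every distinct (name, result) pair gets its own tally.
counts = Counter((name, result) for name, _question, result in rows)

for (name, result), count in sorted(counts.items()):
    print((name, result, count))
```

The tallies match the four tuples listed above; only the order of the printed rows may differ, since Pig does not guarantee an output order without an ORDER BY.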

Hope this is the output you were looking for.

kcugc4gi #2

For the use case described, the Pig script below is sufficient.
Pig script:

student_data = LOAD 'student_data.csv' USING PigStorage(',') AS (student_name:chararray, question_number:chararray, actual_result:chararray);
-- One bag of records per student.
student_data_grp = GROUP student_data BY student_name;
student_correct_answer_data = FOREACH student_data_grp {
    -- Project the result column, then split it into correct and incorrect answers.
    answers = student_data.actual_result;
    correct_answers = FILTER answers BY actual_result == 'Correct';
    incorrect_answers = FILTER answers BY actual_result == 'false';
    GENERATE group AS student_name, COUNT(correct_answers) AS correct_ans_count, COUNT(incorrect_answers) AS incorrect_ans_count;
};

Input: student_data.csv:

b,q1,Correct
a,q1,false
b,q2,Correct
a,q2,false
b,q3,false
a,q3,Correct
b,q4,false
a,q4,false
b,q5,false
a,q5,false

Output (DUMP student_correct_answer_data):

-- schema : (student_name, correct_ans_count, incorrect_ans_count)
(a,1,4)
(b,2,3)
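
The nested FOREACH above groups by student first and then filters each student's bag twice. A minimal Python sketch of that same shape, offered only as a way to reason about the expected output, is shown below. The names `results_by_student` and `summary` are made up for this illustration, and `rows` is the input file typed in by hand.

```python
from collections import defaultdict

# Contents of student_data.csv as (student_name, question_number, actual_result).
rows = [
    ("b", "q1", "Correct"), ("a", "q1", "false"),
    ("b", "q2", "Correct"), ("a", "q2", "false"),
    ("b", "q3", "false"),   ("a", "q3", "Correct"),
    ("b", "q4", "false"),   ("a", "q4", "false"),
    ("b", "q5", "false"),   ("a", "q5", "false"),
]

# Counterpart of GROUP student_data BY student_name: one list of results per student.
results_by_student = defaultdict(list)
for student_name, _question_number, actual_result in rows:
    results_by_student[student_name].append(actual_result)

# Counterpart of the two nested FILTER + COUNT steps: one output row per student,
# with the correct count first and the incorrect count second.
summary = {
    student_name: (results.count("Correct"), results.count("false"))
    for student_name, results in results_by_student.items()
}

for student_name in sorted(summary):
    correct_ans_count, incorrect_ans_count = summary[student_name]
    print((student_name, correct_ans_count, incorrect_ans_count))
```

Both this sketch and the Pig filters compare strings exactly, so a misspelled or differently cased value such as `flase` or `False` would be counted in neither column.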

References: more details on nested FOREACH
http://pig.apache.org/docs/r0.12.0/basic.html#foreach
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
