我有两个数据集,
EmployeeDetail(data set 1):-
id
name
gender
location
SalaryDetail(data set 2):-
id
salary
我需要加入两者,并找出平均工资的男性和女性在每个位置。所以我试着遵循代码。
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as
(id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation {
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group,
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};
但这是错误的
<line 6, column 22> Syntax error, unexpected symbol at or near
'JoinedEmpDetail'
有谁能帮我一下,我在哪里犯的错,或者如何正确地做?
为了更清楚地了解我的要求,我提供了一些示例数据集。
empdetail.txt文件
1 Biswa Male Bangalore
12 Bratati Mahapatra Female Chennai
2 Bibhu kalyan Male Bangalore
3 Chinta Male Mumbai
10 Amrit Anand Male Bangalore
11 Sateesh panda Male Bangalore
4 Kirti Kumar Male Mumbai
6 Shruthi Female Chennai
7 Vijay Male Chennai
5 Bibhu Male Chennai
9 Bratati Mohanty Female Bangalore
8 Rupa Mahapatra Female Bangalore
13 Salini Female Mumbai
14 Priyanka Chopra Female Mumbai
empsalary.txt文件
1 10000
12 12000
2 15900
3 9000
10 8000
11 13400
4 7600
6 22000
7 17000
5 16800
9 9800
8 10000
13 11000
14 12500
我需要的最终结果是:
Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>
1条答案
按热度按时间sulc1iza1#
你可以用简单的方法解决这个问题
foreach stmt
所以不要使用嵌套的foreach stmt。Group command
不会在嵌套的foreach内工作,其限制在pig内。在嵌套的foreach中只允许几个命令(cross、distinct、filter、foreach、limit和order by)。你能这样改剧本吗?
输出: