Not working in Pig

mbzjlibv  posted on 2021-06-02  in  Hadoop

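I am trying to use the DIFF() function in Pig to find the differences between two tables (a source table and a destination table). To do that, I wrote the following: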

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);

destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);

cogroupnew= COGROUP sourcenew by ID inner, destnew by ID inner;

diffnew = FOREACH cogroupnew GENERATE DIFF(sourcenew,destnew);

DUMP diffnew;

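This gives the differences between the two tables, or returns an empty bag {} when the tuples match. It works fine up to this point. My next step is to find the extra records in the source file that do not exist in the destination: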

cogroupextrainsource= COGROUP sourcenew by ID inner, destnew by ID;
filterextrainsource= FILTER cogroupextrainsource BY ID NOT (cogroupnew)

As expected, this throws an error. I need help finding the extra records in the source. Any help would be greatly appreciated.
Thank you!

2nc8po8w  #1

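You do not need the $ sign next to the column name ID. $ is only used when you want to access a column by position instead of by name.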

cogroupextrainsource = COGROUP sourcenew by ID inner, destnew by ID;
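
The answer stops at the COGROUP step. One possible way to go from there to the extra source records is sketched below; it is not something the original answer states. It assumes the built-in IsEmpty function is used to keep only the groups whose destnew bag is empty. The aliases extrainsource, extrarecords, ids_by_name and ids_by_pos are made up for illustration.

```pig
-- Same inputs as in the question.
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray,LastName:chararray,Vertical_Name:chararray,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray,LastName:chararray,Vertical_Name:chararray,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray,Salary:chararray,StateName:chararray);

-- By name and by position: both statements project the first column.
ids_by_name = FOREACH sourcenew GENERATE ID;
ids_by_pos = FOREACH sourcenew GENERATE $0;

-- "inner" on sourcenew drops groups with no source tuples;
-- destnew is left outer, so its bag may be empty.
cogroupextrainsource = COGROUP sourcenew by ID inner, destnew by ID;

-- Assumption: an empty destnew bag means the ID exists only in the source.
extrainsource = FILTER cogroupextrainsource BY IsEmpty(destnew);

-- Turn the surviving source bags back into plain records.
extrarecords = FOREACH extrainsource GENERATE FLATTEN(sourcenew);

DUMP extrarecords;
```

This compares the two files by ID only. Records that share an ID but differ in other columns would not show up here; the DIFF step from the question covers that case.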
