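I am trying to use the DIFF() function in Pig to find the differences between two tables (a source table and a destination table). This is what I have so far: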
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
cogroupnew= COGROUP sourcenew by ID inner, destnew by ID inner;
diffnew = FOREACH cogroupnew GENERATE DIFF(sourcenew,destnew);
DUMP diffnew;
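This gives the differences between the two tables, or returns an empty bag {} when the tuples match. It works fine up to this point. My next step is to find the extra records in the source that do not exist in the destination: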
cogroupextrainsource= COGROUP sourcenew by ID inner, destnew by ID;
filterextrainsource= FILTER cogroupextrainsource BY ID NOT (cogroupnew)
It throws an error, as expected. I need help finding the extra records in the source. Any help would be much appreciated.
Thank you!
1 Answer
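You do not need the $ sign next to the column name ID. The $ prefix is only for positional access (for example $0 for the first field), which you use when you do not want to refer to a column by name.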
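For finding the records that exist only in the source, one possible approach is sketched below. It is not taken from the question or the answer. It relies on two things Pig does after a COGROUP: the key field is named group (not ID), and each input becomes a bag named after its relation. The relation names extrainsource_grp, onlyinsource, extrainsource, ids_by_name and ids_by_pos are made up for illustration. sourcenew and destnew are the relations loaded in the question.

```pig
-- Sketch only: the new relation names here are hypothetical.
-- sourcenew and destnew are assumed to be loaded as in the question.

-- Group both relations by ID. INNER on sourcenew drops IDs that
-- have no source record; destnew is left as an outer input.
extrainsource_grp = COGROUP sourcenew BY ID INNER, destnew BY ID;

-- After COGROUP the schema is (group, sourcenew, destnew).
-- An empty destnew bag means the ID is missing from the destination.
onlyinsource = FILTER extrainsource_grp BY IsEmpty(destnew);

-- Flatten the source bag back into ordinary tuples.
extrainsource = FOREACH onlyinsource GENERATE FLATTEN(sourcenew);

DUMP extrainsource;

-- Name versus position: these two statements select the same column.
ids_by_name = FOREACH sourcenew GENERATE ID;
ids_by_pos  = FOREACH sourcenew GENERATE $0;
```

Swapping the roles of sourcenew and destnew (INNER on destnew, IsEmpty(sourcenew)) should give the records that exist only in the destination.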