Sorting tuples in a bag based on multiple fields

yzuktlbb  posted on 2021-06-24  in Pig

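I am trying to sort the tuples in a bag in descending order based on three fields.
Example: suppose I have created the following bag by grouping: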

{(s,3,my),(w,7,pr),(q,2,je)}

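I want to sort the tuples in the grouped bag above on fields $0, $1 and $2. The sort should first compare $0 across all tuples and pick the tuple with the largest $0. If $0 is the same in all tuples, it should sort by $1, and so on.
The sort should be applied to every grouped bag in turn.
Suppose we have a bag like this: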

{(21,25,34),(21,28,64),(21,25,52)}

Then the required output is:

{(21,25,34),(21,25,52),(21,28,64)}

Please let me know if you need more clarification.


iibxawm4 #1

Order the tuples inside a nested FOREACH. That should work.
Input:

(1,s,3,my)
(1,w,7,pr)
(1,q,2,je)

A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
    -- sort the tuples of each bag by b, then c, then d (ascending)
    od = ORDER A BY b, c, d;
    GENERATE od;
};

Result of DUMP C (similar to your data):

({(1,s,3,my),(1,w,7,pr),(1,q,2,je)})

Output:

({(1,q,2,je),(1,s,3,my),(1,w,7,pr)})

This will work in all cases.
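One caveat worth noting: the schema above declares every column as chararray, so ORDER compares the values as strings, and '10' sorts before '9'. For numeric data such as the bag {(21,25,34),(21,28,64),(21,25,52)} from the question, a minimal sketch, assuming columns b, c and d hold integers, would declare them as int. The sketch also orders the grouped bag A directly and leaves out the intermediate projection; that is a simplification, not a requirement.

```pig
-- Assumption: b, c and d are integer columns; adjust the schema to the real data.
A = LOAD 'file' USING PigStorage(',') AS (a:chararray, b:int, c:int, d:int);
B = GROUP A BY a;
D = FOREACH B {
    -- compare on b first, then c, then d; append DESC to a key for descending order
    od = ORDER A BY b, c, d;
    GENERATE od;
};
DUMP D;
```

With int columns the keys are compared numerically, so ties on b fall through to c and then to d as the question describes.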
To generate the tuple with the highest values:

A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
    od = ORDER A BY b DESC, c DESC, d DESC;
    od1 = LIMIT od 1;
    GENERATE od1;
};
DUMP D;
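The script above returns each result as a bag holding a single tuple. If a plain tuple per group is preferred, one option is to unnest the limited bag with FLATTEN. This is a sketch of that variant, assuming the same input file and schema as above, and it orders the grouped bag A directly:

```pig
A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
D = FOREACH B {
    od = ORDER A BY b DESC, c DESC, d DESC;
    od1 = LIMIT od 1;
    -- FLATTEN unnests the one-element bag, so each group yields a plain tuple
    GENERATE FLATTEN(od1);
};
DUMP D;
```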

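To generate the tuple with the highest values when all three fields differ, and to return all the tuples when all the tuples are the same or when field 1 and field 2 are the same: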

A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
F = RANK C; -- rank is used to tell the bags apart when two tuples are the same
R = FOREACH F {
    dis = DISTINCT A;
    GENERATE rank_C, COUNT(dis) AS (cnt:long), A;
};
R3 = FILTER R BY cnt != 1; -- drop the bags in which all the tuples are the same
R4 = FOREACH R3 {
    fil1 = ORDER A BY b DESC, c DESC, d DESC;
    fil2 = LIMIT fil1 1;
    GENERATE rank_C, fil2;
}; -- find the largest tuple, except where all the tuples are the same
R5 = FILTER R BY cnt == 1; -- only the bags in which all the tuples are the same
R6 = FOREACH R5 GENERATE A; -- generate the required fields
F1 = FOREACH F GENERATE rank_C, FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1 and field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long), F1; -- if the count is 2, the tuples match on field 1 and field 2
F4 = FILTER F3 BY cnt1 == 2; -- separate those out
F5 = FOREACH F4 {
    DIS = DISTINCT F1;
    GENERATE FLATTEN(DIS);
};
F8 = JOIN F BY rank_C, F5 BY rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = CROSS R4, F5; -- the cross is used to generate the result when all the tuples are different
Z1 = FILTER Z BY R4::rank_C != F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2, R6, F9;
-- Z2: the highest tuple, when all three fields of the tuples are different
-- R6: the tuples, when all three fields of the tuples are the same
-- F9: the tuples, when two fields of the tuples are the same
DUMP res;
