从下面给出的示例数据中,删除name和age组合中的重复项并打印结果

ghg1uchk  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(352)

使用sparksql删除姓名和年龄组合上的重复项并打印结果

Name    Age Location
Rajesh  21  London
Suresh  28  California
Sam 26  Delhi
Rajesh  21  Gurgaon
Manish  29  Bengaluru
CREATE TABLE DETAILS
(
    NAME STRING,
    AGE INT,
    LOCATION STRING
)
ROW FORMAT DELIMITED 
FIELDS TERMINATE BY '\t'
STORED AS TEXTFILE;
======================================================================
LOAD DATA INPATH  '/FOLDER/TO/question.txt  INTO DETAILS;
======================================================================
CREATE TABLE DETAILS_FILTERED AS
SELECT NAME,AGE,LOCATION FROM DETAILS GROUP BY NAME,AGE;
lbsnaicq

lbsnaicq1#

使用行数或最小/最大聚合。如果你有像timestamp这样的列来选择最新/第一个记录pewr user+age,那就更好了。在这种情况下,您可以将它包含在row\u number的order\u by子句中。
Hive示例:

select Name,Age,Location
from 
(
select Name,Age,Location, 
       row_number() over(partition by NAME,AGE order by Location) rn --order by makes function more deterministic
from details
)
where rn=1 --filter duplicates

或者

select Name,Age,max(Location) Location  
  from details
 group by Name,Age --aggregate

相关问题