使用sparksql删除姓名和年龄组合上的重复项并打印结果
Name Age Location
Rajesh 21 London
Suresh 28 California
Sam 26 Delhi
Rajesh 21 Gurgaon
Manish 29 Bengaluru
CREATE TABLE DETAILS
(
NAME STRING,
AGE INT,
LOCATION STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATE BY '\t'
STORED AS TEXTFILE;
======================================================================
LOAD DATA INPATH '/FOLDER/TO/question.txt INTO DETAILS;
======================================================================
CREATE TABLE DETAILS_FILTERED AS
SELECT NAME,AGE,LOCATION FROM DETAILS GROUP BY NAME,AGE;
1条答案
按热度按时间lbsnaicq1#
使用行数或最小/最大聚合。如果你有像timestamp这样的列来选择最新/第一个记录pewr user+age,那就更好了。在这种情况下,您可以将它包含在row\u number的order\u by子句中。
Hive示例:
或者