如何找到创建最多帖子数的用户的显示名?

tcomlyy6  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(302)

我的两张table:
帖子.csv-

id
post_type
creationdate
score
viewcount
owneruserid
title
answercount
commentcount

用户.csv-

id
reputation
displayname
loc
age

我知道一些逻辑,我需要分组owneruserid,然后在posts.csv下计算id。
然后将其与user.csv链接,即通过post.csv中的owneruserid和users.csv中的id加入。
请提供一些帮助。

rmbxnbpk

rmbxnbpk1#

您已经列出了基于这些步骤的逻辑。请参阅下面的script.load data,join on owneruserid,id然后按owneruserid分组。foreach group生成posts的计数。按降序对最终结果排序并获得最顶层的行。

A = LOAD 'Posts.csv' USING PigStorage(',') AS (int id,chararray:post_type,chararray:creationdate,int:score,int:viewcount,int:owneruserid,chararray:title,int:answercount,int:commentcount);
B = LOAD 'Users.csv' USING PigStorage(',') AS (int:id,int:reputation,chararray:displayname,chararray:loc,int:age);
C = JOIN A BY (owneruserid), B BY (id);
D = GROUP C BY A.owneruserid;
E = FOREACH D GENERATE group as userid,B.displayname,COUNT(A.id) as TotalPosts;
F = ORDER E BY TotalPosts DESC;
G = LIMIT 1;

相关问题