Hadoop Pig with nested JSON

dsekswqp  posted on 2021-05-29  in  Hadoop

I have a list of movies, each with the ratings users gave it.

{"_id":59607,"title":"King Corn (2007)",
     "genres":["Documentary"],
     "ratings":[ {"userId":1860,"rating":3},
                {"userId":9970,"rating":3.5},
                {"userId":16929,"rating":1.5},
                {"userId":23473,"rating":4},
                {"userId":23733,"rating":4},
                {"userId":27584,"rating":3},
                {"userId":28232,"rating":4},
                {"userId":29482,"rating":3},
                {"userId":40976,"rating":5},
                {"userId":44631,"rating":4},
                {"userId":47613,"rating":3},
                {"userId":49763,"rating":3},
                {"userId":58160,"rating":4.5},
                {"userId":62249,"rating":3},
                {"userId":65923,"rating":4},
                {"userId":67507,"rating":4},
                {"userId":68259,"rating":3.5},
                {"userId":70331,"rating":5},
                {"userId":71420,"rating":3.5}
        ]
    }

I need to count how many ratings each user has submitted. Here is my attempt to get at the ratings.

a = load '/movies_1m.json' using JsonLoader('id:int, title : chararray, genres : { ( genre : chararray ) }, ratings: { ( userId : int, rating: float) } ');

Then

b = FOREACH a GENERATE FLATTEN(ratings);

DESCRIBE b then shows:

b: {ratings::userId: int,ratings::rating: float}

To do the count, I just need to get at the userId inside ratings, but that is exactly where it fails. I tried this:

c = FOREACH b GENERATE COUNT(ratings);

This gives me an error.
I need something like this:

{userId: int, rating: float}
xyhw6mcr  #1

You need to GROUP before you can COUNT, because COUNT is an aggregate operation that works on a bag.

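-- One row per (userId, rating) pair.
b = FOREACH a GENERATE FLATTEN(ratings);
-- Collect each user's rows into a bag.
gr = GROUP b BY ratings::userId;
-- $1 is the bag of grouped rows (also addressable as b).
c = FOREACH gr GENERATE group, COUNT($1);
-- \d is the Grunt shell shortcut for DUMP.
\d c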
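
For intuition about why the GROUP step is needed, here is a minimal plain-Python sketch (not Pig) of the shape the grouped relation takes. The names `flattened`, `grouped` and `counts` are illustrative only, and the three rows are the first three ratings from the sample record in the question.

```python
from collections import defaultdict

# Rows as they look after FLATTEN(ratings): one (userId, rating) tuple per row.
flattened = [(1860, 3.0), (9970, 3.5), (16929, 1.5)]

# GROUP ... BY userId: each key maps to a "bag" (here a list) of its rows.
grouped = defaultdict(list)
for user_id, rating in flattened:
    grouped[user_id].append((user_id, rating))

# COUNT operates on the bag attached to each group, not on a single row.
counts = {user_id: len(bag) for user_id, bag in grouped.items()}
print(counts)
```

Before grouping there is no bag to count, only individual rows, which is roughly why calling COUNT directly on the flattened relation does not work.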

Output
Note that no user appears more than once in your sample, so every count is 1.

(1860,1)
(9970,1)
(16929,1)
(23473,1)
(23733,1)
(27584,1)
(28232,1)
(29482,1)
(40976,1)
(44631,1)
(47613,1)
(49763,1)
(58160,1)
(62249,1)
(65923,1)
(67507,1)
(68259,1)
(70331,1)
(71420,1)
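
To see counts above 1, the input needs more than one movie. The sketch below is a plain-Python analogue of the same flatten, group, count flow, which may help when sanity-checking expected results on a tiny sample without a cluster. The second record (`_id` 1, "Example Movie") is invented for illustration and is not from the dataset, and only the first two ratings of the real record are kept for brevity.

```python
import json
from collections import Counter

# Two JSON records, one per line. The first is trimmed from the question's
# sample; the second is hypothetical, added so that one user appears twice.
lines = [
    '{"_id":59607,"title":"King Corn (2007)","genres":["Documentary"],'
    '"ratings":[{"userId":1860,"rating":3},{"userId":9970,"rating":3.5}]}',
    '{"_id":1,"title":"Example Movie","genres":["Drama"],'
    '"ratings":[{"userId":1860,"rating":4}]}',
]

movies = [json.loads(line) for line in lines]

# Flatten: one userId per rating, across all movies.
user_ids = [r["userId"] for m in movies for r in m["ratings"]]

# Group by userId and count in one step.
ratings_per_user = Counter(user_ids)

for user_id, n in sorted(ratings_per_user.items()):
    print((user_id, n))
```

With this made-up input, user 1860 has two ratings and user 9970 has one.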
