Pig: summing column b over rows that share the same value in column a

egdjgwm8  posted on 2021-06-21  in Pig

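I am trying to count the number of tweets for a given hashtag over a period of time, but I get an error when I try to use the built-in SUM function.
Example: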

data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float, hashtag:chararray, count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by count;
X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count);

Error:

<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

b5lpy0ml  #1

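I think you are passing the wrong relation to SUM: you should sum over NBLNabilVoto_count rather than over the data relation. I also have a question: why are you grouping by count?
If what you want is the total for all of your tweets with the hashtag NBLNabilaVoto,
I think the code should look like this: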

data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float, hashtag:chararray, count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
-- GROUP ... ALL puts every filtered row into a single group
NBLNabilVoto_group = GROUP NBLNabilVoto_count ALL;
-- the bag inside the group is named after the relation that was grouped
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);

k10s72fa  #2

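I am not sure the code does what you think or want it to do, but the error you are getting is because you are applying SUM to the wrong thing. You need to write
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);
Here NBLNabilVoto_group is the grouped relation and NBLNabilVoto_count is the name of the bag of tuples inside it (the bag takes the name of the relation that was grouped).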
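
One way to check which name SUM should reference is to print the schema of the grouped relation with DESCRIBE. The following is a minimal sketch, assuming the same tab-separated tweets_2.csv and schema as in the question. The aliases filtered, grouped and total are hypothetical names chosen for illustration.

```pig
-- Sketch only: same file and schema as the question, hypothetical aliases.
data = LOAD 'tweets_2.csv' USING PigStorage('\t')
       AS (date:float, hashtag:chararray, count:int,
           year:int, month:int, day:int, hour:int, minute:int, second:int);

filtered = FILTER data BY hashtag == 'NBLNabilaVoto';
grouped  = GROUP filtered ALL;

-- The schema shows a field called `group` and a bag called `filtered`.
-- There is no bag called `data` in this relation.
DESCRIBE grouped;

-- SUM takes a bag, so it has to reference the bag by its name in `grouped`.
total = FOREACH grouped GENERATE SUM(filtered.count) AS total_count;
DUMP total;
```

Inside a FOREACH over the grouped relation, data.count does not refer to a bag of that relation, which is most likely why Pig cannot match it to any SUM implementation and asks for an explicit cast.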


7cwmlq89  #3

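First load the data, then filter it down to the time interval you want to process. Group the records by hashtag. Then use the COUNT() function to count the number of tweets for each hashtag.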
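
The steps above can be sketched roughly as follows, assuming the schema from the question. The interval (June 2021) and the aliases in_range, by_tag and per_tag are hypothetical. Whether COUNT or SUM is the right aggregate depends on whether each row stands for one tweet or already carries a tweet count in the count column, so the sketch computes both.

```pig
-- Sketch only: schema taken from the question, interval and aliases are assumptions.
data = LOAD 'tweets_2.csv' USING PigStorage('\t')
       AS (date:float, hashtag:chararray, count:int,
           year:int, month:int, day:int, hour:int, minute:int, second:int);

-- Hypothetical interval: keep only rows from June 2021.
in_range = FILTER data BY year == 2021 AND month == 6;

-- One group per distinct hashtag; each group holds a bag named `in_range`.
by_tag = GROUP in_range BY hashtag;

-- COUNT gives the number of rows per hashtag,
-- SUM adds up the `count` column over the same rows.
per_tag = FOREACH by_tag GENERATE
              group               AS hashtag,
              COUNT(in_range)     AS n_rows,
              SUM(in_range.count) AS total_count;

DUMP per_tag;
```

To restrict the result to a single hashtag, a condition such as hashtag == 'NBLNabilaVoto' can be added to the FILTER step.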
