Filter operation not working in Pig, not sure what's going on?

roqulrg3 posted on 2021-06-24 in Pig

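I'm stuck trying to extract tweets for a specific location with Pig, using lat/long bounds.
I've run the script below, and it works up until I filter on lat/long, and then it dies.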

My script:

REGISTER 'hdfs/json-simple-1.1.jar';
REGISTER 'hdfs/elephant-bird-hadoop-compat-4.1.jar';
REGISTER 'hdfs/elephant-bird-pig-4.1.jar';

-- this is just one day, there is a bunch more data, once the script is working well
-- /data/ProjectDataset/statuses.log.2014-12-31.gz
tweets_all = LOAD '/data/ProjectDataset/statuses.log.2014-12-3*' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- JUST THE COORDINATES
-- to get the geo locations of tweets
tweets_all = FOREACH tweets_all GENERATE FLATTEN(json#'created_at') as time_stamp:chararray, FLATTEN(json#'id') as id:chararray, FLATTEN(json#'coordinates') as (coords_map:map[]);

-- remove duplicates
tweets = DISTINCT tweets_all;

-- filter for tweets with geo tags
filtered = FILTER tweets BY (coords_map IS NOT NULL);

-- parse the date time and unpack the geo data
locs1 = foreach filtered generate ToDate(time_stamp, 'EEE MMM dd HH:mm:ss Z yyyy') as time_stamp, coords_map#'coordinates' as coordinates:bag{t1:tuple(f1:double, f2:double)}, id as id;

-- reference longitude and latitude
locs2 = foreach locs1 generate BagToTuple(coordinates).$0 as longitude:double, BagToTuple(coordinates).$1 as latitude:double, id, time_stamp;

-- filter for tweets with geo tags with longs between (-70.0 and -80.0) and lats between (35.0 and 45.0)
geo_filtered = FILTER locs2 BY (longitude > 35) and (longitude < 45) and (latitude > -80) and (latitude < -70);

-- look at the top results
tops = limit geo_filtered 10;
dump tops;

It works up to locs2, since tops = limit locs2 5; followed by dump tops; returns:
(-81.9536,34.9307,549701401182351360,2014-12-29T23:00:01.000Z)
(-46.455577,-23.505258,549701401186938832,2014-12-29T23:00:01.000Z)
(179.0,81.0,549701401121990892,2014-12-29T23:00:01.000Z)
(-4.186111,39.742536,549701401203734812,2014-12-29T23:00:01.000Z)
(12.094579,57.928088,549701401207930880,2014-12-29T23:00:01.000Z)
Also, running describe locs2 gives:
locs2: {longitude: double,latitude: double,id: chararray,time_stamp: datetime}
It clearly doesn't like the filter operation on locs2, but I don't know why.
Thanks in advance!
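
One thing that can be checked independently of the failure: in the script as posted, the filter compares longitude against the range 35 to 45 and latitude against the range -80 to -70, while the comment above it describes longitudes between -80 and -70 and latitudes between 35 and 45. Below is a minimal sketch of the bounding-box filter with the bounds matching that comment. The file name sample_coords.tsv and the relation name coords are hypothetical, used only to keep the example self-contained; in the script above the same FILTER would be applied to locs2. Whether this mismatch is related to the job dying is not established here.

```pig
-- Hypothetical input: a tab-separated file with longitude, latitude and id.
-- 'sample_coords.tsv' is an assumed name for illustration only.
coords = LOAD 'sample_coords.tsv' USING PigStorage('\t')
         AS (longitude:double, latitude:double, id:chararray);

-- Bounds as described in the script's comment:
-- longitude between -80 and -70, latitude between 35 and 45.
geo_filtered = FILTER coords BY (longitude > -80.0) AND (longitude < -70.0)
                            AND (latitude > 35.0) AND (latitude < 45.0);

-- Look at the first few matching rows.
tops = LIMIT geo_filtered 10;
DUMP tops;
```

None of the five sample rows shown above falls inside this box (the first, at longitude -81.9536 and latitude 34.9307, lies just outside it), so an empty result on that sample alone would be expected.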
