Sentiment analysis of Twitter data using Hadoop and Pig

nlejzf6q  posted on 2021-05-29  in  Hadoop

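The tweets from Twitter are stored in HDFS on Hadoop. They need to be processed for sentiment analysis. The tweets in HDFS are in Avro format, so they need to be processed with a JSON loader, but the Pig script does not read any of them. After I changed the jar files, the Pig script shows a failure message instead.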
With the following jar files registered, the Pig script fails.
REGISTER '/home/cloudera/desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/desktop/json-simple-3.1.0.jar';
With this other set of jar files the script does not fail, but no data is read either.
REGISTER '/home/cloudera/desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/desktop/json-simple-1.1.jar';
These are all the Pig script commands I used:

tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;

B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;

tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;

dictionary = load ' /user/cloudera/OutputData/AFINN.txt' using PigStorage('\t') AS(word:chararray,rating:int);

word_rating = join tokens by word left outer, dictionary by word using 'replicated';

describe word_rating;

rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;

word_group = group rating by (id,tweets);

avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;

positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;

negative_tweets = filter avg_rate by tweet_rating<=0;

DUMP negative_tweets;

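Error when dumping tweets with the commands above and the first set of jar files:
Input(s): Failed to read data from "/user/cloudera/outputdata/tweets"
Output(s): Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp37889715"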

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1556902124324_0001

2019-05-03 09:59:09,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log

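Output when dumping tweets with the commands above and the second set of jar files:
Input(s): Successfully read 0 records (5178477 bytes) from: "/user/cloudera/outputdata/tweets"
Output(s): Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp479037703"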

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1556902124324_0002

2019-05-03 10:01:05,417 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

The expected output is the tweets sorted into positive and negative, but I get errors instead. Please help. Thank you.

inn6fuwd #1

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException usually indicates a syntax error in the Pig script.
The AS keyword in a LOAD statement usually requires a schema, and myMap in your LOAD statement is not a valid schema.
See https://stackoverflow.com/a/12829494/8886552 for a JsonLoader example.
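As a rough sketch of what a LOAD statement with an explicit schema could look like here: the jar paths and the input path are copied from the question, the field is declared as a Pig map, and the alias first_tweets is only a hypothetical name used for a quick check. This is an assumption about how the script might be adjusted, not a confirmed fix.

```pig
-- Jar paths as given in the question (the second set, which did not fail).
REGISTER '/home/cloudera/desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/desktop/json-simple-1.1.jar';

-- Declare the loaded field as a map, so that myMap#'id' lookups have a schema.
tweets = LOAD '/user/cloudera/OutputData/tweets'
         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
         AS (myMap:map[]);

-- Inspect the schema and a handful of records before running the rest.
DESCRIBE tweets;
first_tweets = LIMIT tweets 5;  -- hypothetical alias, for a quick check only
DUMP first_tweets;
```

If DUMP still returns 0 records, it may be worth checking whether the files under /user/cloudera/OutputData/tweets are plain JSON text, since the question describes them as Avro.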
