Parsing Amazon electronics reviews with Apache Pig

apeeds0o posted on 2021-05-29 in Hadoop

I have loaded the 5-core subset (1,689,188 reviews) of the Amazon electronics reviews dataset (http://jmcauley.ucsd.edu/data/amazon/) into Apache Pig on my Cloudera VM.
I followed this other question: "Apache Pig error when dumping JSON data".
Sample review:
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
grunt> reviews = LOAD 'amazon/amazon-pro/reviews.json' USING org.apache.pig.builtin.JsonLoader('id:chararray, asin:int, reviewerName: chararray, helpful:(int), reviewText:chararray, overall:float, summary:chararray, time:int, reviewTime:chararray');

grunt> viewReview = LIMIT reviews 1;

grunt> DUMP viewReview;

I get the following error:

2016-11-17 08:05:33,797 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2016-11-17 08:05:35,897 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-11-17 08:05:36,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2
2016-11-17 08:05:36,532 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2016-11-17 08:05:37,577 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2016-11-17 08:05:38,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-11-17 08:05:38,225 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2016-11-17 08:05:38,230 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job974442700781595171.jar
2016-11-17 08:05:57,665 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job974442700781595171.jar created
2016-11-17 08:05:57,754 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2016-11-17 08:05:58,090 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2016-11-17 08:05:58,347 [JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2016-11-17 08:05:58,614 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.df.interval is deprecated. Instead, use fs.df.interval
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.max.objects is deprecated. Instead, use dfs.namenode.max.objects
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - hadoop.native.lib is deprecated. Instead, use io.native.lib.available
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.data.dir is deprecated. Instead, use dfs.datanode.data.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.dir is deprecated. Instead, use dfs.namenode.name.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.dir is deprecated. Instead, use dfs.namenode.checkpoint.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.block.size is deprecated. Instead, use dfs.blocksize
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.access.time.precision is deprecated. Instead, use dfs.namenode.accesstime.precision
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.min is deprecated. Instead, use dfs.namenode.replication.min
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.edits.dir is deprecated. Instead, use dfs.namenode.edits.dir
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.considerLoad is deprecated. Instead, use dfs.namenode.replication.considerLoad
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.balance.bandwidthPerSec is deprecated. Instead, use dfs.datanode.balance.bandwidthPerSec
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.safemode.threshold.pct is deprecated. Instead, use dfs.namenode.safemode.threshold-pct
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.http.address is deprecated. Instead, use dfs.namenode.http-address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.dir.restore is deprecated. Instead, use dfs.namenode.name.dir.restore
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.client.keystore.resource is deprecated. Instead, use dfs.client.https.keystore.resource
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.backup.address is deprecated. Instead, use dfs.namenode.backup.address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.backup.http.address is deprecated. Instead, use dfs.namenode.backup.http-address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.permissions is deprecated. Instead, use dfs.permissions.enabled
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.safemode.extension is deprecated. Instead, use dfs.namenode.safemode.extension
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.datanode.max.xcievers is deprecated. Instead, use dfs.datanode.max.transfer.threads
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.need.client.auth is deprecated. Instead, use dfs.client.https.need-auth
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.interval is deprecated. Instead, use dfs.namenode.replication.interval
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.edits.dir is deprecated. Instead, use dfs.namenode.checkpoint.edits.dir
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.write.packet.size is deprecated. Instead, use dfs.client-write-packet-size
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.permissions.supergroup is deprecated. Instead, use dfs.permissions.superusergroup
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - topology.script.number.args is deprecated. Instead, use net.topology.script.number.args
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.secondary.http.address is deprecated. Instead, use dfs.namenode.secondary.http-address
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.period is deprecated. Instead, use dfs.namenode.checkpoint.period
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - topology.node.switch.mapping.impl is deprecated. Instead, use net.topology.node.switch.mapping.impl
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-11-17 08:06:00,217 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-11-17 08:06:00,270 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 11
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201611170800_0001
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases r,reviews
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: reviews[1,10],r[2,4] C: R:
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201611170800_0001
2016-11-17 08:09:30,985 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-11-17 08:09:31,500 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201611170800_0001 has failed! Stop running all dependent jobs
2016-11-17 08:09:31,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-11-17 08:09:31,596 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
at [Source: java.io.ByteArrayInputStream@67de0c09; line: 1, column: 43]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.JsonNumericParserBase._parseNumericValue(JsonNumericParserBase.java:399)
at org.codehaus.jackson.impl.JsonNumericParserBase.getIntValue(JsonNumericParserBase.java:254)
at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:189)
at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
at org.apache.hadoop.map
2016-11-17 08:09:31,597 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2016-11-17 08:09:31,602 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2016-11-17 08:05:37 2016-11-17 08:09:31 LIMIT

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_201611170800_0001 r,reviews Message: Job failed!

Input(s):
Failed to read data from "hdfs://localhost.localdomain:8020/user/cloudera/amazon/amazon-pro/reviews.json"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201611170800_0001 -> null,
null

2016-11-17 08:09:31,602 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-11-17 08:09:31,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias r
Details at logfile: /home/cloudera/pig_1479349681179.log

Answer 1 (omtl5h9j)

reviews = LOAD '/user/cloudera/review.json' USING org.apache.pig.builtin.JsonLoader('reviewerID:chararray, asin:chararray,reviewerName: chararray, helpful:{t:(score:int)}, reviewText:chararray, overall:chararray, summary:chararray, Time:chararray, reviewTime:chararray');

DUMP reviews;

Answer 2 (fcg9iug3)

I think the problem is in your schema definition for helpful. As in the other answer, it should be:

..., helpful:{t:(score:int)}, ...
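
One more thing that may be worth checking: the stack trace in the question shows an int accessor (getIntValue) hitting a string token, and the other answer also changes asin from int to chararray. A quick way to see which fields are really strings is to inspect one record outside Pig. The following is only a sketch in plain Python; the type names it prints are a rough, hand-written mapping (an assumption for illustration, not something Pig or JsonLoader produces), and the reviewText value is shortened.

```python
import json

# Sample record copied from the question (reviewText shortened for brevity).
SAMPLE = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"reviewerName": "J. McDonald", "helpful": [2, 3], '
    '"reviewText": "I bought this for my husband who plays the piano.", '
    '"overall": 5.0, "summary": "Heavenly Highway Hymns", '
    '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}'
)


def pig_type_hint(value):
    """Return a rough Pig-style type name for a decoded JSON value.

    Illustrative mapping only; it is not part of Pig.
    """
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int" if -2**31 <= value < 2**31 else "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, list):
        inner = sorted({pig_type_hint(item) for item in value})
        return "array<" + ",".join(inner) + ">"
    if isinstance(value, dict):
        return "object"
    return "null"


def describe(line):
    """Parse one JSON line and return (field name, type hint) pairs in file order."""
    record = json.loads(line)
    return [(name, pig_type_hint(value)) for name, value in record.items()]


if __name__ == "__main__":
    for name, hint in describe(SAMPLE):
        print(name + ": " + hint)
```

For the sample record this reports asin as chararray (the value is quoted in the JSON, with leading zeros) and helpful as an array of ints, which is consistent with both answers declaring asin:chararray rather than asin:int.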
