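I am trying to analyze Twitter data using Cloudera. So far I can stream Twitter data into HDFS through Flume, but when I try to query the data with SQL through a Hive table I get the following exception: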
java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
Does this mean the data was loaded into Hive but cannot be queried, or that it was never loaded into Hive at all?
My flume.conf file is:
TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS
# TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = FileChannel
TwitterAgent.sources.Twitter.consumerKey = nmmRpbWjQPAViWlJLjkJuq7mO
TwitterAgent.sources.Twitter.consumerSecret =*****
TwitterAgent.sources.Twitter.accessToken =*****
TwitterAgent.sources.Twitter.accessTokenSecret =*****
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100
# TwitterAgent.sources.Twitter.keywords = Canada, TTC,ttc, Toronto, Free, and, Apache,city, City, Hadoop, Mapreduce, hadooptutorial, Hive, Hbase, MySql
TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume-ng/checkpoint/
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume-ng/data/
I added the JAR file "hive-serdes-1.0-SNAPSHOT.jar":
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
My .avsc file is located at '/home/cloudera/twitterDataAvroSchema.avsc' and contains the following:
{"type":"record",
"name":"Doc",
"doc":"adoc",
"fields":[{"name":"id","type":"string"},
{"name":"user_friends_count","type":["int","null"]},
{"name":"user_location","type":["string","null"]},
{"name":"user_description","type":["string","null"]},
{"name":"user_statuses_count","type":["int","null"]},
{"name":"user_followers_count","type":["int","null"]},
{"name":"user_name","type":["string","null"]},
{"name":"user_screen_name","type":["string","null"]},
{"name":"created_at","type":["string","null"]},
{"name":"text","type":["string","null"]},
{"name":"retweet_count","type":["long","null"]},
{"name":"retweeted","type":["boolean","null"]},
{"name":"in_reply_to_user_id","type":["long","null"]},
{"name":"source","type":["string","null"]},
{"name":"in_reply_to_status_id","type":["long","null"]},
{"name":"media_url_https","type":["string","null"]},
{"name":"expanded_url","type":["string","null"]}
]
}
The statement I used to create the Hive table:
CREATE TABLE my_tweets
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///home/cloudera/twitterDataAvroSchema.avsc') ;
I loaded the data into the Hive table with the following command:
LOAD DATA INPATH '/user/hive/warehouse/tweets/FlumeData.*' OVERWRITE INTO TABLE my_tweets;
== output ===
Loading data to table robin.my_tweets
Table robin.my_tweets stats: [numFiles=1, numRows=0, totalSize=421380, rawDataSize=0]
OK
Time taken: 1.928 seconds
Running a SQL query against the table produces the following error:
hive> select user_location from robin.my_tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
Time taken: 1.247 seconds
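One way to narrow this down: the LOAD DATA output above reports numFiles=1 and totalSize=421380, so a file was moved into the table location, and the failure happens only when the AvroSerDe reads it. A minimal sketch, assuming the file has first been copied to the local filesystem (for example with `hdfs dfs -get`), is to check whether it starts with the four magic bytes that every Avro object container file begins with. The function name and the command-line usage are illustrative, not part of any Hive or Flume tooling.

```python
import sys

# Every Avro object container file starts with these four bytes: 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(path):
    """Return True if the local file at `path` begins with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC


if __name__ == "__main__":
    # Usage (paths are whatever local copies you pass in):
    #   python check_avro.py <local-copy-of-FlumeData-file> ...
    for p in sys.argv[1:]:
        status = "avro-header" if looks_like_avro_container(p) else "no-avro-header"
        print(p, status)
```

A missing header would suggest the file is not an Avro container at all (plain text or JSON, say). A present header only shows that the file begins like one; the body can still be damaged further in, so this check rules out one cause without confirming the file is readable.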
I am using Cloudera version 2.6.0-cdh5.5.0.
Any help with this issue would be greatly appreciated.
Thanks,
Robin