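I'm trying to use elephant-bird to query some sample protobuf data. I'm using the addressbook example: I serialized a few fake address books to files and put them in HDFS under /user/foo/data/elephant-bird/addressbooks/. The query returns no results.
I set up the table and query like this: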
add jar /home/foo/downloads/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar;
create external table addresses
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook")
STORED AS
-- elephant-bird provides an input format for use with hive
INPUTFORMAT "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
-- placeholder as we will not be writing to this table
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/user/foo/data/elephant-bird/addressbooks/';
describe formatted addresses;
OK
# col_name data_type comment
person array{ struct{ name:string, id:int, email:string, phone:array {struct {number:string, type:string}}}} from deserializer
byteData binary from deserializer
# Detailed Table Information
Database: default
Owner: foo
CreateTime: Tue Oct 28 13:49:53 PDT 2014
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://foo:8020/user/foo/data/elephant-bird/addressbooks
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1414529393
# Storage Information
SerDe Library: com.twitter.elephantbird.hive.serde.ProtobufDeserializer
InputFormat: com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.class com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook
serialization.format 1
Time taken: 0.421 seconds, Fetched: 29 row(s)
When I try to select the data, it returns no results (it doesn't seem to read any rows):
select count(*) from addresses;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_1413311929339_0061, Tracking URL = http://foo:8088/proxy/application_1413311929339_0061/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1413311929339_0061
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2014-10-28 13:50:37,674 Stage-1 map = 0%, reduce = 0%
2014-10-28 13:50:51,055 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 2.14 sec
2014-10-28 13:50:52,152 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 2.14 sec
MapReduce Total cumulative CPU time: 2 seconds 140 msec
Ended Job = job_1413311929339_0061
MapReduce Jobs Launched:
Job 0: Reduce: 1 Cumulative CPU: 2.14 sec HDFS Read: 0 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 140 msec
OK
0
Time taken: 37.519 seconds, Fetched: 1 row(s)
I see the same thing if I create a non-external table, or if I explicitly load the data into the external table.
Version information for my setup:
Thrift 0.7
protobuf: libprotoc 2.5.0
hadoop:
Hadoop 2.5.0-cdh5.2.0
Subversion http://github.com/cloudera/hadoop -r e1f20a08bde76a33b79df026d00a0c91b2298387
Compiled by jenkins on 2014-10-11T21:00Z
Compiled with protoc 2.5.0
From source with checksum 309bccd135b199bdfdd6df5f3f4153d
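Update:
I'm seeing this error in the logs. My data in HDFS is just raw protobuf (no compression). I'm trying to figure out whether that is the problem, and whether raw binary protobufs can be read at all.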
Error: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:346)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:293)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:407)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:560)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:332)
... 11 more
Caused by: java.io.IOException: No codec for file hdfs://foo:8020/user/foo/data/elephantbird/addressbooks/1000AddressBooks-1684394246.bin found
at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:176)
at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
at com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat.createRecordReader(RawMultiInputFormat.java:36)
at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:256)
at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:121)
at com.twitter.elephantbird.mapred.input.DeprecatedFileInputFormatWrapper.getRecordReader(DeprecatedFileInputFormatWrapper.java:55)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 16 more
1 Answer
Did you ever solve this?
I ran into the same problem you describe.
And yes, you're right: I found that raw binary protobuf cannot be read directly.
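As a rough illustration of why a file of bare serialized messages is hard for a record-oriented reader: protobuf messages are not self-delimiting, so something has to mark where each record ends. The sketch below uses a varint length prefix per record, which is one generic way to do that. This is not elephant-bird's on-disk format, and the helper names (`encode_varint`, `decode_varint`, `frame`, `unframe`) are made up for the example; the byte strings stand in for already-serialized `AddressBook` messages.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (protobuf style)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte, high bit clear
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def frame(messages):
    """Concatenate messages, each preceded by its length as a varint."""
    return b"".join(encode_varint(len(m)) + m for m in messages)


def unframe(buf: bytes):
    """Split a length-prefixed buffer back into the original messages."""
    pos = 0
    records = []
    while pos < len(buf):
        size, pos = decode_varint(buf, pos)
        records.append(buf[pos:pos + size])
        pos += size
    return records


if __name__ == "__main__":
    # Placeholder payloads standing in for serialized protobuf messages.
    payloads = [b"\x0a\x03bob", b"", b"x" * 300]
    assert unframe(frame(payloads)) == payloads
```

Without some framing like this (or a container format the input format recognizes), a reader has no way to tell where one message stops and the next begins.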
This is the question I asked about it: "Use elephant-bird with Hive to read protobuf data".
Hope that helps.
Best regards