hadoop: Write data that can be read by ProtobufPigLoader from Elephant Bird

Posted by b0zn9rqh on 2021-06-21 in Pig


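For one of my projects I want to analyze about 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the Elephant Bird library. However, it is not entirely clear to me how to write the files to HDFS so that they can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script: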

-- this includes the elephant bird library
register ../fs-c/lib/*.jar;
register ../fs-c/*.jar;
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');

Import tool (parts of it):

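import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Open the target file (overwriting it if it exists) and wrap it in a block writer
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true) // true = overwrite
  new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
}

val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish() // write out the last, partially filled block
writer.close()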

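The import tool works fine. I had a few problems with ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader does not work. The problem I am stuck on is that DUMP raw_data; returns Unable to open iterator for alias raw_data and ILLUSTRATE raw_data; returns No (valid) input data found!.
To me it looks as if data written by ProtobufBlockWriter cannot be read by ProtobufPigLoader. But what should I use instead? How can an external tool write data to HDFS so that ProtobufPigLoader can process it?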
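Before concluding that the block format is at fault, it may help to rule out the simpler case that the load glob matches no files, or only empty ones. The following is a minimal sketch under that assumption, using only the Hadoop FileSystem API; the object name InputCheck and the helper matchingFiles are hypothetical, and the pattern is passed in as an argument because the real path is elided in the script above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper, not part of Elephant Bird or of the original tool.
object InputCheck {
  // Returns (path, length in bytes) for every file matching the glob pattern.
  def matchingFiles(pattern: String,
                    conf: Configuration = new Configuration()): Seq[(String, Long)] = {
    val globPath = new Path(pattern)
    val fs = FileSystem.get(globPath.toUri, conf)
    // globStatus can return null when nothing matches, so guard against it
    val statuses = Option(fs.globStatus(globPath)).getOrElse(Array.empty[FileStatus])
    statuses.toSeq.map(s => (s.getPath.toString, s.getLen))
  }

  def main(args: Array[String]): Unit = {
    val files = matchingFiles(args(0))
    if (files.isEmpty) println(s"no files match ${args(0)}")
    files.foreach { case (path, len) => println(s"$path\t$len bytes") }
  }
}
```

If the files are present and non-empty, the mismatch is more likely in how the loader expects the data to be framed or compressed, which this check cannot tell apart.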
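Alternative question: what should I use instead? How can I write fairly large objects to Hadoop so that Pig can consume them? The objects are not very complex, but each contains a large list of sub-objects (a repeated field in Protobuf).
I want to avoid any text format or JSON because they are simply too large for my data. I expect they would inflate the data by a factor of 2 or 3 (lots of integers, and lots of byte strings that I would need to encode as Base64).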
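To put a number on the Base64 part of that estimate: Base64 with padding emits 4 characters for every 3 input bytes, so byte strings alone grow by about a third. A small sketch using only the JDK; the object name Base64Overhead and the 20-byte example size are illustrative assumptions, not values taken from the actual data.

```scala
import java.util.Base64

// Illustrative only: the 20-byte length below is an assumed example size.
object Base64Overhead {
  // Length of the padded Base64 encoding of n raw bytes:
  // every started group of 3 bytes becomes 4 characters.
  def encodedLength(n: Int): Int = 4 * ((n + 2) / 3)

  def main(args: Array[String]): Unit = {
    val raw = new Array[Byte](20)
    val encoded = Base64.getEncoder.encodeToString(raw)
    // 20 raw bytes become 28 Base64 characters
    println(s"raw=${raw.length} bytes, base64=${encoded.length} chars")
    assert(encoded.length == encodedLength(raw.length))
  }
}
```

Base64 by itself therefore accounts for a factor of roughly 4/3 on the byte strings; the rest of an overall 2x to 3x growth would have to come from other parts of a text encoding, such as decimal integers and repeated field names.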
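I also want to avoid normalizing the data, that is, attaching the id of the main object to each of the sub-objects (which is what is done now), because this increases the space consumption as well and makes joins necessary in later processing.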
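Updates:
I did not use the generated Protobuf loader classes; I use the reflection-based type loader instead.
The Protobuf classes are in a jar that is registered. DESCRIBE shows the types correctly.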

hadoop apache-pig elephantbird

Source: https://stackoverflow.com/questions/9265040/write-data-that-can-be-read-by-protobufpigloader-from-elephant-bird
