Hadoop Python job on snappy files produces zero-size output

qacovj5a · posted 2021-06-03 in Hadoop

When I run wordcount.py (the Python example from http://mrjob.readthedocs.org/en/latest/guides/quickstart.html#writing-your-first-job) with Hadoop streaming on a plain text file, it produces output. But when I run the same job on a .snappy file, the output it produces is zero-size.
Options tried:

[testgen word_count]# cat mrjob.conf 
runners:
  hadoop: # this will work for both hadoop and emr
    jobconf:
      mapreduce.task.timeout: 3600000
      #mapreduce.max.split.size: 20971520
      #mapreduce.input.fileinputformat.split.maxsize: 102400
      #mapreduce.map.memory.mb: 8192
      mapred.map.child.java.opts: -Xmx4294967296
      mapred.child.java.opts: -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      java.library.path: /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/

      # "true" must be a string argument, not a boolean! (#323)
      #mapreduce.output.compress: "true"
      #mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec

[testgen word_count]#

Command:

[testgen word_count]# python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf 
creating tmp directory /tmp/word_count2.root.20151111.113113.369549
writing wrapper script to /tmp/word_count2.root.20151111.113113.369549/setup-wrapper.sh
Using Hadoop version 2.5.0
Copying local files into hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

Detected hadoop configuration property names that do not match hadoop version 2.5.0:
They have been translated as follows
 mapred.map.child.java.opts: mapreduce.map.java.opts
HADOOP: packageJobJar: [/tmp/hadoop-root/hadoop-unjar3623089386341942955/] [] /tmp/streamjob3671127555730955887.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201511021537_70340
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH//bin/hadoop job  -Dmapred.job.tracker=logicaljt -kill job_201511021537_70340
HADOOP: Tracking URL: http://xxxxx_70340
HADOOP:  map 0%  reduce 0%
HADOOP:  map 100%  reduce 0%
HADOOP:  map 100%  reduce 11%
HADOOP:  map 100%  reduce 97%
HADOOP:  map 100%  reduce 100%
HADOOP: Job complete: job_201511021537_70340
HADOOP: Output: hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output
Counters from step 1:
  (no counters found)
Streaming final output from hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549/output

removing tmp directory /tmp/word_count2.root.20151111.113113.369549
deleting hdfs:///user/root/tmp/mrjob/word_count2.root.20151111.113113.369549 from HDFS
[testgen word_count]#

No errors are thrown, the job reports success, and I have verified the job configuration in the job stats.
Is there any other way to troubleshoot this?
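(A sketch of basic checks, not part of the original post: the command names are standard Hadoop/mrjob, and the paths are taken from the log above.)

# list the native libraries Hadoop can load; the "snappy:" line should show true and point at libsnappy
hadoop checknative -a

# keep and inspect the job output instead of letting mrjob stream and delete it
python word_count2.py -r hadoop hdfs:///input.snappy --conf mrjob.conf --no-output --output-dir hdfs:///user/root/wc_out
hdfs dfs -ls hdfs:///user/root/wc_out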


00jrzges1#

Thanks for your input, Yann, but in the end inserting the following line into the job script solved the problem.

HADOOP_INPUT_FORMAT='<org.hadoop.snappy.codec>'
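For reference, HADOOP_INPUT_FORMAT is a class-level attribute of mrjob's MRJob, so the line goes inside the job class itself. A minimal sketch of where it would sit in a word-count job; the input format string is the one quoted in the answer above and may need to be adapted to whatever InputFormat/codec is actually installed on the cluster:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    # tells Hadoop streaming which InputFormat class to use for reading the input;
    # value copied from the answer above -- adjust to your cluster
    HADOOP_INPUT_FORMAT = 'org.hadoop.snappy.codec'

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()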

g6baxovj2#

I don't think you are using the options correctly.
In your mrjob.conf file:
mapreduce.output.compress: "true" means the output should be compressed
mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec means Snappy is used as the compression codec
Apparently what you want is for the mappers to read the compressed input correctly. Unfortunately, it does not work that way. If you really want to feed compressed data to your job, you could look at SequenceFile. Another, simpler solution is to just feed text files to your job.
What about configuring your input format instead: mapreduce.input.compression.codec: org.apache.hadoop.io.compress.SnappyCodec [Edit: you should also remove the # at the beginning of the lines that define the options, otherwise they will be ignored]
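For concreteness, a sketch of how those options would look in the mrjob.conf from the question once the leading # is removed (note that "true" must stay a quoted string, as the comment in the original file already points out):

runners:
  hadoop:
    jobconf:
      mapreduce.output.compress: "true"
      mapreduce.output.compression.codec: org.apache.hadoop.io.compress.SnappyCodec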
