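I'm trying to import scikit-image in a Python Hadoop Streaming job. I have tried the suggestions in existing Stack Overflow posts, but none of them solved my problem.
The real question is: even if I use -file to distribute a zipped (.zip/.mod) copy of the scikit-image package folder, how does the Python script running on the data nodes extract those packages and import them into the code? Note that I have already installed scikit-image for Python on my name node and can run experiments locally.
My script is simple: the classic word-count example for Python streaming, with one extra "import skimage" in mapper.py. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
My command: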
hadoop jar hadoop-streaming.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-file ./skimage.mod \
-input /user/text/* \
-output /user/textoutput/
Console output:
packageJobJar: [mapper.py, reducer.py, ./skimage.zip] [/usr/lib/gphd/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0/hadoop-streaming-2.0.2-alpha-gphd-2.0.1.0.jar] /tmp/streamjob6159562120374599467.jar tmpDir=null
14/04/04 18:00:02 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/04 18:00:02 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/04 18:00:03 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/04/04 18:00:03 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/04/04 18:00:03 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/04 18:00:03 INFO mapred.FileInputFormat: Total input paths to process : 1
14/04/04 18:00:03 INFO mapreduce.JobSubmitter: number of splits:2
14/04/04 18:00:03 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/04/04 18:00:03 WARN conf.Configuration: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/04 18:00:03 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/04/04 18:00:03 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/04/04 18:00:03 WARN conf.Configuration: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/04/04 18:00:03 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/04/04 18:00:03 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/04/04 18:00:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1384839777050_0106
14/04/04 18:00:04 INFO client.YarnClientImpl: Submitted application application_1384839777050_0106 to ResourceManager at hdm3.gphd.local/172.28.9.252:8032
14/04/04 18:00:04 INFO mapreduce.Job: The url to track the job: http://hdm3.gphd.local:8088/proxy/application_1384839777050_0106/
14/04/04 18:00:04 INFO mapreduce.Job: Running job: job_1384839777050_0106
14/04/04 18:00:08 INFO mapreduce.Job: Job job_1384839777050_0106 running in uber mode : false
14/04/04 18:00:08 INFO mapreduce.Job: map 0% reduce 0%
14/04/04 18:00:12 INFO mapreduce.Job: Task Id : attempt_1384839777050_0106_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
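I checked the error logs for the Hadoop job. They complain that "import skimage" fails because the module cannot be found, which means the data nodes are not picking it up.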
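One way to narrow this down is to have the mapper report what it can actually see on the data node before it exits. The following is only a debugging sketch: the helper names describe_environment and try_import are hypothetical, and it assumes that whatever the mapper writes to stderr shows up in the task's stderr log.

```python
import os
import sys
import traceback


def describe_environment(stream=sys.stderr):
    """Write the task's working directory, its files and sys.path to the stream."""
    stream.write("cwd: %s\n" % os.getcwd())
    stream.write("files: %s\n" % sorted(os.listdir(".")))
    stream.write("sys.path: %s\n" % sys.path)


def try_import(module_name, stream=sys.stderr):
    """Import a module by name; on failure, log diagnostics and return None."""
    try:
        return __import__(module_name)
    except ImportError:
        describe_environment(stream)
        traceback.print_exc(file=stream)
        return None


if __name__ == "__main__":
    # In mapper.py this would be called before the word-count loop.
    if try_import("skimage") is None:
        sys.exit(1)
```

If the shipped archive appears in the file listing but the import still fails, the archive is reaching the node and the problem is with how it is being loaded.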
1 Answer
Have you tried the zipimport solution? Here is an example: "Hadoop: how to include third-party library in Python MapReduce".
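A minimal sketch of the zipimport idea, under several assumptions: the archive shipped with -file ends up in the task's working directory, and the package inside it is pure Python. The names build_demo_archive, import_from_archive and demo_pkg are hypothetical and only make the example runnable on its own. In a real mapper.py only the sys.path step would be needed.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_archive(archive_path, package_name="demo_pkg"):
    """Create a tiny pure-Python package in a zip (stand-in for the shipped archive)."""
    source = "def greet():\n    return 'hello from zip'\n"
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr(package_name + "/__init__.py", source)


def import_from_archive(archive_path, package_name):
    """Put the archive on sys.path so Python's zipimport machinery can load from it."""
    archive_path = os.path.abspath(archive_path)
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    return importlib.import_module(package_name)


if __name__ == "__main__":
    # Assumption: in a streaming task the archive would already sit in the
    # current working directory, e.g. "./skimage.mod".
    archive = os.path.join(tempfile.mkdtemp(), "demo_pkg.mod")
    build_demo_archive(archive)
    pkg = import_from_archive(archive, "demo_pkg")
    sys.stderr.write(pkg.greet() + "\n")
```

One caveat: zipimport cannot load compiled extension modules (.so files) from inside an archive, and scikit-image contains such modules, so this approach may not be enough for skimage on its own. In that case the package would probably have to be installed on every node, or shipped as an archive that is unpacked on the task nodes (for example with the -archives option) and added to sys.path as a directory.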