Bootstrapping libraries on EMR with Python mrjob

Posted by vcirk6k6 on 2021-06-03 in Hadoop

Problem statement:

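I am trying to run a MapReduce job on Amazon EMR using the Python mrjob library, but I am having trouble bootstrapping the nodes with the required libraries and packages.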

Details:

My sample Python mrjob code:

    import re
    from mrjob.job import MRJob
    from sentClassifier import sentClassify
    import nltk

    .. do something ..

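Some libraries, such as nltk, need to be imported, and I am also importing a local module with from sentClassifier import sentClassify. I would like to know the best way to bootstrap the EMR nodes so that these modules and packages are available. The code runs fine on my local machine.
My sample mrjob.conf file: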

runners:
      emr:
        aws_access_key_id:***
        aws_secret_access_key:***
        ec2_core_instance_type: m1.large
        ec2_key_pair: mykey
        ec2_key_pair_file: mykey.pem
        num_ec2_core_instances: 5
        pool_wait_minutes: 2
        pool_emr_job_flows: true
        ssh_tunnel_is_open: true
        ssh_tunnel_to_job_tracker: true
      hadoop:
        setup:
          - virtualenv venv
          - . venv/bin/activate
          - pip install mr3po simplejson
          - sudo easy_install https://code.google.com/p/nltk/downloads/detail?name=nltk-2.0b9-py2.6.egg&can=2&q=

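But the job fails.
I have read through the following references and tried their various approaches, still without success:
mrjob documentation: EMR runner configuration
mrjob documentation: Hadoop runner configuration
Salmon Run blog post
Error log: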

Scanning SSH logs for probable cause of failure
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
    File "obidroidMR.py", line 5, in <module>
       import nltk
       ImportError: No module named nltk
       (while reading from s3://mrjob-51b9493c1a467671/tmp/obidroidMR.shreyas.20140503.012933.336228/files/STDIN)
       Attempting to terminate job...
       Job appears to have already been terminated
       Killing our SSH tunnel (pid 12909)
       Traceback (most recent call last):
         File "obidroidMR.py", line 107, in <module>
         ObidroidReview.run()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
         mr_job.execute()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
         self.run_job()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 809, in _run
         self._wait_for_job_to_complete()
         File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
         raise Exception(msg)
         Exception: Job on job flow j-2R8G1Q3RIE9ED failed with status WAITING: Waiting after step failed
         Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
         Traceback (most recent call last):
         File "obidroidMR.py", line 5, in <module>
         import nltk
         ImportError: No module named nltk

Any help would be greatly appreciated.

Answer #1 (nwlls2ji)

Given that Amazon Elastic MapReduce uses an Amazon Linux-based AMI, I verified that nltk can be installed on Amazon Linux AMI 2014.03.1 - ami-fb8e9292 (64-bit) using the following commands:

sudo easy_install -U pip
sudo easy_install -U distribute
sudo pip install -U pyyaml nltk

You could try incorporating these 3 lines into your mrjob.conf.
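As a rough sketch of what that could look like, the snippet below builds a config with the three commands placed under runners: emr: and prints it as JSON, which is also valid YAML. It assumes the bootstrap_cmds option that the other answer in this thread uses; the names NLTK_BOOTSTRAP_CMDS, build_emr_conf and render_conf are made up for illustration and are not part of mrjob.

```python
import json

# The three commands from this answer, in the order they were tested.
NLTK_BOOTSTRAP_CMDS = [
    "sudo easy_install -U pip",
    "sudo easy_install -U distribute",
    "sudo pip install -U pyyaml nltk",
]


def build_emr_conf(bootstrap_cmds):
    """Return an mrjob.conf-shaped dict with the commands under runners/emr.

    Assumption: bootstrap_cmds is the option name, as used elsewhere in
    this thread. Credentials and instance settings are left out on purpose.
    """
    return {"runners": {"emr": {"bootstrap_cmds": list(bootstrap_cmds)}}}


def render_conf(conf):
    """Serialize the config as indented JSON (a subset of YAML)."""
    return json.dumps(conf, indent=2)


if __name__ == "__main__":
    print(render_conf(build_emr_conf(NLTK_BOOTSTRAP_CMDS)))
```

The printed fragment would still need to be merged by hand with the existing emr: settings (keys, instance types and so on).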

Answer #2 (dauxcl2d)

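The mrjob.conf lines that install the packages are probably not in the right place. Settings that should apply to jobs running on EMR belong under emr: rather than hadoop: (which is the configuration used when running jobs on a local Hadoop installation).
If the installation is a simple Linux command such as pip or apt-get, you should be able to install packages like this: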

runners:
  emr:
    aws_access_key_id:***
    ... all the other stuff ...
    bootstrap_cmds:
    - sudo apt-get install -y python-boto
    - sudo pip install simplejson

I have never tried to install nltk specifically, so I can't help with that part, but you should be able to install it along these lines.
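To make the placement issue concrete, here is a small sketch that models the question's mrjob.conf as a plain dict and collects the install commands visible to each runner. It is only an illustration of the point above, not mrjob's own config loader; QUESTION_CONF, INSTALL_KEYS and install_commands are hypothetical names, and the nltk URL is abbreviated as a placeholder.

```python
# Hypothetical model of the question's mrjob.conf, reduced to the relevant keys.
QUESTION_CONF = {
    "runners": {
        "emr": {"num_ec2_core_instances": 5},
        "hadoop": {
            "setup": [
                "virtualenv venv",
                ". venv/bin/activate",
                "pip install mr3po simplejson",
                "sudo easy_install <nltk egg URL>",  # placeholder, not a real URL
            ]
        },
    }
}

# Option names that appear in this thread for running install commands.
INSTALL_KEYS = ("setup", "bootstrap_cmds")


def install_commands(conf, runner):
    """Collect the install commands listed under runners/<runner>."""
    section = conf.get("runners", {}).get(runner, {})
    commands = []
    for key in INSTALL_KEYS:
        commands.extend(section.get(key, []))
    return commands


if __name__ == "__main__":
    for runner in ("emr", "hadoop"):
        print(runner, len(install_commands(QUESTION_CONF, runner)))
```

With the question's layout, the emr section contributes no install commands at all, which matches the ImportError for nltk in the task logs.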
For installs that may be more complicated, I suggest using the EMR CLI to ssh into the master node:

$ ./elastic-mapreduce -j JOB_FLOW_ID --ssh

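and trying to install the package there. Once you find a sequence of shell commands that installs the package successfully, you can simply copy and paste them into your mrjob.conf.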
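One way to confirm that a sequence of install commands actually worked, before copying it into mrjob.conf, is to try importing the modules on the node. The following is a minimal sketch; missing_modules is a hypothetical helper name, and the list of required modules is an assumption based on the packages mentioned in the question.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported here."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list: the third-party packages named in the question.
    required = ["nltk", "simplejson"]
    missing = missing_modules(required)
    if missing:
        print("Still missing: " + ", ".join(missing))
    else:
        print("All required modules import cleanly")
```

Run it with the same Python interpreter the job will use. A clean result on the master node only shows the commands work there; the same commands still have to run on the other nodes at bootstrap.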
