Hadoop and NLTK: fails with stopwords

Posted by 5cnsuln7 on 2021-06-03 in Hadoop

I am trying to run a Python program on Hadoop. The program involves the NLTK library and uses the Hadoop Streaming API, as described below.
mapper.py:


#!/usr/bin/env python

import sys
import nltk
from nltk.corpus import stopwords

# print stopwords.words('english')

for line in sys.stdin:
    print line,

reducer.py:


#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

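This runs perfectly, and the output contains just the lines of the input file.
However, when this line (from mapper.py):

print stopwords.words('english')

is uncommented, the program fails and says
"Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1."
I have checked, and in a standalone Python program,
print stopwords.words('english')
works perfectly fine, so I am completely stumped as to why it causes my Hadoop program to fail.
I would greatly appreciate any help! Thank you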
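
The error message above only says that a map task failed, not why. One way to narrow it down, sketched here under assumptions, is to catch the failure inside the mapper and write the traceback to stderr, which Hadoop Streaming normally keeps in the per-task logs. The helper name `load_english_stopwords` is hypothetical, and the sketch uses Python 3 syntax rather than the Python 2 of the original scripts.

```python
import sys
import traceback


def load_english_stopwords(log=sys.stderr):
    """Return the NLTK English stopword list, or None if loading fails.

    Any failure (nltk not installed on the node, corpus data not found,
    and so on) is written to `log` with a full traceback.
    """
    try:
        # Import inside the try block: the import itself may fail on a node.
        from nltk.corpus import stopwords
        return list(stopwords.words('english'))
    except Exception:
        log.write("could not load NLTK stopwords:\n")
        traceback.print_exc(file=log)
        return None
```

If the function returns None on the cluster but a list on the local machine, the traceback in the task's stderr log should show whether the nltk package or the stopwords corpus data is what the worker node cannot find.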

hadoop mapreduce python cluster-analysis

Source: https://stackoverflow.com/questions/19057741/hadoop-and-nltk-fails-with-stopwords

2 Answers

qlfbtfca1#

Use these commands to load the packages from their zip archives:

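import zipimport

# Import yaml and nltk directly from zip archives (paths are relative to the working directory)
importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')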

Check the links I pasted above. They cover all the steps.
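
To show the zip import mechanism on its own, here is a minimal, self-contained sketch. It builds a tiny hypothetical package (`demo_pkg`, not part of nltk) inside a zip archive and imports it by placing the archive on `sys.path`, which is an alternative to calling `zipimporter.load_module` directly. Whether nltk and all of its dependencies work from a zip this way is not something this sketch confirms, and the corpus data files would still have to reach the nodes separately.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a zip archive holding a tiny hypothetical package, demo_pkg."""
    archive = os.path.join(directory, "demo_pkg.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo_pkg/__init__.py", "GREETING = 'imported from zip'\n")
    return archive


def import_from_zip(archive, module_name):
    """Put the archive on sys.path so the normal import system can read it."""
    if archive not in sys.path:
        sys.path.insert(0, archive)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


archive = build_demo_zip(tempfile.mkdtemp())
demo = import_from_zip(archive, "demo_pkg")
print(demo.GREETING)  # prints: imported from zip
```

In a streaming job the archive would be shipped next to the mapper, and the mapper would call something like `import_from_zip('nltk.zip', 'nltk')` before using the package.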


6rvt4ljy2#

Is 'english' a file in print stopwords.words('english')? If so, you need to pass it with -file as well so that it is sent to the nodes.
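
As a rough sketch of that idea: NLTK stores its English stopword list as a plain text file with one word per line, so a mapper could read a shipped copy directly without going through the nltk data lookup. The sketch assumes the file was sent with -file under the name `english` and sits in the task's working directory. The function names are hypothetical, and the code is Python 3.

```python
import sys


def load_stopwords(path):
    """Read one stopword per line from a plain text file, skipping blanks."""
    with open(path) as handle:
        return set(line.strip() for line in handle if line.strip())


def remove_stopwords(line, stopwords):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(tok for tok in line.split() if tok.lower() not in stopwords)


def run_mapper(stdin, stdout, stopword_path="english"):
    """Filter every input line and write the result, one line per input line."""
    stopwords = load_stopwords(stopword_path)
    for line in stdin:
        stdout.write(remove_stopwords(line, stopwords) + "\n")
```

A mapper script built on this would end with a call such as `run_mapper(sys.stdin, sys.stdout)`, and the streaming command would need an extra -file option pointing at the local copy of the stopword file.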
