I have been following a few articles on combining NLTK with Hadoop Streaming jobs, for example post1 and post2.
My mapper code is as follows:
#!/usr/bin/env python
'''
trainingMapper1_test.py
'''
import sys
import zipimport

importer = zipimport.zipimporter('nltk.mod')
nltk = importer.load_module('nltk')
from nltk.tokenize import WordPunctTokenizer

def getToken(input):
    tokenizer = WordPunctTokenizer()
    allTokens = tokenizer.tokenize(input)
    return allTokens

def read_input(file):
    for line in file:
        yield getToken(line)

def main(separator='\t'):
    data = read_input(sys.stdin)
    for tokenlist in data:
        for token in tokenlist:
            print('%s%s%d' % (token, separator, 1))

if __name__ == "__main__":
    main()
Basically, I just tokenize each line and print the tokens one by one. The script works fine when tested with:
python trainingMapper1_test.py < input.txt
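For reference, WordPunctTokenizer splits on word/punctuation boundaries, so a line such as "Hello, world!" (just an illustration, not my actual input) is tokenized like this:

>>> from nltk.tokenize import WordPunctTokenizer
>>> WordPunctTokenizer().tokenize("Hello, world!")
['Hello', ',', 'world', '!']

and the mapper then prints each of those tokens followed by a tab and a 1.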
However, when I run it with Hadoop:
hadoop jar $JARFILE \
-D mapreduce.job.reduces=0 \
-input input_dir \
-output justatest \
-file nltk.mod \
-file trainingMapper1_test.py \
-mapper trainingMapper1_test.py
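For context, nltk.mod is expected (by the zipimporter call in the mapper) to be a zip archive with the nltk package directory at its root. One way such an archive can be built, as a rough sketch only (the script name and the use of the locally installed nltk package are assumptions on my part, not details taken from post1/post2):

# build_nltk_mod.py -- illustrative sketch: package the installed nltk
# library into nltk.mod so zipimport can load it in the job's working dir.
import os
import zipfile
import nltk

pkg_dir = os.path.dirname(nltk.__file__)  # e.g. .../site-packages/nltk
with zipfile.ZipFile('nltk.mod', 'w', zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(pkg_dir):
        for name in files:
            full = os.path.join(root, name)
            # store entries as nltk/... so 'nltk' is importable from the zip
            arcname = os.path.join('nltk', os.path.relpath(full, pkg_dir))
            zf.write(full, arcname)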
The input directory here contains only a single file, input.txt, and I get the following exception:
18/04/09 01:03:31 INFO mapreduce.Job: Task Id : attempt_…, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
I have tested it, and it is the import that fails, at this line:
importer = zipimport.zipimporter('nltk.mod')
but I don't know how to fix it. It seems that nltk.mod is either not shipped to the workers or the workers cannot load it. Can anyone help me with this?
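One thing that might help narrow it down (a sketch, not a confirmed fix): wrap the import so the real failure reason ends up on stderr, which Hadoop Streaming keeps in the task attempt logs, roughly like this:

import os
import sys
import zipimport

# Diagnostic sketch: report why loading nltk.mod fails instead of letting
# the mapper die silently; stderr is kept in the task attempt logs.
try:
    importer = zipimport.zipimporter('nltk.mod')
    nltk = importer.load_module('nltk')
except Exception as e:
    sys.stderr.write('loading nltk.mod failed: %r\n' % (e,))
    sys.stderr.write('working dir: %s\n' % os.getcwd())
    sys.stderr.write('files here: %s\n' % os.listdir('.'))
    raise

That would at least show whether nltk.mod reached the task's working directory or whether loading the archive is what fails.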