
bhmjp9jg  posted on 2021-05-29  in  Hadoop

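I'm trying to write a mapper/reducer pair for Hadoop that counts the words in tweets, but I've run into a problem. My input is a JSON file of collected tweet data. I first set the default encoding to UTF-8, but when I run the code I get the following error: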
Traceback (most recent call last):
  File "./mapperworks2.py", line 211, in <module>
    my_json_dict = json.loads(line)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Here is the program's code:


#!/usr/bin/python

import sys
import json
import string

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = ['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 "can't",
 'cannot',
 'could',
 "couldn't",
 'did',
 "didn't",
 'do',
 'does',
 "doesn't",
 'yourselves']

numbers = ["0","1","2","3","4","5","6","7","8","9"]

def clean_word(word):
    for c in string.punctuation:
        word = word.replace(c,"")
    for c in numbers:
        word = word.replace(c,"")
    return word

def dont_stop(word):
    if word in stop_words or word == "":
        return False
    else:
        return True

# input comes from STDIN (standard input)
for line in sys.stdin:
    # each input line is expected to hold one JSON-encoded tweet
    my_json_dict = json.loads(line)
    line = my_json_dict['text'].lower()

    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        word = clean_word(word)
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if dont_stop(word):
            print '%s\t%s' % (word, 1)

When I don't switch the encoding (i.e., with reload(sys) and sys.setdefaultencoding() commented out), I get the following error instead:
Traceback (most recent call last):
  File "./mapperworks2.py", line 236, in <module>
    print '%s\t%s' % (word, 1)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)
I'm not sure how to fix this; any help is appreciated.

qmelpv7a  #1

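See the discussion here: Setting the correct encoding when piping stdout in Python
Your error comes from trying to print a unicode string to the output. When stdout is a pipe rather than a terminal, Python 2 cannot detect an output encoding and falls back to ASCII, so a character such as u'\u2026' cannot be encoded.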
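As a rough illustration of that advice, here is a minimal sketch that encodes each record to UTF-8 explicitly instead of relying on the interpreter's default encoding. The function names (mapper_records, encode_record, run) are hypothetical and not from the original script, stop-word filtering and punctuation cleanup are left out for brevity, and skipping blank or unparseable lines is an assumption; the thread does not confirm that such lines caused the ValueError.

```python
import json


def mapper_records(lines):
    """Yield (word, 1) pairs for the 'text' field of each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # assumption: lines that are not valid JSON are skipped
        if not isinstance(tweet, dict):
            continue  # assumption: only JSON objects carry tweet data
        for word in tweet.get('text', u'').lower().split():
            yield word, 1


def encode_record(word, count):
    """Format one tab-delimited record and encode it as UTF-8 bytes."""
    return (u'%s\t%d\n' % (word, count)).encode('utf-8')


def run(lines, out):
    """Write every record to `out`, which must accept bytes."""
    for word, count in mapper_records(lines):
        out.write(encode_record(word, count))
```

In a streaming mapper this could be invoked as run(sys.stdin, getattr(sys.stdout, 'buffer', sys.stdout)) after importing sys. On Python 2, sys.stdout has no buffer attribute and accepts byte strings directly, while on Python 3 the buffer attribute is the byte-oriented stream. Because the bytes are produced explicitly, the sys.setdefaultencoding() call is not needed in this sketch.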
