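I'm sure I'm doing something stupid, and the mistake is mine. I'm working on a class assignment for my Udacity course "Introduction to MapReduce and Hadoop". Our task is to write a mapper/reducer that counts the occurrences of a word in our data set (the bodies of forum posts). I have an idea of how to do this, but I can't get Python to read the stdin data into the reducer as a dictionary.
My approach so far: the mapper reads through the data (in this case, a test string embedded in the code) and spits out a dictionary of word: count for each forum post: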
#!/usr/bin/python
import sys
import csv
import re
from collections import Counter

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for line in reader:
        body = line[4]
        #Counter(body)
        words = re.findall(r'\w+', body.lower())
        c = Counter(words)
        #print c.items()
        print dict(c)

test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""
# This function allows you to test the mapper with the provided test string
def main():
    import StringIO
    sys.stdin = StringIO.StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()
The count for each forum post then goes to stdout as a line like: {'this': 1, 'is': 1, 'one': 1, 'sentence': 2}
The reducer should then read that stdin in as a dictionary:
#!/usr/bin/python
import sys
from collections import Counter, defaultdict

for line in sys.stdin.readlines():
    print dict(line)
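But this fails with the error message: ValueError: dictionary update sequence element #0 has length 1; 2 is required
This means (if I understand correctly) that each line is being read in as a text string rather than as a dict. How do I get Python to understand that the input line is a dict? I have tried using Counter and defaultdict, but I either hit the same problem or each character gets read in as an element of a list, which is not what I want either.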
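To illustrate that diagnosis, here is a minimal sketch that reproduces the error outside of Hadoop. It is written with Python 3 syntax rather than the Python 2 used in the assignment code, and the helper name `try_dict` is made up for this example. `dict()` iterates over the string character by character and expects each element to be a key/value pair, so the very first character, `{`, is rejected for having length 1.

```python
# One line as the reducer would receive it on stdin: plain text, not a dict.
line = "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n"

def try_dict(text):
    """Return the message of the ValueError raised by dict(text), or None."""
    try:
        dict(text)  # iterates over characters; each one must be a (key, value) pair
    except ValueError as exc:
        return str(exc)
    return None

error_message = try_dict(line)
print(error_message)
```

The text has to be parsed into a dict first; passing it to `dict()`, `Counter()` or `defaultdict()` only changes how the characters are consumed.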
Ideally, I want the reducer to read in each line's dict and then add the next line's values, so that after the second line the values are {'this':1,'is':1,'one':2,'sentence':3,'also':1}
and so on.
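The running total described above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python 3 syntax, the names `reduce_counts` and `sample` are hypothetical, and it assumes every input line is the printed form of a dict exactly as the mapper emits it. `ast.literal_eval` turns each line back into a dict, and `Counter.update` adds the counts instead of overwriting them.

```python
import ast
from collections import Counter

def reduce_counts(lines):
    """Sum per-post word counts, one printed dict per input line."""
    total = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # literal_eval only accepts Python literals, so it parses the
        # printed dict without executing arbitrary code.
        total.update(ast.literal_eval(line))
    return total

# Two lines shaped like the mapper output for the first two test posts.
sample = [
    "{'this': 1, 'is': 1, 'one': 1, 'sentence': 2}\n",
    "{'also': 1, 'one': 1, 'sentence': 1}\n",
]
totals = reduce_counts(sample)
print(dict(totals))
```

In the actual reducer, `sys.stdin` could presumably be passed in place of `sample`, since it also yields one line at a time.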
Thanks, jr
1 Answer
Answer #1 (pgvzfuti)
Thanks to @keyser, the ast.literal_eval() method worked for me. Here is what I have now: