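I want my Python program to output a list of the ten most frequent words along with their counts. I have to use mrjob (MapReduce) to write this program. I have written a program that finds the word frequencies and outputs all of them, from first to last, but I don't know how to output only the ten most frequent words. I was thinking I could put them in a list and sort it with a second map-reduce step, but I don't know how to do that with MapReduce. I wrote the new program below using MapReduce and Python. Can anyone give me some advice?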
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Word frequency from book sorted by frequency
# File: book.txt

# regular expression used to identify word
WORD_REGEXP = re.compile(r"[\w']+")


class MRWordFrequencyCount(MRJob):

    def steps(self):
        # 2 steps
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_make_counts_key,
                   reducer=self.reducer_output_words)
        ]

    # Step 1
    def mapper_get_words(self, _, line):
        words = WORD_REGEXP.findall(line)
        for w in words:
            yield w.lower(), 1

    def reducer_count_words(self, word, values):
        yield word, sum(values)

    # Step 2
    def mapper_make_counts_key(self, word, count):
        # sort by values
        yield '%04d' % int(count), word

    def reducer_output_words(self, count, words):
        # First Column is the count
        # Second Column is the word
        for word in words:
            yield count, word


if __name__ == '__main__':
    MRWordFrequencyCount.run()
1 Answer
kqqjbcuj #1
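The result is an unordered collection of key/value pairs. One solution is to convert it to a list of tuples: each word stays associated with its count, and the list gives you an order you can sort on. See https://docs.python.org/2/howto/sorting.html#sort-stability-and-complex-sorts. Once the list is sorted by count, you can slice off the 10 most common words.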
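The sort-and-slice idea above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the names top_n, mapper_to_single_key and reducer_top_ten are hypothetical and not part of the original program or of mrjob. The two generator functions mimic the shape of an mrjob mapper/reducer pair (without the self argument) for a possible final step in which every pair is sent to the same key, so that a single reducer sees all the counts.

```python
def top_n(pairs, n=10):
    """Return the n (word, count) pairs with the largest counts, highest first."""
    # Convert the unordered key/value pairs into a list so they can be sorted.
    items = list(pairs)
    # Sort by count descending; break ties alphabetically by word so the
    # result is deterministic.
    items.sort(key=lambda wc: (-wc[1], wc[0]))
    # Slice off the first n entries.
    return items[:n]


def mapper_to_single_key(word, count):
    # Hypothetical final-step mapper: emit every pair under the same key
    # (None) so that one reducer call receives the whole set of counts.
    yield None, (word, count)


def reducer_top_ten(_, word_count_pairs):
    # Hypothetical final-step reducer: keep only the ten largest counts.
    for word, count in top_n(word_count_pairs, 10):
        yield word, count
```

In an MRJob subclass these would be methods taking self and wired in through an extra MRStep. Routing everything to one key means a single reducer handles the whole vocabulary, which is likely acceptable for one book but may not scale to very large inputs.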