mapreduce

8dtrkrch  posted on 2021-06-02  in Hadoop

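I want my Python program to output a list of the ten most frequently used words together with their word counts. I have to use mrjob (MapReduce) to build this program. I wrote a program that finds the word frequencies and outputs all of them in sorted order, but I don't know how to output only the ten most frequent words. I was thinking I could put them in a list and sort it with a second map-reduce step, but I don't know how to do that with MapReduce. I'm new to writing programs with MapReduce and Python. Can anyone give me some suggestions?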

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Word frequency from book sorted by frequency

# File: book.txt

# regular expression used to identify word

WORD_REGEXP = re.compile(r"[\w']+")

class MRWordFrequencyCount(MRJob):

    def steps(self):
        # 2 steps
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_make_counts_key,
                   reducer=self.reducer_output_words)
        ]

    # Step 1
    def mapper_get_words(self, _, line):
        words = WORD_REGEXP.findall(line)
        for w in words:
            yield w.lower(), 1

    def reducer_count_words(self, word, values):
        yield word, sum(values)

    # Step 2
    def mapper_make_counts_key(self, word, count):
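        # use the zero-padded count as the key so that the shuffle's
        # string sort matches numeric order (only for counts up to 9999)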
        yield '%04d' % int(count), word

    def reducer_output_words(self, count, words):
        # First Column is the count
        # Second Column is the word
        for word in words:
            yield count, word

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Answer #1 (kqqjbcuj)

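The result is an unordered collection of key, value pairs. One solution is to convert it to a list of tuples, because that keeps each word associated with its count while giving you an index to sort on. See https://docs.python.org/2/howto/sorting.html#sort-stability-and-complex-sorts. After sorting, you can slice off the first 10 entries to get the most frequent words.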
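The idea of sorting a list of tuples and slicing it can be sketched in plain Python as follows. This is only a minimal illustration, not the mrjob job itself: the helper names `top_n_words` and `top_n_words_heap` and the sample counts are made up for the example, and ties are broken alphabetically purely so the output is deterministic.

```python
import heapq


def top_n_words(word_counts, n=10):
    """Return the n (word, count) tuples with the highest counts.

    word_counts is any iterable of (word, count) tuples, in any order.
    """
    # Sort by count descending, then by word ascending to break ties.
    ranked = sorted(word_counts, key=lambda pair: (-pair[1], pair[0]))
    # Slice off the first n entries.
    return ranked[:n]


def top_n_words_heap(word_counts, n=10):
    """Same idea without sorting the whole list, using heapq.nlargest."""
    return heapq.nlargest(n, word_counts, key=lambda pair: pair[1])


# Hypothetical sample data standing in for the word-count output.
counts = {"the": 120, "and": 95, "of": 80, "a": 80, "whale": 12}

top_three = top_n_words(counts.items(), n=3)
for word, count in top_three:
    print(count, word)
```

Inside an mrjob job, the same selection could plausibly go into the reducer of the second step: if the second mapper yields `None` as the key and `(count, word)` as the value, every pair arrives at a single reducer, which can then keep only the ten largest. That arrangement is an assumption about how you would wire the steps, not something the original answer spells out.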
