Why is my Hadoop output split into many part files?

Posted by ldfqzlk8 on 2021-06-03 in Hadoop


I am trying to count word frequencies and write the result to a file. Here is mapper.py:


#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

I run it with this Hadoop command:

hadoop streaming \
-input "/app/hadoop_learn_test/book.txt" \
-mapper "python mapper.py" \
-reducer "cat" \
-output "/app/hadoop_learn_test/book_out" \
-file "mapper.py"

The book.txt file contains:

foo foo quux labs foo bar quux

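But I end up with 400 files with names like part-00000.gz, and when I run hadoop dfs -cat path I get nothing.
Why can't I get the result?
When I run cat book.txt | python mapper.py | sort in a local terminal, I get the following: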

bar     1
foo     1
foo     1
foo     1
labs    1
quux    1
quux    1

hadoop python

Source: https://stackoverflow.com/questions/24532050/why-my-hadoop-output-is-many-parts-of-file

3 Answers


unhi4e5o #1

I think you need to use a Counter:


#!/usr/bin/env python

import sys
from collections import Counter

# input comes from STDIN (standard input)

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # count how often each word occurs on this line
    wordcount = Counter(words)

    # tab-delimited output: word, then its count within this line
    for word, count in wordcount.items():
        print('%s\t%s' % (word, count))
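
Note that the mapper above counts words only within each input line, and the reducer in the question is `cat`, which passes the pairs through without adding them up. The following is a minimal sketch of a summing reducer, not part of the original answer. It assumes the reducer receives tab-separated `word<TAB>count` lines already sorted by word, which is how Hadoop Streaming delivers mapper output to a reducer. The file name reducer.py and the function name sum_counts are hypothetical.

```python
#!/usr/bin/env python
# reducer.py (hypothetical name): sums the counts emitted by mapper.py.
import sys
from itertools import groupby


def sum_counts(lines):
    """Yield (word, total) from 'word<TAB>count' lines that are sorted by word."""
    # Skip blank lines and split each record into [word, count].
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent equal keys, so the input must be sorted.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    for word, total in sum_counts(sys.stdin):
        print("%s\t%d" % (word, total))
```

It can be checked locally with cat book.txt | python mapper.py | sort | python reducer.py, and would be passed to the job with -reducer "python reducer.py" plus an extra -file "reducer.py" in place of -reducer "cat".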

Answered 2021-06-04

k10s72fa #2

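Try setting the mapred.reduce.tasks property to 1.
You can pass it on the hadoop command line with -D mapred.reduce.tasks=1
Addendum: each reducer in a MapReduce job creates one output file. So if you have 400 files, you basically have 400 reducers.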
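
To illustrate why running -cat on a single part file can show nothing, here is a small simulation, offered as a sketch rather than Hadoop's actual code. It assumes keys are assigned to reducers by hashing the key modulo the number of reducers, which is how Hadoop's default partitioner behaves. The zlib.crc32 hash is only a stand-in, and the function names are hypothetical. With four distinct words and 400 reducers, at most four part files can contain any records.

```python
import zlib


def partition(key, num_reducers):
    """Stand-in hash partitioner: the same key always maps to the same reducer."""
    # zlib.crc32 is an illustrative hash only; Hadoop computes its own key hash.
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def non_empty_partitions(words, num_reducers):
    """Return the set of reducer indices that receive at least one record."""
    return {partition(word, num_reducers) for word in words}


words = "foo foo quux labs foo bar quux".split()
used = non_empty_partitions(words, 400)
print("non-empty part files: %d of 400" % len(used))
print("with one reducer: %d of 1" % len(non_empty_partitions(words, 1)))
```

Every other part file in the 400-reducer case receives no records, whereas with a single reducer all records land in one file.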

Answered 2021-06-04

c9x0cxw0 #3

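The reason there are many output files is that Hadoop is a framework for distributing processing across many computers (and many CPUs). If you need a single file at the end, it can only be produced by a single process (or thread), which defeats the whole point of Hadoop.
To cat the entire output, just use a wildcard: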