Reading many files from the Hadoop MapReduce distributed cache

Posted by kiayqfof in Hadoop on 2021-06-03


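I have a set of files, say 10 files, plus one large file that is the concatenation of all 10.
I add them to the distributed cache through the job conf.
When I read them in reduce, I observe the following:
In the reduce method I read only selected files from those added to the distributed cache. I expected this to be faster, because the file read in each reduce call is smaller than the large file that would otherwise be read in every reduce call. However, it is slower.
Also, when I split the data into even smaller files and add them to the distributed cache, the problem gets worse. The job itself starts running only after a long delay.
I cannot find the reason. Please help.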

Java hadoop mapreduce distributed-caching distributed-computing

Source: https://stackoverflow.com/questions/13190162/reading-many-files-hadoop-mapreduce-distributed-cache

1 answer


o4hqfura1#

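I think your problem lies in reading the files inside reduce(). You should read them in configure() (with the old API) or in setup() (with the new API). That way each file is read only once per reducer, rather than once for every input group of that reducer (that is, on every call to the reduce method).
You can write something like the following. Using the new MapReduce API (org.apache.hadoop.mapreduce.*):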

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

    ...
    Path file1;
    Path file2;
    ...

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per reducer task: look up the local copies of the
        // files that were added to the distributed cache.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        file1 = cacheFiles[0];
        file2 = cacheFiles[1];

        // Parse the files and keep their data in memory (for example in an
        // ArrayList or HashMap) for use in the reduce method.
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ...
    }
}

Using the old mapred API (org.apache.hadoop.mapred.*):

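import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    ...
    Path file1;
    Path file2;
    ...

    @Override
    public void configure(JobConf job) {
        // Runs once per reducer task. configure() cannot declare checked
        // exceptions, so the IOException from the cache lookup is wrapped.
        try {
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
            file1 = cacheFiles[0];
            file2 = cacheFiles[1];
            ...
        } catch (IOException e) {
            throw new RuntimeException("Could not read distributed cache files", e);
        }

        // Parse the files and keep their data in memory (for example in an
        // ArrayList or HashMap) for use in the reduce method.
    }

    @Override
    public synchronized void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        ...
    }
}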
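
The "parse the file and keep its data in memory" step in setup() or configure() could look roughly like the sketch below. It is only an illustration, not part of the original answer: the class name CacheFileLoader, the load method, and the tab-separated "key<TAB>value" line format are all assumptions, and it uses only the Java standard library, on the assumption that the cached file is available as a plain local filesystem path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: loads a local file of
// "key<TAB>value" lines into a HashMap so that reduce() can do
// in-memory lookups instead of re-reading the file on every call.
final class CacheFileLoader {

    private CacheFileLoader() {}

    static Map<String, String> load(String localPath) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    // Assumed format: lines without a tab (blank or malformed) are skipped.
                    continue;
                }
                // If a key appears more than once, the last occurrence wins.
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }
}
```

In setup() or configure() this might be called once per cached file, for example CacheFileLoader.load(file1.toString()), with the resulting map stored in a field of the reducer. Whether the data fits comfortably in the reducer's heap depends on the file sizes, which the question does not state.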

Answered on 2021-06-03
