How do I determine the right number of mappers in Hadoop?

Posted by db2dz4w8 on 2021-06-04 in Hadoop

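I gave my Hadoop program an input file of 4 MB (100k records). Since each HDFS block is 64 MB and the file fits in a single block, I set the number of mappers to 1. However, when I increase the number of mappers (say, to 24), the running time improves considerably. I don't understand why, since the whole file can be read by a single mapper.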
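For reference, the arithmetic behind this observation can be sketched in plain Java. This is a minimal sketch, not Hadoop code: it assumes the job uses the old `org.apache.hadoop.mapred` `FileInputFormat` (as the `JobConf`-based code suggests), that the mapper count is passed as a hint via `JobConf.setNumMapTasks`, and that the minimum split size is left at 1 byte. The class and method names `SplitSizeSketch` and `estimateSplits` are made up for illustration.

```java
class SplitSizeSketch {
    // The old FileInputFormat lets the last split run up to 10% over the split size.
    static final double SPLIT_SLOP = 1.1;

    // Same formula as the old-API split size: max(minSize, min(goalSize, blockSize)).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Estimates the number of input splits (and so map tasks) for one non-empty file.
    // requestedMaps is the hint the job passes; minSize is assumed to be 1 byte by default.
    static long estimateSplits(long fileSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long remaining = fileSize;
        long count = 0;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            count++;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 4 * mb;   // the 4 MB input from the question
        long blockSize = 64 * mb; // the 64 MB HDFS block from the question

        System.out.println("hint 1  -> " + estimateSplits(fileSize, 1, 1L, blockSize) + " split(s)");
        System.out.println("hint 24 -> " + estimateSplits(fileSize, 24, 1L, blockSize) + " split(s)");
    }
}
```

Under these assumptions, a hint of 1 gives a single split for the 4 MB file, while a hint of 24 gives 24 splits of roughly 170 KB each, so the file is no longer processed by only one mapper even though it occupies a single block.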
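A brief description of the algorithm: the clusters are read from the DistributedCache in the configure function and stored in a global variable called clusters. The mapper reads each chunk line by line and finds the cluster that each line belongs to. Here is some of the code: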

public void configure(JobConf job) {
        // Retrieve the clusters from the DistributedCache
        try {
            Path[] eqFile = DistributedCache.getLocalCacheFiles(job);
            BufferedReader reader = new BufferedReader(new FileReader(eqFile[0].toString()));

            String line;
            while ((line = reader.readLine()) != null) {
                // Construct the cluster represented by `line` and add it to a global variable called `clusters`

            }

            reader.close();

        } catch (IOException e) {
            // The cache file could not be read, so print the stack trace
            e.printStackTrace();
        }
    }

And the mapper:

public void map(LongWritable key, Text value, OutputCollector<IntWritable, EquivalenceClsAggValue> output, Reporter reporter) throws IOException {
         // Assign each record to one of the existing clusters in `clusters`.

        String record = value.toString();
        EquivalenceClsAggValue outputValue = new EquivalenceClsAggValue();
        outputValue.addRecord(record);
        int eqID = MondrianTree.findCluster(record, clusters);
        IntWritable outputKey = new IntWritable(eqID);
        output.collect(outputKey,outputValue);          
    }

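I have input files of different sizes (from 4 MB to 4 GB). How can I find the optimal number of mappers/reducers? Each node in my Hadoop cluster has 2 cores, and I have 58 nodes.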
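To make the numbers in this question concrete, here is a rough back-of-the-envelope sketch in plain Java. It assumes one split per 64 MB block (the default when no mapper hint is given) and one concurrently running map task per core; that second assumption is only a rule of thumb, not something Hadoop enforces. The names `MapWaveSketch`, `defaultSplits` and `waves` are hypothetical.

```java
class MapWaveSketch {
    // Approximate number of block-sized splits for a file (ceiling division, at least 1).
    static long defaultSplits(long fileSizeBytes, long blockSizeBytes) {
        return Math.max(1, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    // Number of "waves" needed to run the given tasks on the given number of slots.
    static long waves(long tasks, long slots) {
        return (tasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;       // HDFS block size from the question
        long slots = 58L * 2L;          // 58 nodes x 2 cores, assuming one task per core

        long[] fileSizes = {4 * mb, 4096 * mb}; // the 4 MB and 4 GB extremes
        for (long size : fileSizes) {
            long splits = defaultSplits(size, blockSize);
            System.out.println((size / mb) + " MB -> " + splits + " split(s), "
                    + waves(splits, slots) + " wave(s) on " + slots + " cores");
        }
    }
}
```

With these assumptions, the 4 MB file yields 1 split and the 4 GB file yields 64 splits, against 116 cores in total, so block-sized splits alone leave part of the cluster idle for every input size mentioned here.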

hadoop mapreduce

Source: https://stackoverflow.com/questions/16972589/how-to-determine-the-right-number-of-mappers-in-hadoop