How do I rewrite a Java program as a Hadoop job?

tcbh2hod · asked 2021-06-02 · in Hadoop

What is the absolute minimum set of changes needed to make a Java program fit the MapReduce model?
Here is my Java program:

import java.io.*;

class evmTest {

    public static void main(String[] args) {

        try {
            Runtime rt = Runtime.getRuntime();
            String command = "evm --debug --code 7f00000000000000000000000000000000000000000000000000000000000000027f00000000000000000000000000000000000000000000000000000000000000027f00000000000000000000000000000000000000000000000000000000000000020101 run";
            Process proc = rt.exec(command);

            BufferedReader stdInput = new BufferedReader(
                    new InputStreamReader(proc.getInputStream()));

            BufferedReader stdError = new BufferedReader(
                    new InputStreamReader(proc.getErrorStream()));

            // read the output from the command
            System.out.println("Here is the standard output of the command:\n");
            String s = null;
            while ((s = stdInput.readLine()) != null) {
                System.out.println(s);
            }

            // read any errors from the attempted command
            System.out.println("Here is the standard error of the command (if any):\n");
            while ((s = stdError.readLine()) != null) {
                System.out.println(s);
            }

        } catch (IOException e) {
            System.out.println(e);
        }
    }
}

It prints the output of the terminal command, which looks like this:

Here is the standard output of the command:

0x
Here is the standard error of the command (if any):

#### TRACE ####

PUSH32          pc=00000000 gas=10000000000 cost=3

PUSH32          pc=00000033 gas=9999999997 cost=3
Stack:
00000000  0000000000000000000000000000000000000000000000000000000000000002

PUSH32          pc=00000066 gas=9999999994 cost=3
Stack:
00000000  0000000000000000000000000000000000000000000000000000000000000002
00000001  0000000000000000000000000000000000000000000000000000000000000002

ADD             pc=00000099 gas=9999999991 cost=3
Stack:
00000000  0000000000000000000000000000000000000000000000000000000000000002
00000001  0000000000000000000000000000000000000000000000000000000000000002
00000002  0000000000000000000000000000000000000000000000000000000000000002

ADD             pc=00000100 gas=9999999988 cost=3
Stack:
00000000  0000000000000000000000000000000000000000000000000000000000000004
00000001  0000000000000000000000000000000000000000000000000000000000000002

STOP            pc=00000101 gas=9999999985 cost=0
Stack:
00000000  0000000000000000000000000000000000000000000000000000000000000006

#### LOGS ####

And here, of course, is one of the simplest MapReduce jobs from the Apache examples:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

My question is: what is the simplest way to map-reduce the Java program I shared at the top of this post?

Update

Running it with the following command:

$HADOOP_HOME/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar -D mapreduce.job.reduces=0 -input /input_0 -output /steaming-output -mapper ./mapper.sh

resulted in this error:

17/09/26 03:26:56 INFO mapreduce.Job: Task Id : attempt_1506277206531_0004_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object


Answer 1 (yacmzcpb):

So, this is not an attempt to give you a solution, but rather a push in the direction you should be going.
As mentioned, a few things first.
Let's say you have your codes in hdfs:///input/codes.txt:
7f0000000002812
7f000000000281a
7f000000000281b
7f000000000281c
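
If codes.txt only exists on your local machine so far, a standard HDFS shell command would get it there:

hdfs dfs -put codes.txt /input/codes.txt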

The really "simple" WordCount code can actually work on this data! But obviously you don't need to count anything here, and you don't even need a reducer. You have a map-only job, which would start out something like this:

private final Runtime rt = Runtime.getRuntime();

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String command = "evm --debug --code " + value.toString() + " run";
    Process proc = rt.exec(command);

    context.write( ... some_key, some_value ...);
}
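
Fleshed out, a complete map-only job could look like the sketch below. It assumes the evm binary is installed and on the PATH of every worker node; the class names and the (code, trace) output pairing are illustrative choices, not something the Hadoop API prescribes:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EvmTrace {

  public static class EvmMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String code = value.toString().trim();
      if (code.isEmpty()) {
        return; // skip blank input lines
      }
      // assumes evm is installed and on the PATH of every node
      Process proc = Runtime.getRuntime()
          .exec(new String[] {"evm", "--debug", "--code", code, "run"});

      // as the question's output shows, evm writes the trace to stderr
      StringBuilder trace = new StringBuilder();
      try (BufferedReader stderr = new BufferedReader(
          new InputStreamReader(proc.getErrorStream()))) {
        String line;
        while ((line = stderr.readLine()) != null) {
          trace.append(line).append('\n');
        }
      }

      // emit (code, trace); any key/value pairing would do
      context.write(new Text(code), new Text(trace.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "evm trace");
    job.setJarByClass(EvmTrace.class);
    job.setMapperClass(EvmMapper.class);
    job.setNumReduceTasks(0); // map-only: no combiner, no reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}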

However, you don't actually need Java at all. You have a shell command, so you can use Hadoop Streaming to run it, "streaming" the codes from HDFS into `stdin` for your script.
That mapper would look like this:

#!/bin/bash
# mapper.sh

while read code; do
    evm --debug --code "$code" run
done
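
Note that mapper.sh must be executable (chmod +x mapper.sh) and must be available to every task; Hadoop Streaming can ship it with the job via its -file option, along these lines:

-mapper mapper.sh -file ~/mapper.sh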

You can even test the code locally, without Hadoop at all (and if you are wondering whether you really need Hadoop's overhead, you should benchmark both ways):

mapper.sh < codes.txt

It's up to you to decide which approach works best... for the minimalist, Hadoop Streaming looks simpler:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming*.jar \
    -D mapreduce.job.reduces=0 \
    -input /input \
    -output /tmp/steaming-output \
    -mapper ~/mapper.sh

Also worth mentioning: any stdout/stderr from the tasks is collected into the YARN application logs, rather than necessarily ending up back in HDFS.
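
For example, for the failed attempt in the update above, something like this would fetch those logs (assuming log aggregation is enabled on the cluster):

yarn logs -applicationId application_1506277206531_0004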
