Determining the cause of a Hadoop failure in my code, since the standard logs are inconclusive: file splitting, container memory, or block size

u2nhd7ah, posted 2021-06-02 in Hadoop

For some time now I have been going through log4j logs trying to determine the cause of my Hadoop job crashing.
Essentially, what the job tries to do is issue a command on the underlying machine and collect the output of that command; at this point all of these steps happen in the map task (later I will try to reduce over the sum of these individual outputs).
The behaviour I am seeing is this: for up to 28 spawns into the BufferedReader, so to speak, everything works fine and the job completes almost instantly, but when I increase that number to 29 the map task hangs at 67% complete. Three attempts are made, each one always stalling at 67%, and the job is eventually killed for lack of progress.
On the namenode from which the job was submitted we can see the following output:

17/10/09 15:19:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/10/09 15:19:29 INFO input.FileInputFormat: Total input files to process : 1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: number of splits:1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507562353923_0001
17/10/09 15:19:30 INFO impl.YarnClientImpl: Submitted application application_1507562353923_0001
17/10/09 15:19:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1507562353923_0001/
17/10/09 15:19:30 INFO mapreduce.Job: Running job: job_1507562353923_0001
17/10/09 15:19:36 INFO mapreduce.Job: Job job_1507562353923_0001 running in uber mode : false
17/10/09 15:19:36 INFO mapreduce.Job:  map 0% reduce 0%
17/10/09 15:19:53 INFO mapreduce.Job:  map 67% reduce 0%
17/10/09 15:30:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_0, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_0 Timed out after 600 secs
Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 6230 as user ubuntu for container container_1507562353923_0001_01_000002, result=success
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

17/10/09 15:30:06 INFO mapreduce.Job:  map 0% reduce 0%
17/10/09 15:30:26 INFO mapreduce.Job:  map 67% reduce 0%
17/10/09 15:40:36 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_1, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_1 Timed out after 600 secs
17/10/09 15:40:37 INFO mapreduce.Job:  map 0% reduce 0%
17/10/09 15:40:52 INFO mapreduce.Job:  map 67% reduce 0%
17/10/09 15:51:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_2, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_2 Timed out after 600 secs
17/10/09 15:51:06 INFO mapreduce.Job:  map 0% reduce 0%
17/10/09 15:51:24 INFO mapreduce.Job:  map 67% reduce 0%
17/10/09 16:01:37 INFO mapreduce.Job:  map 100% reduce 100%
17/10/09 16:01:37 INFO mapreduce.Job: Job job_1507562353923_0001 failed with state FAILED due to: Task failed task_1507562353923_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

17/10/09 16:01:37 INFO mapreduce.Job: Counters: 13
    Job Counters 
        Failed map tasks=4
        Killed reduce tasks=1
        Launched map tasks=4
        Other local map tasks=3
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5025782
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=2512891
        Total time spent by all reduce tasks (ms)=0
        Total vcore-milliseconds taken by all map tasks=2512891
        Total vcore-milliseconds taken by all reduce tasks=0
        Total megabyte-milliseconds taken by all map tasks=10292801536
        Total megabyte-milliseconds taken by all reduce tasks=0

To determine the root cause I have spent a lot of time carefully going through the log4j output logs, but they show nothing conclusive: a successful job and an unsuccessful one look almost identical, with the one caveat that the unsuccessful job terminates abruptly, as shown below:

The most interesting part seems to be here (taken from one of the earlier attempts; Hadoop always makes three attempts to complete the job):

2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 214
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 450
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 217
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opStop, Time: 165
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 1977; bufvoid = 104857600
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
(&container_1507562353923_0001_01_000005???stderr0stdout105912017-10-09 16:01:35
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):

The words "Full thread dump" lead me to believe that the MapOutputBuffer is filling up, as described in this post, so I tried the suggested solution of increasing its size in the configuration, i.e. raising mapreduce.task.io.sort.mb to 1000 and mapreduce.map.sort.spill.percent to 0.99, but with no positive effect.
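For reference, this is roughly how those two properties were set on the job before submission (a minimal sketch; the driver class shown here is hypothetical, and the same values could equally be passed as -D options or placed in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver fragment showing only the spill-related settings.
public class OpcodeCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // size of the in-memory map output buffer, in MB
        conf.setInt("mapreduce.task.io.sort.mb", 1000);
        // fraction of that buffer allowed to fill before spilling to disk begins
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.99f);
        Job job = Job.getInstance(conf, "opcode count");
        // ... remainder of the job setup (mapper class, input/output paths) as before
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}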
The next thing I tried was changing the program so that it does not print through the log4j Log but instead writes its output to a file, as described here, but that did not work either.
At the moment my thinking on what could be going wrong with this program is as follows:

  1) The MapOutputBuffer is filling up, causing the program to stall.
    (I should mention here that it is important for a file to be executed as one atomic map task: if the commands in the file are broken into smaller component blocks, the operations it issues against the underlying OS no longer make sense. Hence the next two possibilities.)
  2) The input file is being broken up inside HDFS and cannot be read in as one sequential unit, in the correct order.
  3) The commands are being split across containers, with part of a command issued in one container while another container handles the rest, leading to exactly the problem just described.
    So, since the logs have been inconclusive, my next goal is to change the code in some way that lets me isolate, test and eliminate each of these possibilities (for example, something like the sketch just after this list for ruling out input splitting). Any ideas on how to do that effectively, or indeed any and all comments, insights and remarks on anything else written here, would be greatly appreciated.
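As one way to test possibilities 2) and 3), the whole input file could be forced into a single split so that exactly one mapper reads it sequentially. A minimal sketch, assuming the job currently uses the default TextInputFormat (the class name WholeFileTextInputFormat is made up here):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format that refuses to split its input files,
// so each file is read sequentially by exactly one map task.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one file goes to one map task
    }
}

It would then be registered in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class). If the hang disappears with this in place, splitting is implicated; if it persists (which the "number of splits:1" line in the submission log already hints at), possibilities 2) and 3) can probably be ruled out.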
    The map task is currently written like this:
// (snippet of the job class; the usual imports are assumed: java.io.*, java.util.*,
//  java.util.regex.*, org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
//  org.apache.hadoop.mapreduce.Mapper, org.apache.log4j.Logger)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // declare logger
    private final Logger LOG = org.apache.log4j.Logger.getLogger(this.getClass());

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

      // output data struct
      List<Map.Entry<String,Integer>> pairList = new ArrayList<>();

      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
          try {
              // command execution
              String evmDir = "/home/ubuntu/go/src/github.com/ethereum/go-ethereum/build/bin/evm";
              String command = evmDir + " --debug --code " + value.toString() + " run";
              Process proc = Runtime.getRuntime().exec(command);

              LOG.warn(command);

              BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
              BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));

              // define and initialise representation to hold 'evm' command output
              ArrayList<String> consoleOutput = new ArrayList<String>();
              ArrayList<String> debugOutput   = new ArrayList<String>();

              String s = null;
              while ((s = stdInput.readLine()) != null) {
                  consoleOutput.add(s);
              }
              while ((s = stdError.readLine()) != null) {
                  debugOutput.add(s);
              }

              for (String p : consoleOutput) {
                  Pattern pattern = Pattern.compile("([A-Za-z]+)([ \t]+)(\\d+)");
                  Matcher matcher = pattern.matcher(p);
                  while (matcher.find()) {
                      String opcodeName = matcher.group(1);
                      Integer executionStepTime = Integer.valueOf(matcher.group(3));
                      // add component pieces of line to output data structure
                      pairList.add(new AbstractMap.SimpleEntry<>(opcodeName, executionStepTime));
                  }
              }

          } catch (IOException e) {
              //LOG.warn(e);
              LOG.warn("Exception Encountered! " + e);
          }

          // log output for debugging
          for (Map.Entry<String, Integer> entry : pairList) {
              String opcodeRep = entry.getKey().toString();
              Integer stepTime = entry.getValue();
              LOG.warn("Code: " + opcodeRep + ", Time: " + stepTime);
          }

          word.set(itr.nextToken());
          context.write(word, one);
      }
    }
  }
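One more observation on the failure mode: every attempt dies with "Timed out after 600 secs", which is the default mapreduce.task.timeout, and the map method above blocks in the readLine() loops without reporting any progress in the meantime. A minimal sketch of keeping the attempt alive while the external command runs (the helper class HeartbeatRunner is made up here, and it only treats the symptom rather than explaining the hang):

import org.apache.hadoop.util.Progressable;

// Hypothetical helper: runs a blocking piece of work (e.g. the exec and
// readLine() loops above) while a daemon thread calls progress() every
// 30 seconds so the attempt is not killed by mapreduce.task.timeout.
public class HeartbeatRunner {
    public static void runWithHeartbeat(Progressable context, Runnable blockingWork)
            throws InterruptedException {
        Thread heartbeat = new Thread(() -> {
            try {
                while (true) {
                    context.progress();   // tell the framework the task is still alive
                    Thread.sleep(30_000L);
                }
            } catch (InterruptedException ignored) {
                // stop reporting once the blocking work has finished
            }
        });
        heartbeat.setDaemon(true);
        heartbeat.start();
        try {
            blockingWork.run();
        } finally {
            heartbeat.interrupt();
            heartbeat.join();
        }
    }
}

Inside map() it could be called as HeartbeatRunner.runWithHeartbeat(context, () -> { /* exec the command and read its output */ }); alternatively mapreduce.task.timeout could simply be raised, although that would only mask the stall.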

Additional information from the logs that may be relevant for correctly diagnosing this:

"VM Thread" os_prio=0 tid=0x00007f0a1007a000 nid=0x7328 runnable 

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f0a10026000 nid=0x7326 runnable 

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f0a10027800 nid=0x7327 runnable 

"VM Periodic Task Thread" os_prio=0 tid=0x00007f0a100bc800 nid=0x732f waiting on condition 

JNI global references: 277

Heap
 PSYoungGen      total 100352K, used 7502K [0x0000000780000000, 0x0000000788a00000, 0x00000007c0000000)
  eden space 94720K, 2% used [0x0000000780000000,0x0000000780216d78,0x0000000785c80000)
  from space 5632K, 95% used [0x0000000788480000,0x00000007889bcbc8,0x0000000788a00000)
  to   space 8192K, 0% used [0x0000000787a00000,0x0000000787a00000,0x0000000788200000)
 ParOldGen       total 143360K, used 113628K [0x0000000700000000, 0x0000000708c00000, 0x0000000780000000)
  object space 143360K, 79% used [0x0000000700000000,0x0000000706ef71d0,0x0000000708c00000)
 Metaspace       used 25981K, capacity 26224K, committed 26496K, reserved 1073152K
  class space    used 3019K, capacity 3083K, committed 3200K, reserved 1048576K
ncecgwcz (answer 1):

To understand exactly where the mapper is stuck, you can take a thread dump with jstack.
jstack ships with the JDK; you can use it on the stuck mapper process as follows.
Step 0: find the hostname on which the map task is running and note the task's id.
Step 1: log in to that node and run
ps aux | grep <task_id>
to identify the process id and the username of the process whose command line starts with /usr/java/jdk/bin/java.
Step 2: su to that process owner's username.
Step 3: export JAVA_HOME and the bin path [example: export JAVA_HOME=/usr/java/jdk1.7.0_67 && export PATH=$JAVA_HOME/bin:$PATH]
Step 4: with PID replaced by the pid obtained in step 1, run:
export PID=<pid from step 1>; for i in $(seq 1 10); do echo "jstack iteration $i"; jstack $PID > /tmp/hungtask-`hostname`-${PID}.jstack.$i; sleep 5s; done; tar zcvf hungtask.tar.gz /tmp/hungtask-`hostname`-${PID}.jstack*
hungtask.tar.gz will contain thread dumps of the process, taken 10 times at 5-second intervals. You may need to run this while the task is in the hung state.
After that, if you can upload hungtask.tar.gz to this thread, I can take a look and share my observations.
Also, to understand whether the process is doing frequent GC, you can try the command below:
jstat -gc -t <pid> <step>
where <pid> is the process id of the Java process to monitor and <step> is the sampling interval.
You can paste the output into http://nix-on.blogspot.in/2015/01/java-jstat-how-to-visualize-garbage.html to see whether it is undergoing excessive GC.
