For some time now I have been poring over log4j logs trying to determine why my Hadoop job crashes.
Essentially, all the job does is issue a command on the underlying machine and collect that command's output; at the moment all of these steps happen in the map task (later I will attempt to reduce over the sum of these individual outputs).
The behaviour I am seeing is this: when the output spawned into the BufferedReader is, so to speak, 28 lines, everything works fine and the job completes almost instantly; but when I increase that number to 29, the map task hangs at 67% completion — it makes three attempts, always stalling at 67% — and is eventually killed for lack of progress.
On the namenode from which the job was submitted, we can see the following output:
17/10/09 15:19:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/10/09 15:19:29 INFO input.FileInputFormat: Total input files to process : 1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: number of splits:1
17/10/09 15:19:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1507562353923_0001
17/10/09 15:19:30 INFO impl.YarnClientImpl: Submitted application application_1507562353923_0001
17/10/09 15:19:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1507562353923_0001/
17/10/09 15:19:30 INFO mapreduce.Job: Running job: job_1507562353923_0001
17/10/09 15:19:36 INFO mapreduce.Job: Job job_1507562353923_0001 running in uber mode : false
17/10/09 15:19:36 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:19:53 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:30:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_0, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_0 Timed out after 600 secs
Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 6230 as user ubuntu for container container_1507562353923_0001_01_000002, result=success
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/10/09 15:30:06 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:30:26 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:40:36 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_1, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_1 Timed out after 600 secs
17/10/09 15:40:37 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:40:52 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 15:51:05 INFO mapreduce.Job: Task Id : attempt_1507562353923_0001_m_000000_2, Status : FAILED
AttemptID:attempt_1507562353923_0001_m_000000_2 Timed out after 600 secs
17/10/09 15:51:06 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 15:51:24 INFO mapreduce.Job: map 67% reduce 0%
17/10/09 16:01:37 INFO mapreduce.Job: map 100% reduce 100%
17/10/09 16:01:37 INFO mapreduce.Job: Job job_1507562353923_0001 failed with state FAILED due to: Task failed task_1507562353923_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
17/10/09 16:01:37 INFO mapreduce.Job: Counters: 13
Job Counters
Failed map tasks=4
Killed reduce tasks=1
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5025782
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2512891
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=2512891
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=10292801536
Total megabyte-milliseconds taken by all reduce tasks=0
To get at the root cause of this, I have spent a lot of time carefully going through the log4j output, but it shows nothing conclusive — for example, the logs of a successful job and an unsuccessful one look almost identical, with the caveat that an unsuccessful job terminates abruptly, as shown below.
The most interesting part seems to be here (taken from one of the earlier attempts — Hadoop always makes three attempts at completing the job):
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 214
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 450
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opAdd, Time: 217
2017-10-09 15:40:35,821 WARN [main] OpcodeCount$TokenizerMapper: Code: opStop, Time: 165
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 1977; bufvoid = 104857600
2017-10-09 15:40:35,851 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
(&container_1507562353923_0001_01_000005???stderr0stdout105912017-10-09 16:01:35
Full thread dump OpenJDK 64-Bit Server VM (25.131-b11 mixed mode):
The words "Full thread dump" led me to believe that, as described in this post, the MapOutputBuffer was filling up. I tried the suggested solution of enlarging the relevant buffer settings, i.e. increasing mapreduce.task.io.sort.mb to 1000 and mapreduce.map.sort.spill.percent to 0.99, but this had no positive effect.
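For reference, a minimal sketch of how those two settings might be applied from a programmatic driver — the class name and job name here are just illustrative, and the equivalent -D command-line options would do the same:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverConfigSketch {
    public static Job configuredJob() throws IOException {
        Configuration conf = new Configuration();
        // enlarge the map-side sort buffer (in MB) ...
        conf.setInt("mapreduce.task.io.sort.mb", 1000);
        // ... and raise the fill fraction at which the buffer spills to disk
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.99f);
        return Job.getInstance(conf, "opcode count");
    }
}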
The next thing I tried was changing the program so that it does not write its debug output through log4j's Logger but instead writes it to a file, as described here — but that did not work either.
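In spirit, that attempt looked roughly like the sketch below — the helper class name and output path are placeholders, not my actual code:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.Map;

// Hypothetical helper: append the collected (opcode, time) pairs to a local file
// instead of routing them through log4j.
public final class SideFileDebug {
    public static void dump(String path, List<Map.Entry<String, Integer>> pairList) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(path, true))) {
            for (Map.Entry<String, Integer> entry : pairList) {
                writer.write("Code: " + entry.getKey() + ", Time: " + entry.getValue());
                writer.newLine();
            }
        }
    }
}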
At this point, my ideas about what might be going wrong with the program are as follows:
1) The MapOutputBuffer is filling up, causing the program to stall.
I should mention that it is important that each file is executed as one atomic map task — if the commands in a file are broken up into smaller component chunks, the operations they issue to the underlying OS no longer make sense — hence:
2) The input file is being split up inside HDFS and is not being read in as one sequential unit in the correct order.
3) The command is being split across containers — part of it is issued in one container while another container handles another part — leading to the problem just described. (One way to test 2) and 3) is sketched after this list.)
So, since the logs have been inconclusive, my next aim is to change the code in a way that lets me isolate, test, and eliminate each of these possibilities --> any ideas about how to do that efficiently — or any and all comments, insights, and remarks on anything else I have written here — would be greatly appreciated.
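One concrete idea I have for testing and ruling out possibilities 2) and 3) is to force each input file into a single split, so that exactly one map task sees it as one sequential unit — for example with a custom input format whose isSplitable() returns false. A minimal sketch under that assumption (the class name is just illustrative, and this is not code I am currently running):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Force the whole input file into one split, so a single map task reads it
// sequentially and the commands are never spread across containers.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, regardless of file size
    }
}

The driver would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class); if the hang persists even with splitting disabled, possibilities 2) and 3) become much less likely.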
The map job is currently written like this:
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // declare logger
    private final Logger LOG = org.apache.log4j.Logger.getLogger(this.getClass());

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // output data struct
        List<Map.Entry<String, Integer>> pairList = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            try {
                // command execution
                Runtime rt = Runtime.getRuntime();
                String evmDir = "/home/ubuntu/go/src/github.com/ethereum/go-ethereum/build/bin/evm";
                String command = evmDir + " --debug --code " + value.toString() + " run";
                Process proc = Runtime.getRuntime().exec(command);
                LOG.warn(command);

                BufferedReader stdInput = new BufferedReader(new InputStreamReader(proc.getInputStream()));
                BufferedReader stdError = new BufferedReader(new InputStreamReader(proc.getErrorStream()));

                // define and initialise representation to hold 'evm' command output
                ArrayList<String> consoleOutput = new ArrayList<String>();
                ArrayList<String> debugOutput = new ArrayList<String>();

                String s = null;
                while ((s = stdInput.readLine()) != null) {
                    consoleOutput.add(s);
                }
                while ((s = stdError.readLine()) != null) {
                    debugOutput.add(s);
                }

                for (String p : consoleOutput) {
                    Pattern pattern = Pattern.compile("([A-Za-z]+)([ \t]+)(\\d+)");
                    Matcher matcher = pattern.matcher(p);
                    while (matcher.find()) {
                        String opcodeName = matcher.group(1);
                        Integer executionStepTime = Integer.valueOf(matcher.group(3));
                        // add component pieces of line to output data structure
                        pairList.add(new AbstractMap.SimpleEntry<>(opcodeName, executionStepTime));
                    }
                }
            } catch (IOException e) {
                //LOG.warn(e);
                LOG.warn("Exception Encountered! " + e);
            }

            // log output for debugging
            for (Map.Entry<String, Integer> entry : pairList) {
                String opcodeRep = entry.getKey().toString();
                Integer stepTime = entry.getValue();
                LOG.warn("Code: " + opcodeRep + ", Time: " + stepTime);
            }

            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
Additional information from the logs that may be relevant to diagnosing this problem correctly:
"VM Thread" os_prio=0 tid=0x00007f0a1007a000 nid=0x7328 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f0a10026000 nid=0x7326 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f0a10027800 nid=0x7327 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007f0a100bc800 nid=0x732f waiting on condition
JNI global references: 277
Heap
PSYoungGen total 100352K, used 7502K [0x0000000780000000, 0x0000000788a00000, 0x00000007c0000000)
eden space 94720K, 2% used [0x0000000780000000,0x0000000780216d78,0x0000000785c80000)
from space 5632K, 95% used [0x0000000788480000,0x00000007889bcbc8,0x0000000788a00000)
to space 8192K, 0% used [0x0000000787a00000,0x0000000787a00000,0x0000000788200000)
ParOldGen total 143360K, used 113628K [0x0000000700000000, 0x0000000708c00000, 0x0000000780000000)
object space 143360K, 79% used [0x0000000700000000,0x0000000706ef71d0,0x0000000708c00000)
Metaspace used 25981K, capacity 26224K, committed 26496K, reserved 1073152K
class space used 3019K, capacity 3083K, committed 3200K, reserved 1048576K
1 Answer
To understand exactly where the mapper is stuck, you can take a thread dump with jstack.
jstack ships with the JDK, and you can run it against the stuck mapper process as follows.
Step 0: Find the hostname on which the map task is running and note the task's id.
Step 1: Log in to that node and run

ps aux | grep <task_id>

and identify the process id and the username of the process whose command line starts with /usr/java/jdk/bin/java.
Step 2: su to that process owner's username.
Step 3: Export the java home and bin path [example: export JAVA_HOME=/usr/java/jdk1.7.0_67 && export PATH=$JAVA_HOME/bin:$PATH]
Step 4: Run the following, replacing <pid> with the pid obtained in Step 1:

export pid=<pid>
for i in $(seq 1 10); do echo "jstack iteration $i"; jstack $pid > /tmp/hungtask-`hostname`-${pid}.jstack.$i; sleep 5s; done
tar zcvf hungtask.tar.gz /tmp/hungtask-`hostname`-${pid}.jstack*

hungtask.tar.gz will contain ten thread dumps of the process, taken 5 seconds apart. You may need to run the script while the task is in the hung state.
After that, if you can upload hungtask.tar.gz to this thread, I can take a look and share my observations.
Also, to understand whether the process is doing frequent GC, you can try the command below:

jstat -gc -t <pid> <step>

where <pid> is the process id of the Java process to monitor and <step> is the sampling interval. You can paste the output into the site http://nix-on.blogspot.in/2015/01/java-jstat-how-to-visualize-garbage.html to see whether the process is going through excessive GC.