Hadoop YARN tasks fail with no error message

Asked by fivyi3re on 2021-05-29, in Hadoop

I am running a TeraSort of 10,000,000 records on the latest Hadoop 2.7.3. However, the YARN tasks always fail unless only one NodeManager is running in the cluster. Moreover, there is no explicit error message or exception in stdout or in the logs; the only information is Task Id: attempt_x Status: FAILED, and in the NodeManager log the container / task attempt suddenly transitions from a running state to a failed state.
The solutions I found on the Internet, such as disabling the task timeout or adjusting the memory configuration, do not help. In those cases there were error messages or exceptions in the logs, whereas my situation is completely different.
So what could be causing this, and is there any way to fix it?
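For reference, the timeout / memory tuning I tried was roughly along these lines (the exact values below are only illustrative of that kind of adjustment, not my real settings):

# Disable the task timeout and raise container memory for one TeraSort run
hadoop jar hadoop-mapreduce-examples-2.6.4.jar terasort \
    -Dmapreduce.task.timeout=0 \
    -Dmapreduce.map.memory.mb=2048 \
    -Dmapreduce.reduce.memory.mb=4096 \
    -Dmapreduce.map.java.opts=-Xmx1638m \
    -Dmapreduce.reduce.java.opts=-Xmx3276m \
    terasort/10m-input terasort/10m-output

None of these variations changed the outcome.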
The stdout is as follows:

hadoop jar hadoop-mapreduce-examples-2.6.4.jar terasort terasort/10m-input terasort/10m-output
16/09/06 20:51:40 INFO terasort.TeraSort: starting
16/09/06 20:51:42 INFO input.FileInputFormat: Total input paths to process : 2
Spent 201ms computing base-splits.
Spent 4ms computing TeraScheduler splits.
Computing input splits took 206ms
Sampling 8 splits of 8
Making 1 from 100000 sampled records
Computing parititions took 902ms
Spent 1112ms computing partitions.
16/09/06 20:51:43 INFO client.RMProxy: Connecting to ResourceManager at brix-409b52/192.168.1.29:8032
16/09/06 20:51:44 INFO mapreduce.JobSubmitter: number of splits:8
16/09/06 20:51:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473165713268_0002
16/09/06 20:51:44 INFO impl.YarnClientImpl: Submitted application application_1473165713268_0002
16/09/06 20:51:44 INFO mapreduce.Job: The url to track the job: http://brix-409b52:8088/proxy/application_1473165713268_0002/
16/09/06 20:51:44 INFO mapreduce.Job: Running job: job_1473165713268_0002
16/09/06 20:51:55 INFO mapreduce.Job: Job job_1473165713268_0002 running in uber mode : false
16/09/06 20:51:55 INFO mapreduce.Job:  map 0% reduce 0%
16/09/06 20:52:11 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000003_0, Status : FAILED
16/09/06 20:52:12 INFO mapreduce.Job:  map 4% reduce 0%
16/09/06 20:52:12 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000000_0, Status : FAILED
16/09/06 20:52:13 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000001_0, Status : FAILED
16/09/06 20:52:15 INFO mapreduce.Job:  map 6% reduce 0%
16/09/06 20:52:18 INFO mapreduce.Job:  map 8% reduce 0%
16/09/06 20:52:23 INFO mapreduce.Job:  map 13% reduce 0%
16/09/06 20:52:26 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000003_1, Status : FAILED
16/09/06 20:52:28 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000000_1, Status : FAILED
16/09/06 20:52:29 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000001_1, Status : FAILED
16/09/06 20:52:32 INFO mapreduce.Job:  map 21% reduce 0%
16/09/06 20:52:35 INFO mapreduce.Job:  map 25% reduce 0%
16/09/06 20:52:42 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000003_2, Status : FAILED
16/09/06 20:52:44 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000005_0, Status : FAILED
16/09/06 20:52:44 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_m_000000_2, Status : FAILED
16/09/06 20:52:45 INFO mapreduce.Job:  map 29% reduce 0%
16/09/06 20:52:48 INFO mapreduce.Job:  map 31% reduce 0%
16/09/06 20:52:51 INFO mapreduce.Job:  map 32% reduce 0%
16/09/06 20:52:54 INFO mapreduce.Job:  map 33% reduce 0%
16/09/06 20:52:55 INFO mapreduce.Job: Task Id : attempt_1473165713268_0002_r_000000_0, Status : FAILED
16/09/06 20:52:56 INFO mapreduce.Job:  map 38% reduce 0%
16/09/06 20:52:59 INFO mapreduce.Job:  map 100% reduce 100%
16/09/06 20:53:00 INFO mapreduce.Job: Job job_1473165713268_0002 failed with state FAILED due to: Task failed task_1473165713268_0002_m_000003
Job failed as tasks failed. failedMaps:1 failedReduces:0

16/09/06 20:53:01 INFO mapreduce.Job: Counters: 42
    File System Counters
        FILE: Number of bytes read=418759260
        FILE: Number of bytes written=837878859
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=402653508
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0

In addition, here is part of the detailed NodeManager log, which shows a container / task attempt failing:

2016-09-06 21:05:33,015 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Adding container_1473165713268_0003_01_000012 to application application_1473165713268_0003
2016-09-06 21:05:33,015 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473165713268_0003_01_000012 transitioned from NEW to LOCALIZING
2016-09-06 21:05:33,015 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1473165713268_0003
2016-09-06 21:05:33,015 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1473165713268_0003
2016-09-06 21:05:33,015 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
2016-09-06 21:05:33,015 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1473165713268_0003
2016-09-06 21:05:33,016 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473165713268_0003_01_000012 transitioned from LOCALIZING to LOCALIZED
2016-09-06 21:05:33,027 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1473165713268_0003_000001 (auth:SIMPLE)
2016-09-06 21:05:33,036 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1473165713268_0003_01_000012
2016-09-06 21:05:33,036 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=excelle08    IP=127.0.0.1    OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1473165713268_0003    CONTAINERID=container_1473165713268_0003_01_000012
2016-09-06 21:05:33,039 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473165713268_0003_01_000012 transitioned from LOCALIZED to KILLING
2016-09-06 21:05:33,039 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1473165713268_0003_01_000012
2016-09-06 21:05:33,039 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1473165713268_0003_01_000012 not launched. No cleanup needed to be done
2016-09-06 21:05:33,055 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1473165713268_0003_01_000012 not launched as cleanup already called
2016-09-06 21:05:33,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473165713268_0003_01_000012 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-09-06 21:05:33,061 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=excelle08    OPERATION=Container Finished - Killed   TARGET=ContainerImpl    RESULT=SUCCESS  APPID=application_1473165713268_0003    CONTAINERID=container_1473165713268_0003_01_000012
2016-09-06 21:05:33,061 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473165713268_0003_01_000012 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-09-06 21:05:33,061 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1473165713268_0003_01_000012 from application application_1473165713268_0003
2016-09-06 21:05:33,061 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1473165713268_0003
2016-09-06 21:05:34,064 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1473165713268_0003_01_000011]
2016-09-06 21:05:34,473 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1473165713268_0003_01_000011
2016-09-06 21:05:34,474 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1473165713268_0003_01_000012
2016-09-06 21:05:34,481 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 14187 for container-id container_1473165713268_0003_01_000001: 291.4 MB of 2 GB physical memory used; 2.7 GB of 6 GB virtual memory used
2016-09-06 21:05:35,067 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1473165713268_0003_01_000012]
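To double-check that the failed attempts really produce no error output, I also pulled the per-container logs. A sketch of what I looked at (assuming log aggregation is enabled and the default local log directory; paths may differ depending on yarn.nodemanager.log-dirs):

# Aggregated logs for the whole application (requires yarn.log-aggregation-enable=true)
yarn logs -applicationId application_1473165713268_0002 > app_0002.log

# Or inspect the raw container logs (stdout, stderr, syslog) on the NodeManager host
ls $HADOOP_HOME/logs/userlogs/application_1473165713268_0002/

Neither shows any exception or error for the failed attempts.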
