ApacheFlink资源管理器(AppMaster)在空闲超时后第一个任务管理器被杀死后,无法分配新的任务管理器

332nm8kg  于 2021-06-26  发布在  Mesos
关注(0)|答案(0)|浏览(263)

我在ApacheMesos 1.8.1上使用ApacheFlink1.9.1,在3节点集群上对这两个节点使用或多或少的标准设置。第一个节点(192.168.10.11)运行mesos主机和flink应用程序主机。其他两个节点(192.168.10.13、192.168.10.14)都是运行mesos代理的从属节点,可以访问flink worker docker容器。flink是从docker内部运行的,但mesos master和代理是在主机上本地运行的。
一开始一切正常。我提交给flink的第一批作业在app master上被正确地执行,资源管理器与mesos协商成功,资源管理器框架在其中一个从属节点上得到一个新的worker任务。任务成功完成,客户端程序得到结果。
几分钟后,resourcemanager超时,flink worker任务(docker容器)由于空闲超时而终止。到现在为止,一直都还不错。
但在那一刻之后,如果我以与第一次完全相同的方式再次提交同一个作业,资源管理器将不做任何操作,也不会尝试创建任何新的worker。作业等待5分钟,超时后失败。
如果有人能告诉我哪里出了问题,我会很高兴的。
以下是flink应用程序主机和mesos主机的相关日志摘录:
flink应用程序主程序的启动

2019-10-24 06:36:05,191 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  Starting MesosSessionClusterEntrypoint (Version: 1.9.1, Rev:4d56de8, Date:30.09.2019 @ 11:32:19 CST)
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  OS current user: flink
2019-10-24 06:36:07,045 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.222-b10
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Maximum heap size: 2169 MiBytes
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JAVA_HOME: (not set)
2019-10-24 06:36:07,519 INFO  ....entrypoint.ClusterEntrypoint         -  Hadoop version: 2.8.3
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  JVM Options:
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog.file=/usr/local/lib/flink-1.9.1/log/flink--mesos-appmaster-deb-node1.log
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/usr/local/lib/flink-1.9.1/conf/log4j.properties
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/usr/local/lib/flink-1.9.1/conf/logback.xml
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  Program Arguments: (none)
(...)
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:07,526 INFO  ....entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-10-24 06:36:07,574 INFO  ....configuration.GlobalConfiguration            - Loading configuration property
----
properties summarized :
----
jobmanager.rpc.address, 192.168.10.11
jobmanager.rpc.port, 6123
jobmanager.heap.size, 1024m
taskmanager.heap.size, 1024m
taskmanager.numberOfTaskSlots, 2
parallelism.default, 2
state.checkpoints.dir, file:///var/lib/flink/data/ckeckpoints
state.savepoints.dir, file:///var/lib/flink/data/savepoints
jobmanager.execution.failover-strategy, region
rest.port, 8081
rest.bind-address, 0.0.0.0
jobmanager.archive.fs.dir, file:///var/lib/flink/completed_jobs/
historyserver.archive.fs.dir, file:///var/lib/flink/completed_jobs/
mesos.resourcemanager.tasks.container.type, docker
mesos.resourcemanager.tasks.container.image.name, eskimo:flink-worker
mesos.resourcemanager.tasks.bootstrap-cmd, export FLINK_HOME=/usr/local/lib/flink/
mesos.resourcemanager.tasks.container.volumes, /var/log/flink:/var/log/flink:RW,/var/lib/flink:/var/lib/flink:RW,/etc:/host_etc:RO
mesos.resourcemanager.tasks.cpus, 2
mesos.resourcemanager.tasks.mem, 1024
mesos.master, zk://192.168.10.11:2181/mesos
----
2019-10-24 06:36:07,753 INFO  ....entrypoint.ClusterEntrypoint         - Starting MesosSessionClusterEntrypoint.
(...)
2019-10-24 06:36:09,393 INFO  ....runtime.rpc.akka.AkkaRpcServiceUtils         - Trying to start actor system at 192.168.10.11:6123
(...)
(bla bla bla akka starting and all the rest)
(...)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - Registering as new framework.
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Mesos Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Master URL: zk://192.168.10.11:2181/mesos
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Framework Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     ID: (none)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Name: Flink
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Failover Timeout (secs): 604800.0
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Role: *
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Capabilities: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Principal: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Host: 192.168.10.11
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Web UI: http://192.168.10.11:8081
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,877 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@192.168.10.11:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
(...)
2019-10-24 06:36:15,714 INFO  ....mesos.scheduler.ConnectionMonitor            - Connecting to Mesos...
2019-10-24 06:36:15,764 INFO  ....MesosResourceManager  - Mesos resource manager started.
(...)
2019-10-24 06:36:46,033 INFO  ....mesos.scheduler.ConnectionMonitor            - Connected to Mesos as framework ID e1432c43-f540-4d53-9cc7-b06e0f2a6525-0000.

在flink应用程序主程序启动后立即登录mesos主程序

I1024 06:36:52.096225    19 master.cpp:2920] Received SUBSCRIBE call for framework 'Flink' at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.097496    25 master.cpp:3001] Subscribing framework Flink with checkpointing enabled and capabilities [  ]
I1024 06:36:52.098603    25 master.cpp:10463] Adding framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with roles {  } suppressed
I1024 06:36:52.099602    18 hierarchical.cpp:373] Added framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:36:52.101519    18 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O1, 8977e508-0698-4c40-9317-cdc3cc8982f4-O2 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.185779    20 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.186064    25 hierarchical.cpp:1346] Suppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000

在第一次提交作业时登录flink应用程序主控程序(工作正常的作业)

2019-10-24 06:37:18,800 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Received JobGraph submission 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,804 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Submitting job 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,889 INFO  ....runtime.rpc.akka.AkkaRpcService     - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_0 .
2019-10-24 06:37:18,933 INFO  ....JobMaster                           - Initializing job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305).
(...)
2019-10-24 06:37:19,160 INFO  ....jobmaster.JobManagerRunner          - JobManager runner for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0.
2019-10-24 06:37:19,175 INFO  ....JobMaster                           - Starting execution of job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) under job master id 00000000000000000000000000000000.
2019-10-24 06:37:19,177 INFO  ....ExecutionGraph                      - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state CREATED to RUNNING.
2019-10-24 06:37:19,191 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from CREATED to SCHEDULED.
2019-10-24 06:37:19,229 INFO  ....SlotPoolImpl                        - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}]
2019-10-24 06:37:19,265 INFO  ....JobMaster                           - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:37:19,272 INFO  ....JobMaster                           - Resolved ResourceManager address, beginning registration
2019-10-24 06:37:19,273 INFO  ....JobMaster                           - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:37:19,284 INFO  ....MesosResourceManager                - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:19,296 INFO  (...)
2019-10-24 06:37:19,305 INFO  ....JobMaster                           - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:37:19,306 INFO  ....SlotPoolImpl                        - Requesting new slot [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:37:19,311 INFO  ....MesosResourceManager                - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 50934ae259b85a449e9a789f82287305 with allocation id 907586ba9ece54f20c23341276e6bac8.
2019-10-24 06:37:19,313 INFO  ....MesosResourceManager                - Starting a new worker.
2019-10-24 06:37:19,458 INFO  ....MesosResourceManager                - Scheduling Mesos task taskmanager-00001 with (1024.0 MB, 2.0 cpus).
2019-10-24 06:37:19,473 INFO  ....mesos.scheduler.LaunchCoordinator   - Now gathering offers for at least 1 task(s).
2019-10-24 06:37:19,495 INFO  ....mesos.scheduler.LaunchCoordinator   - Received offer(s) of 9432.0 MB, 4.0 cpus, 0.0 gpus:
2019-10-24 06:37:19,499 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:19,500 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4 from 192.168.10.14 of 4793.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:20,517 INFO  ....mesos.scheduler.LaunchCoordinator   - Processing 1 task(s) against 2 new offer(s) plus outstanding offers.
2019-10-24 06:37:20,578 INFO  ....mesos.scheduler.LaunchCoordinator   - Resources considered: (note: expired offers not deducted from below)
2019-10-24 06:37:20,587 INFO  ....mesos.scheduler.LaunchCoordinator   -   192.168.10.13 has 4639.0 MB, 2.0 cpus, null gpus
2019-10-24 06:37:20,841 INFO  ....MesosResourceManager                - Launching Mesos task taskmanager-00001 on host 192.168.10.14.
2019-10-24 06:37:20,841 INFO  ....mesos.scheduler.LaunchCoordinator   - Launched 1 task(s) on 192.168.10.14 using 1 offer(s):
2019-10-24 06:37:20,842 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4
2019-10-24 06:37:20,843 INFO  ....mesos.scheduler.LaunchCoordinator   - No longer gathering offers; all requests fulfilled.
2019-10-24 06:37:20,844 INFO  com.netflix.fenzo.TaskScheduler         - Expiring all leases
2019-10-24 06:37:20,846 INFO  ....mesos.scheduler.LaunchCoordinator   - Declined offer 8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus.
2019-10-24 06:37:26,251 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 is running.
2019-10-24 06:37:29,575 INFO  ....MesosResourceManager                - Registering TaskManager with ResourceID taskmanager-00001 (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) at ResourceManager
2019-10-24 06:37:29,693 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from SCHEDULED to DEPLOYING.
(...)
(everything going on fine)
(...)
2019-10-24 06:37:30,041 INFO  ....ExecutionGraph             - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state RUNNING to FINISHED.
2019-10-24 06:37:30,076 INFO  ....StandaloneDispatche r      - Job 50934ae259b85a449e9a789f82287305 reached globally terminal state FINISHED.
2019-10-24 06:37:30,205 INFO  ....FsJobArchivist             - Job 50934ae259b85a449e9a789f82287305 has been archived at file:/var/lib/flink/completed_jobs/50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:30,206 INFO  ....JobMaster                  - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019(50934ae259b85a449e9a789f82287305).
2019-10-24 06:37:30,226 INFO  ....SlotPoolImpl               - Suspending SlotPool.
2019-10-24 06:37:30,227 INFO  ....JobMaster                  - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:37:30,227 INFO  ....SlotPoolImpl               - Stopping SlotPool.
2019-10-24 06:37:30,232 INFO  ....MesosResourceManager  - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305 from the resource manager.
2019-10-24 06:37:30,232 INFO  ....jobmaster.JobManagerRunner  - JobManagerRunner already shutdown.

在Mesos中对应的时间:

I1024 06:37:19.475060    23 hierarchical.cpp:1387] Unsuppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.475097    23 hierarchical.cpp:1409] Revived roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.476991    19 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O3, 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.845321    19 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
(...)
I1024 06:37:20.856938    23 master.cpp:4590] Processing ACCEPT call for offers: [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.860610    20 master.cpp:4151] Adding task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:20.861378    20 master.cpp:5569] Launching task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"ports","ranges":{"range":[{"begin":31000,"end":31001}]},"type":"RANGES"}] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) on  new executor
I1024 06:37:22.788302    24 master.cpp:8703] Status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:22.788444    24 master.cpp:8760] Forwarding status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:22.790689    24 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_STARTING, status update state: TASK_STARTING)
(...)
I1024 06:37:26.241962    23 master.cpp:8703] Status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:26.242094    23 master.cpp:8760] Forwarding status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:26.243048    23 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)

几分钟后,taskmanager的空闲超时开始,taskmanager被终止

2019-10-24 06:38:21,649 INFO  ....MesosResourceManager                - Stopping worker taskmanager-00001.
2019-10-24 06:38:21,672 INFO  ....MesosResourceManager                - Closing TaskExecutor connection taskmanager-00001 because: TaskExecutor exceeded the idle timeout.
2019-10-24 06:38:21,760 WARN  ....MesosResourceManager                - Discard registration from TaskExecutor taskmanager-00001 at (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) because the framework did not recognize it
2019-10-24 06:38:26,674 INFO  ....mesos.scheduler.TaskMonitor         - Re-attempting to kill Mesos task taskmanager-00001.
2019-10-24 06:38:26,891 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@192.168.10.14:31001] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-10-24 06:38:27,318 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 exited as planned.
2019-10-24 06:38:27,326 INFO  ....MesosResourceManager                - Worker taskmanager-00001 finished successfully with message: Container exited with status 137

mesos中的相应日志:

I1024 06:38:26.678206    20 master.cpp:6261] Processing KILL call for task 'taskmanager-00001' of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:26.678450    20 master.cpp:6339] Telling agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) to kill task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:27.304287    20 master.cpp:8703] Status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:38:27.304426    20 master.cpp:8760] Forwarding status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
(...)
I1024 06:38:27.319533    19 master.cpp:11770] Removing task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)

app master中的第二个作业提交(保持挂起并在超时后停止而不进行任何处理的作业)

2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Received JobGraph submission 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Submitting job 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,221 INFO  ....AkkaRpcService          - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_1 .
2019-10-24 06:39:25,225 INFO  ....JobMaster               - Initializing job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,228 INFO  ....JobMaster               - Using restart strategy NoRestartStrategy for Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,230 INFO  ....ExecutionGraph          - Job recovers via failover strategy: New Pipelined Region Failover
2019-10-24 06:39:25,231 INFO  ....JobMaster               - Running initialization on master for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,232 INFO  ....JobMaster               - Successfully ran initialization on master in 1 ms.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Start building failover regions.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Created 1 failover regions.
2019-10-24 06:39:25,235 INFO  ....jobmaster.JobManagerRunner           - JobManager runner for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1.
2019-10-24 06:39:25,236 INFO  ....JobMaster               - Starting execution of job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) under job master id 00000000000000000000000000000000.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - Job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) switched from state CREATED to RUNNING.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from CREATED to SCHEDULED.
2019-10-24 06:39:25,237 INFO  ....SlotPoolImpl            - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}]
2019-10-24 06:39:25,238 INFO  ....JobMaster               - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Resolved ResourceManager address, beginning registration
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:39:25,240 INFO  ....MesosResourceManager  - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:39:25,245 INFO  ....JobMaster               - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:39:25,246 INFO  ....SlotPoolImpl            - Requesting new slot [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:39:25,246 INFO  ....MesosResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 4888c165c1a9eae4dd69cd533c66a5c1 with allocation id 04a6f120eed690c9c7cdc8c7c0f67ed7.

在这个阶段,原木会停留几分钟。在Mesos中没有任何活动。
在我看来,问题显然出在flink应用程序主/资源管理器上,因为mesos中再次出现了问题。此时mesos主日志中没有任何活动(mesos控制台没有显示任何内容,没有提交或执行新任务,两个代理都没有执行任何操作)。
几分钟后发生超时,作业在失败状态下停止。
超时后第二个作业失败

2019-10-24 06:44:25,245 INFO  ....ExecutionGraph        - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) org.apache.flink.api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from SCHEDULED to FAILED.
.... NoResourceAvailableException: Could not allocate enough slots to run the job. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:460)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    (stack traces goes on)
2019-10-24 06:44:25,261 INFO  ....executiongraph...AdaptedRestartPipelinedRegionStrategyNG  - Fail to pass the restart strategy validation in region failover. Fallback to fail global.
(...)
(a few more stack traces)
(...)
2019-10-24 06:44:25,296 INFO  ....ExecutionGraph                   - Discarding the results produced by task execution d7d19ea639a4adf9246f905f1765b338.
2019-10-24 06:44:25,296 INFO  ....SlotPoolImpl                     - Pending slot request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] timed out.
2019-10-24 06:44:25,301 INFO  ....dispatcher.StandaloneDispatcher  - Job 4888c165c1a9eae4dd69cd533c66a5c1 reached globally terminal state FAILED.
2019-10-24 06:44:25,340 INFO  ....history.FsJobArchivist           - Job 4888c165c1a9eae4dd69cd533c66a5c1 has been archived at file:/var/lib/flink/completed_jobs/4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:44:25,341 INFO  ....JobMaster                        - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019(4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Suspending SlotPool.
2019-10-24 06:44:25,344 INFO  ....JobMaster                        - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Stopping SlotPool.
2019-10-24 06:44:25,344 INFO  ....jobmaster.JobManagerRunner       - JobManagerRunner already shutdown.
2019-10-24 06:44:25,344 INFO  ....mesos....MesosResourceManager    - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1 from the resource manager.

为了记录在案,我提交的工作是一个简单的批HelloWorld从http://diegoreico.com/environments/runningflinkgclusterwithzeppelin/
更新时间:2019-10-24-下午1:00
提供以下具有调试级别的日志提取:
flink应用程序主启动:http://www.niceideas.ch/flink/1_flink_app_master_startup.txt
第一个作业执行(成功):https://www.niceideas.ch/flink/2_first_job_execution_successful.txt
任务管理器空闲超时终止:https://www.niceideas.ch/flink/3_task_manager_idle_kill.txt
第二个作业执行(失败):https://www.niceideas.ch/flink/4_second_job_execution_failed.txt
更新时间:2019-10-24-下午5:30
任务管理器(flink worker)的日志可在以下位置获得:https://www.niceideas.ch/flink/5_task_manger_logs.txt

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题