I am trying to run a Flink job cluster (1.8.1) in a Kubernetes environment. I built a Docker image containing my job JAR by following this documentation, and created the job, the JobManager, and the TaskManagers using the example kube files. The problem is that the TaskManagers cannot connect to the JobManager and keep crashing.
While debugging the JobManager logs, I found that jobmanager.rpc.address is bound to "localhost", even though I passed it as an ARG in the kube file. I also tried setting jobmanager.rpc.address through the FLINK_ENV_JAVA_OPTS environment variable:
env:
- name: FLINK_ENV_JAVA_OPTS
value: "-Djobmanager.rpc.address=flink-job-cluster"
JobManager console log:
Starting the job-cluster
Starting standalonejob as a console application on host flink-job-cluster-bbxrn.
2019-07-16 17:31:10,759 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-07-16 17:31:10,760 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneJobClusterEntryPoint (Version: <unknown>, Rev:4caec0d, Date:03.04.2019 @ 13:25:54 PDT)
2019-07-16 17:31:10,760 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - IcedTea - 1.8/25.212-b04
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 989 MiBytes
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-1.8-openjdk/jre
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - No Hadoop Dependency available
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms1024m
2019-07-16 17:31:10,761 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx1024m
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=flink-job-cluster
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink-1.8.1/conf/log4j-console.properties
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink-1.8.1/conf/logback-console.xml
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /opt/flink-1.8.1/conf
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --job-classname
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - wikiedits.WikipediaAnalysis
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - flink-job-cluster
2019-07-16 17:31:10,762 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=flink-job-cluster
2019-07-16 17:31:10,763 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dparallelism.default=2
2019-07-16 17:31:10,763 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dblob.server.port=6124
2019-07-16 17:31:10,763 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dqueryable-state.server.ports=6125
2019-07-16 17:31:10,763 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: /opt/flink-1.8.1/lib/log4j-1.2.17.jar:/opt/flink-1.8.1/lib/slf4j-log4j12-1.7.15.jar:/opt/flink-1.8.1/lib/wiki-edits-0.1.jar:/opt/flink-1.8.1/lib/flink-dist_2.11-1.8.1.jar:::
2019-07-16 17:31:10,763 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-07-16 17:31:10,764 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-07-16 17:31:10,850 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2019-07-16 17:31:10,851 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-07-16 17:31:10,851 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-07-16 17:31:10,851 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-07-16 17:31:10,851 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-07-16 17:31:10,851 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
The log above shows that jobmanager.rpc.address is bound to localhost instead of flink-job-cluster. I assume the TaskManagers' messages are being dropped because the JobManager's RPC endpoint is bound to localhost:6123:
2019-07-16 17:31:12,546 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 38190f2570cd5f0a0a47f65ddf7aae1f with allocation id 97af00eae7e3dfb31a79232077ea7ee6.
2019-07-16 17:31:14,043 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@flink-job-cluster:6123/]] arriving at [akka.tcp://flink@flink-job-cluster:6123] inbound addresses are [akka.tcp://flink@localhost:6123]
2019-07-16 17:31:26,564 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@flink-job-cluster:6123/]] arriving at [akka.tcp://flink@flink-job-cluster:6123] inbound addresses are [akka.tcp://flink@localhost:6123]
I am not sure why the JobManager is binding to localhost.
PS: It is not that the TaskManager pods fail to resolve the flink-job-cluster host; the hostname resolves to the service IP address.
1 Answer
The root cause was that the jobmanager.rpc.address argument value was not being applied: for some reason, args passed as a single inline string were not picked up into Flink's global configuration. Passing the same args as a multi-line list worked correctly.
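The working form can be sketched as follows. This is a minimal, hypothetical fragment of the job-cluster container spec (the surrounding Deployment/Job fields are omitted); the argument values are taken from the Program Arguments shown in the log above:

```yaml
# Sketch of the fix, assuming the standard flink-job-cluster example manifest.
# Each argument is its own YAML list item, so the container runtime passes
# -Djobmanager.rpc.address=flink-job-cluster to the entrypoint as a separate
# argument instead of one long string.
args:
  - "job-cluster"
  - "--job-classname"
  - "wikiedits.WikipediaAnalysis"
  - "-Djobmanager.rpc.address=flink-job-cluster"
  - "-Dparallelism.default=2"
  - "-Dblob.server.port=6124"
  - "-Dqueryable-state.server.ports=6125"
```

By contrast, an inline form such as `args: ["job-cluster --job-classname wikiedits.WikipediaAnalysis -Djobmanager.rpc.address=flink-job-cluster"]` hands the entrypoint a single argument containing spaces, so the -D dynamic properties never reach Flink's configuration and jobmanager.rpc.address stays at the localhost default from flink-conf.yaml.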