Flink taskmanagers in Docker Swarm do not recover

Asked by py49o6xq on 2021-06-24 in Flink

I am running Flink v1.10 in Docker Swarm with 1 jobmanager and 3 taskmanagers, without ZooKeeper. I have one job running with 12 slots, and my 3 TMs have 20 slots each (60 in total). After several tests everything went smoothly, except for one.
The failing test is this: if I cancel the job manually, a sidecar retries the job, but the taskmanagers shown in the web console do not recover and keep decreasing.
A more concrete example: I have a running job that consumes 12 of the 60 slots in total.
The web console shows 48 available slots and 3 TMs.
I cancel the job manually, the sidecar re-triggers the job, and the web console shows 36 free slots and 2 TMs.
The job goes into a FAILED state, and the slot count keeps dropping until the console shows 0 available slots and 1 TM.
The workaround is to scale all 3 TMs down and back up, after which everything returns to normal (sketched below).
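
For reference, this is the scale-down/scale-up cycle I mean (a sketch; the service name flink_taskmanager is an assumption, substitute whatever your Swarm stack calls the TM service):

    docker service scale flink_taskmanager=0   # stop all TM replicas
    docker service scale flink_taskmanager=3   # bring the 3 TMs back; they re-register with the JM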
With this configuration everything else works fine: the jobmanager recovers if I remove it, and the TMs recover if I scale them up or down. But if I cancel a job, the TMs seem to disconnect from the JM.
Any suggestions on what I am doing wrong?
Here is my flink-conf.yaml:

env.java.home: /usr/local/openjdk-8
env.log.dir: /opt/flink/
env.log.file: /var/log/flink.log
jobmanager.rpc.address: jobmanager1
jobmanager.rpc.port: 6123

jobmanager.heap.size: 2048m

# taskmanager.memory.process.size: 2048m

# env.java.opts.taskmanager: 2048m

taskmanager.memory.flink.size: 2048m

taskmanager.numberOfTaskSlots: 20

parallelism.default: 2

# ==============================================================================
# High Availability
# ==============================================================================

# The high-availability mode. Possible options are 'NONE' or 'zookeeper'.
#
high-availability: NONE
# high-availability.storageDir: file:///tmp/storageDir/flink_tmp/
# high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
# high-availability.zookeeper.quorum:

# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# high-availability.zookeeper.client.acl: open

# ==============================================================================
# Fault tolerance and checkpointing
# ==============================================================================

# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.backend.incremental: false

jobmanager.execution.failover-strategy: region

# ==============================================================================
# Rest & web frontend
# ==============================================================================

rest.port: 8080
rest.address: jobmanager1
# rest.bind-port: 8081
rest.bind-address: 0.0.0.0
# web.submit.enable: false

# ==============================================================================
# Advanced
# ==============================================================================

# io.tmp.dirs: /tmp
# classloader.resolve-order: child-first
# taskmanager.memory.network.fraction: 0.1
# taskmanager.memory.network.min: 64mb
# taskmanager.memory.network.max: 1gb

# ==============================================================================
# Flink Cluster Security Configuration
# ==============================================================================

# security.kerberos.login.use-ticket-cache: false
# security.kerberos.login.keytab: /mobi.me/flink/conf/smart3.keytab
# security.kerberos.login.principal: smart_user
# security.kerberos.login.contexts: Client,KafkaClient

# ==============================================================================
# ZK Security Configuration
# ==============================================================================

# zookeeper.sasl.login-context-name: Client

# ==============================================================================
# HistoryServer
# ==============================================================================

# jobmanager.archive.fs.dir: hdfs:///completed-jobs/
# historyserver.web.address: 0.0.0.0
# historyserver.web.port: 8082
# historyserver.archive.fs.dir: hdfs:///completed-jobs/
# historyserver.archive.fs.refresh-interval: 10000

blob.server.port: 6124
query.server.port: 6125
taskmanager.rpc.port: 6122
high-availability.jobmanager.port: 50010
zookeeper.sasl.disable: true

# recovery.mode: zookeeper
# recovery.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
# recovery.zookeeper.path.root: /
# recovery.zookeeper.path.namespace: /cluster_one
rwqw0loc 1#

The solution was to increase the metaspace size in flink-conf.yaml.
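
As a minimal sketch (assuming Flink 1.10's unified TaskManager memory model; the 256m value is an example to tune, not a recommendation), add something like this to flink-conf.yaml and restart the TMs:

    # JVM metaspace for the TaskManagers; the 1.10.0 default is only 96m and
    # repeated cancel/resubmit cycles create new user-code classloaders that
    # can exhaust it, killing TMs one by one
    taskmanager.memory.jvm-metaspace.size: 256m

If the TM logs show java.lang.OutOfMemoryError: Metaspace right before a TM drops off the web console, this is the cause.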
Br, André.
