Flink on YARN: unable to recover after a node is terminated

nvbavucw | posted 2021-06-21 in Flink

We are running Flink on YARN. As part of a disaster recovery test, we manually terminated one of the nodes running the Flink application. After the instance came back up, the application made several restart attempts, and every attempt failed while localizing the application files from HDFS with the following error:

AM Container for appattempt_1602902099413_0006_000027 exited with exitCode: -1000
Failing this attempt. Diagnostics: Could not obtain block: BP-986419965-xx.xx.xx.xx-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-986419965-10.61.71.85-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
    at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1053)
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1036)
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1015)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:647)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:926)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:982)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:90)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:64)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:125)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
For more detailed output, check the application tracking page: http://<>.compute.internal:8088/cluster/app/application_1602902099413_0006 Then click on links to logs of each attempt.
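
Our first guess is that the only replica of that staging file lived on the terminated node. We have not verified this yet; a check along these lines (the path is copied from the error above) should show the replication factor and block locations of the staged files:

hdfs fsck /user/hadoop/.flink/application_1602902099413_0006 -files -blocks -locations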

Can someone tell us what content Flink stores in HDFS, and whether it can be redirected to S3?
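
For reference, one option we are considering is pointing Flink's YARN staging directory at S3 instead of HDFS. A minimal flink-conf.yaml sketch, assuming the yarn.staging-directory option is available in our Flink version and an S3 filesystem plugin is on the classpath (the bucket name is a placeholder):

# flink-conf.yaml (sketch; bucket name is a placeholder)
yarn.staging-directory: s3://my-bucket/flink-staging
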
Adding our checkpoint-related settings:

// Imports added for completeness (Flink 1.x APIs used below)
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;

// Incremental RocksDB checkpoints written to S3
StateBackend rocksDbStateBackend = new RocksDBStateBackend("s3://Path", true);
streamExecutionEnvironment.setStateBackend(rocksDbStateBackend);
streamExecutionEnvironment.enableCheckpointing(10000); // checkpoint every 10 s
streamExecutionEnvironment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
streamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointTimeout(60000);
streamExecutionEnvironment.getCheckpointConfig().setMaxConcurrentCheckpoints(60000); // 60000 looks unintended here; 1 is the usual value
streamExecutionEnvironment.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
streamExecutionEnvironment.getCheckpointConfig().setPreferCheckpointForRecovery(true);
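
Since we retain checkpoints on cancellation, our fallback once YARN exhausts its attempts would be to resubmit the job from the latest externalized checkpoint. A sketch of the command, with a placeholder checkpoint path and jar name (the chk-<n> directory is created by Flink under the checkpoint URI configured above):

# resubmit from a retained checkpoint (path and jar are placeholders)
flink run -d -s s3://Path/<job-id>/chk-123 my-flink-job.jar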
