incubator-doris: FE restart fails when a follower is not alive

pftdvrlh  posted on 2022-04-22  in Java

Describe the bug

I upgraded my Doris cluster to the master version
and hit an error when restarting the FE.
The log contains the following:

2020-09-26 11:37:18,961 WARN (UNKNOWN 172.28.18.140_9010_1591588831143(-1)|1) [Catalog.notifyNewFETypeTransfer():2356] notify new FE type transfer: UNKNOWN
2020-09-26 11:37:20,967 WARN (RepNode 172.28.18.140_9010_1591588831143(-1)|56) [Catalog.notifyNewFETypeTransfer():2356] notify new FE type transfer: MASTER
2020-09-26 11:37:21,162 ERROR (stateListener|67) [EditLog.loadJournal():804] Operation Type 29
java.lang.NullPointerException: null
        at org.apache.doris.consistency.ConsistencyChecker.replayFinishConsistencyCheck(ConsistencyChecker.java:373) ~[palo-fe.jar:3.4.0]
        at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:332) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog.replayJournal(Catalog.java:2497) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog.transferToMaster(Catalog.java:1167) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog.access$1100(Catalog.java:261) [palo-fe.jar:3.4.0]
        at org.apache.doris.catalog.Catalog$4.runOneCycle(Catalog.java:2414) [palo-fe.jar:3.4.0]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:3.4.0]

To Reproduce

Steps to reproduce the behavior:

  1. Run the add follower ... command on the old version of Doris (the follower FE being added is NOT started at this point).
  2. Run bin/stop_fe.sh to stop the old version of the FE.
  3. Upgrade the files to the new FE version, e.g. lib/* and webroot/*.
  4. Run bin/start_fe.sh to start the new version of the FE.
  5. Check the log; the error shown above appears.

Workaround:

  1. Roll back to the old version.
  2. Run the drop follower ... command on the old version of Doris.
  3. Upgrade the files and restart the FE.
  4. The FE starts successfully.
  5. Run add follower ... on the new version.

Expected behavior

  1. Add a follower on the old version, whether or not the follower service is running.
  2. Upgrade the FE version.
  3. The FE restarts successfully.

ffdz8vbo  #1

This is strange. The code suggests the NullPointerException occurs because the tablet does not exist.
It is hard to debug without fe.log.
It is not a very serious problem, though: we can modify the code to simply skip this tablet.
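
The "skip this tablet" idea might look roughly like the sketch below. It is not the actual Doris code: the class ReplaySkipSketch, the Tablet and CheckResult types, and the in-memory map are hypothetical stand-ins, and the method name is borrowed from the stack trace only for illustration. The point it shows is the null check: when a replayed journal entry refers to a tablet that is no longer known, the replay logs a warning and moves on instead of throwing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the replay logic; NOT the real Doris classes.
class ReplaySkipSketch {

    // Minimal tablet state touched by a consistency-check replay (assumed fields).
    static final class Tablet {
        long lastCheckTime = -1L;
        boolean consistent = true;
    }

    // Minimal journal payload for a finished consistency check (assumed shape).
    static final class CheckResult {
        final long tabletId;
        final long checkTime;
        final boolean consistent;

        CheckResult(long tabletId, long checkTime, boolean consistent) {
            this.tabletId = tabletId;
            this.checkTime = checkTime;
            this.consistent = consistent;
        }
    }

    private final Map<Long, Tablet> tablets = new HashMap<>();

    void addTablet(long tabletId) {
        tablets.put(tabletId, new Tablet());
    }

    Tablet getTablet(long tabletId) {
        return tablets.get(tabletId);
    }

    // Applies a replayed result. Returns false (and skips) if the tablet is missing,
    // instead of dereferencing null and aborting the whole journal replay.
    boolean replayFinishConsistencyCheck(CheckResult result) {
        Tablet tablet = tablets.get(result.tabletId);
        if (tablet == null) {
            System.err.println("tablet " + result.tabletId
                    + " does not exist, skipping consistency check replay");
            return false;
        }
        tablet.lastCheckTime = result.checkTime;
        tablet.consistent = result.consistent;
        return true;
    }

    public static void main(String[] args) {
        ReplaySkipSketch sketch = new ReplaySkipSketch();
        sketch.addTablet(1L);
        // Known tablet: the result is applied.
        System.out.println(sketch.replayFinishConsistencyCheck(new CheckResult(1L, 100L, false)));
        // Unknown tablet: the entry is skipped and replay continues.
        System.out.println(sketch.replayFinishConsistencyCheck(new CheckResult(2L, 200L, true)));
    }
}
```

Returning a boolean rather than throwing lets the caller keep replaying later journal entries. Whether silently skipping is safe in the real code depends on why the tablet is missing, which this sketch does not address.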
