How to fix a corrupted translog in a dockerized Elasticsearch instance

5f0d552i · published 2023-03-07 in ElasticSearch

tl;dr: Running the elasticsearch-shard utility against a dockerized Elasticsearch instance does not seem possible. If that is true, how can we fix the occasional corrupted-translog errors that crash ES?

I have had Elasticsearch (ES) running nicely locally via docker using docker-compose for some time now, but today when I started it up it began crashing with this error:
TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-175.tlog] is corrupted] (see the end of the post for the full error message).
Some googling revealed that this issue can be solved by running the utility bin/elasticsearch-shard remove-corrupted-data. The problem is that in order to run this utility ES must be shut down, but in order for the container hosting the ES instance to stay alive, ES needs to be running. This means there is no way to access elasticsearch-shard to fix the issue inside the environment where the data and the Elasticsearch instance actually live.
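For reference, the invocation would look roughly like this outside of docker. This is a sketch based on the ES 7 reference (check `elasticsearch-shard --help` for your version); the index name is taken from the error logs below, and the shard id is assumed to be 0:

```shell
# Elasticsearch MUST be stopped before running this tool.
# Either address the shard by index name and shard id ...
bin/elasticsearch-shard remove-corrupted-data --index application_log --shard-id 0

# ... or point it at the shard's translog directory directly.
bin/elasticsearch-shard remove-corrupted-data \
  --dir /usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/
```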
I have verified that the container won't stay alive by stopping ES from within its shell, like so:

## get into the docker container
docker exec -it 43146ff2a50c bash
## kill elasticsearch
pkill -f elasticsearch

and it immediately kills the container and kicks me out of the shell.
I tried to see whether another docker container with access to the same data volumes, but not based on an ES image (so that it could stay alive while ES is off), could run the utility and fix the data on disk. I made a new docker-compose entry with a new Dockerfile, kept all the settings the same, but based the build on an Ubuntu image (ignore the environment variables except ES_01_DATA_VOLUME; they aren't relevant):

docker-compose.yml

es01-truncate-corrupted-shards:
        build:
            context: .
            dockerfile: Elasticsearch.TruncateCorruptedShards.Dockerfile
            args:
                - CERTS_DIR=${CERTS_DIR}
        container_name: es01-truncate-corrupted-shards
        environment:
            - node.name=es01
            - cluster.name=es-docker-cluster
            - discovery.seed_hosts=es02,es03
            - cluster.initial_master_nodes=es01,es02,es03
            - bootstrap.memory_lock=true
            - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
            - xpack.license.self_generated.type=basic 
            - xpack.security.enabled=true
            - xpack.security.http.ssl.enabled=true 
            - xpack.security.http.ssl.key=$CERTS_DIR/es01/es01.key
            - xpack.security.http.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
            - xpack.security.http.ssl.certificate=$CERTS_DIR/es01/es01.crt
            - xpack.security.transport.ssl.enabled=true 
            - xpack.security.transport.ssl.verification_mode=certificate 
            - xpack.security.transport.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
            - xpack.security.transport.ssl.certificate=$CERTS_DIR/es01/es01.crt
            - xpack.security.transport.ssl.key=$CERTS_DIR/es01/es01.key
        ulimits:
            memlock:
                soft: -1
                hard: -1
        volumes:
            - ${ES_01_DATA_VOLUME}
            - ${CERTS_VOLUME}
        ports:
            - ${ES_01_PORT}
        mem_limit: ${SINGLE_NODE_MEM_LIMIT}

Elasticsearch.TruncateCorruptedShards.Dockerfile

FROM ubuntu:rolling

RUN apt-get update \
    && apt-get install --yes curl gnupg \
    && curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add - \
    && echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list \
    && apt-get update \
    && apt-get install --yes elasticsearch

RUN /usr/share/elasticsearch/bin/elasticsearch-shard remove-corrupted-data

When I run this it installs everything correctly and attempts to run the utility, but then it errors like so:

#6 1.265     WARNING: Elasticsearch MUST be stopped before running this tool.
#6 1.265
#6 1.360 Exception in thread "main" ElasticsearchException[no node folder is found in data folder(s), node has not been started yet?]
#6 1.363    at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:148)
#6 1.363    at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:168)
#6 1.363    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77)
#6 1.363    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
#6 1.363    at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:95)
#6 1.363    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
#6 1.363    at org.elasticsearch.cli.Command.main(Command.java:77)
#6 1.363    at org.elasticsearch.index.shard.ShardToolCli.main(ShardToolCli.java:24)

which leads me to believe that despite having access to the ES_01_DATA_VOLUME volume, it knows that an instance hasn't been set up in this container. (Worth noting: the RUN step executes at image build time, before any volumes are mounted, so the data wouldn't be visible to the tool at that point anyway.)
Ultimately I'm not too concerned with how the corrupted translog gets fixed as long as it's possible, but it seems to me that with these constraints of the docker environments it's not possible. Do I need to install ES on the host machine, point it at the data files, and have it modify them? That seems like the same idea as the second non-ES-container trick I tried, and so will fail. Also, it defeats the purpose of the containerized environment.
I am stumped and would be so grateful for any help. It's hard to imagine that fixing something like corrupted data files wouldn't be possible, or that it would be overlooked by the ES team!

Full error message from ES:

{"type": "server", "timestamp": "2022-07-28T22:40:49,356Z", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "es-docker-cluster", "node.name": "es01", "message": "[application_log][0] marking and sending shard failed due to [shard failure, reason [failed to recover from translog]]", "cluster.uuid": "W-cXJOamQw-XU8LyZ9ZUoA", "node.id": "SBUMvZCRRTaZvhhVqmm9sQ" ,
es01     | "stacktrace": ["org.elasticsearch.index.engine.EngineException: failed to recover from translog",

and

{"type": "server", "timestamp": "2022-07-28T22:40:49,361Z", "level": "WARN", "component": "o.e.c.r.a.AllocationService", "cluster.name": "es-docker-cluster", "node.name": "es03", "message": "failing shard [failed shard, shard [plant_pod_application_log][0], node[SBUMvZCRRTaZvhhVqmm9sQ], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=_fVgJKMpSymo_mGd-QvRxQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2022-07-28T22:40:46.444Z], failed_attempts[4], failed_nodes[[VqFl_rNTRnyoHHgVJdIBhQ, SBUMvZCRRTaZvhhVqmm9sQ]], delayed=false, details[failed shard on node [VqFl_rNTRnyoHHgVJdIBhQ]: failed recovery, failure RecoveryFailedException[[plant_pod_application_log][0]: Recovery failed on {es03}{VqFl_rNTRnyoHHgVJdIBhQ}{qLLppr6pTrCa8-lCFhz1NA}{172.22.0.5}{172.22.0.5:9300}{dilmrt}{ml.machine_memory=5175267328, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-187.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [16592762] length: [4] end: [16592762]]; ], allocation_status[fetching_shard_data]], message [shard failure, reason [failed to recover from translog]], failure [EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-175.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [16592762] length: [4] end: [16592762]]; ], markAsStale [true]]", "cluster.uuid": "W-cXJOamQw-XU8LyZ9ZUoA", "node.id": "VqFl_rNTRnyoHHgVJdIBhQ" ,
es03     | "stacktrace": ["org.elasticsearch.index.engine.EngineException: failed to recover from translog",

I know that these are logged as WARNINGs, but they are the only wrong-looking output, and when I ping the cluster to check its health I get this payload:

{"cluster_name":"es-docker-cluster","status":"red","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":20,"active_shards":30,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":2,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":93.75}

As well as Kibana not loading results, just spinning forever.

tv6aics1 answered:

Seven months late, but I'm currently using a Docker instance to try to repair a bunch of corrupted indices outside a cluster and ran into the same thing, so I'm documenting it for posterity, and for future me when I run into this again :-)
The fix is simple: run the container, but override the entrypoint. That is, add -it to the switches (interactive container) and append /bin/bash after the image name. You will end up in a bash shell inside a freshly started container, instead of a running ES.
Then you can run /usr/local/bin/docker-entrypoint.sh to start ES, kill it with Ctrl-C, and you will be back in the bash shell. The container will not exit until you exit bash, so you are now free to run elasticsearch-shard or any other tool you need, start ES again to call the reroute API, and so on.
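A minimal sketch of that workflow; the image tag and volume name here are placeholders, so substitute the ones from your compose file:

```shell
# Start the ES image, but drop into bash instead of launching Elasticsearch.
docker run -it --rm \
  -v es01_data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.9 \
  /bin/bash

# Inside the container: the data volume is mounted but ES is NOT running,
# so the shard tool can safely work on the files on disk.
bin/elasticsearch-shard remove-corrupted-data \
  --dir /usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/

# Optionally start ES in the foreground to verify, then Ctrl-C back to bash.
/usr/local/bin/docker-entrypoint.sh
```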
One more thing I ran into: run elasticsearch-shard as the elasticsearch user, because if you run it as root it will create the new translog files owned by root, and ES will not be able to reroute the shards.
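A sketch of both ways around that, assuming the elasticsearch user that ships in the official images (verify with `id elasticsearch` in your container first); the index name and shard id are taken from the error logs above:

```shell
# Run the tool as the elasticsearch user instead of root ...
su elasticsearch -s /bin/bash -c \
  'bin/elasticsearch-shard remove-corrupted-data --index application_log --shard-id 0'

# ... or repair ownership of translog files already created as root.
chown -R elasticsearch:root /usr/share/elasticsearch/data
```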
