Elasticsearch 7.4 erroneously complains that a snapshot is already running

vsikbqxv  posted 2021-06-13 in ElasticSearch

After resolving some issues in my Elasticsearch 7.4 cluster, where reads were timing out and getting slower and slower, there is still something off in the cluster. Whenever I run the snapshot command it gives me a 503, and when I run it again once or twice it suddenly goes ahead and creates a snapshot. The opster.com online tool hinted at snapshots not being configured, but when I run the verify command it suggests, everything looks fine.

$ curl -s -X POST 'http://127.0.0.1:9201/_snapshot/elastic_backup/_verify?pretty'
{
  "nodes" : {
    "JZHgYyCKRyiMESiaGlkITA" : {
      "name" : "elastic7-1"
    },
    "jllZ8mmTRQmsh8Sxm8eDYg" : {
      "name" : "elastic7-4"
    },
    "TJJ_eHLIRk6qKq_qRWmd3w" : {
      "name" : "elastic7-3"
    },
    "cI-cn4V3RP65qvE3ZR8MXQ" : {
      "name" : "elastic7-2"
    }
  }
}

But then:

curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.11.27]  a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.11.27]  a snapshot is already running"
  },
  "status" : 503
}

Could it be that one of the 4 nodes thinks a snapshot is already running, and that this task is randomly assigned to one of the nodes, so that after a few attempts the snapshot eventually gets created? If so, how can I find out which node claims that a snapshot is already running?
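
For what it's worth, the in-progress snapshot bookkeeping lives in the cluster state held by the elected master, not on a random data node, so a couple of quick checks can show what the cluster thinks is running (a sketch; the filter_path used on the cluster state is an assumption about how the in-progress entries are exposed on 7.x):

# Snapshots the cluster currently considers to be running (empty = none)
$ curl -s 'http://127.0.0.1:9201/_snapshot/_status?pretty'

# Which node is the elected master, i.e. whose view of "running" matters
$ curl -s 'http://127.0.0.1:9201/_cat/master?v'

# Assumption: in-progress snapshots also show up as a top-level "snapshots"
# key in the cluster state
$ curl -s 'http://127.0.0.1:9201/_cluster/state?pretty&filter_path=snapshots'
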
Also, I noticed that the heap on one of the nodes is much higher than on the others; what is normal heap usage?

$ curl -s http://127.0.0.1:9201/_cat/nodes?v
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.1.215           59          99   7    0.38    0.38     0.36 dilm      -      elastic7-1
10.0.1.218           32          99   1    0.02    0.17     0.22 dilm      *      elastic7-4
10.0.1.212           11          99   1    0.04    0.17     0.21 dilm      -      elastic7-3
10.0.1.209           36          99   3    0.42    0.40     0.36 dilm      -      elastic7-2
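
For reference, the same _cat/nodes endpoint can print absolute heap figures next to the percentage, which makes it easier to judge whether a value around 60% is actually a problem (heap.current and heap.max are standard _cat/nodes columns):

$ curl -s 'http://127.0.0.1:9201/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'

JVM heap in Elasticsearch normally saw-tooths up and down between garbage collections, so a point-in-time 59% on a single node is not by itself unusual; a node that sits permanently above roughly 75% would be more of a concern.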

Last night it happened again, while I am certain nothing else was being snapshotted, so this time I ran the following commands to confirm the odd behaviour; at this point, at least, I did not expect to get this error.

$ curl http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty
{
  "snapshots" : [ ]
}
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.03]  a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.03]  a snapshot is already running"
  },
  "status" : 503
}

When I run it a second (and sometimes a third) time, it suddenly creates a snapshot.
Note that when I don't re-run it, that second or third snapshot never shows up, so I am 100% certain no snapshot is running when this error occurs.
As far as I can tell, no SLM is configured:

{ }
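
(For completeness: the empty object above is presumably the output of the SLM policy listing, which on 7.4 would be a call along these lines.)

$ curl -s 'http://127.0.0.1:9201/_slm/policy?pretty'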

The repository is configured correctly:

$ curl http://127.0.0.1:9201/_snapshot/elastic_backup?pretty
{
  "elastic_backup" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "elastic_backup"
    }
  }
}

Also, in the configuration it is mapped to the same folder, an NFS mount (amazonfs); it is available and accessible, and new data shows up on it after a successful snapshot.
As part of the cronjob I have now also added a query to _cat/tasks?v, hoping that tonight we will see more, because just now when I ran the command manually it went through without a problem:

$ curl localhost:9201/_cat/tasks?v ; curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty' ; curl localhost:9201/_cat/tasks?v     
action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node                                                        
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:15885091 -                               transport 1607068277045 07:51:17  209.6micros  10.0.1.215 elastic7-1                                                  
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24278976 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277044 07:51:17  62.7micros   10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15885092 JZHgYyCKRyiMESiaGlkITA:15885091 direct    1607068277045 07:51:17  57.4micros   10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23773565 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277045 07:51:17  84.7micros   10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3418325  JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277046 07:51:17  56.9micros   10.0.1.209 elastic7-2                                               
{                                                                                                                                                                  
  "snapshot" : {                                                                                                                                                   
    "snapshot" : "snapshot-2020.12.04",                                                                                                                            
    "uuid" : "u2yQB40sTCa8t9BqXfj_Hg",                                                                                                                                                                          
    "version_id" : 7040099,                                                                                                                                        
    "version" : "7.4.0",                                                                                                                                           
    "indices" : [                                                                                                                                                  
        "log-db-1-2020.06.18-000003",
        "log-db-2-2020.02.19-000002",
        "log-db-1-2019.10.25-000001",
        "log-db-3-2020.11.23-000002",
        "log-db-3-2019.10.25-000001",
        "log-db-2-2019.10.25-000001",
        "log-db-1-2019.10.27-000002"                                                                                                                              
    ],                                                                                                                                                             
    "include_global_state" : true,                                                                                                                                                                              
    "state" : "SUCCESS",                                                                                                                                           
    "start_time" : "2020-12-04T07:51:17.085Z",                                                                                                                                                                  
    "start_time_in_millis" : 1607068277085,                                                                                                                        
    "end_time" : "2020-12-04T07:51:48.537Z",                                                                                                                        
    "end_time_in_millis" : 1607068308537,                                                                                                                                 
    "duration_in_millis" : 31452,                                                                                                                                         
    "failures" : [ ],                                                                                                                                                     
    "shards" : {                                                                                                                                                          
      "total" : 28,                                                                                                                                                       
      "failed" : 0,                                                                                                                                                       
      "successful" : 28                                                                                                                                                   
    }                                                                                                                                                                     
  }                                                                                                                                                                       
}                                                                                                                                                                                                               
action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node                                                     
indices:data/read/search       JZHgYyCKRyiMESiaGlkITA:15888939 -                               transport 1607068308987 07:51:48  2.7ms        10.0.1.215 elastic7-1
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:15888942 -                               transport 1607068308990 07:51:48  223.2micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24282763 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308989 07:51:48  61.5micros   10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15888944 JZHgYyCKRyiMESiaGlkITA:15888942 direct    1607068308990 07:51:48  78.2micros   10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23777841 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308990 07:51:48  63.3micros   10.0.1.218 elastic7-4                                             
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3422139  JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308991 07:51:48  60micros     10.0.1.209 elastic7-2

Last night (2020-12-12) during the cron run I had it execute the following commands:

curl localhost:9201/_cat/tasks?v
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
curl localhost:9201/_cat/tasks?v
sleep 1 
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'

Their output was as follows:

action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:78016838 -                               transport 1607736001255 01:20:01  314.4micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228580 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001254 01:20:01  66micros     10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806094 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01  74micros     10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016839 JZHgYyCKRyiMESiaGlkITA:78016838 direct    1607736001255 01:20:01  94.3micros   10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582174 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01  73.6micros   10.0.1.209 elastic7-2
node_name  name     active queue rejected
elastic7-2 snapshot      0     0        0
elastic7-4 snapshot      0     0        0
elastic7-1 snapshot      0     0        0
elastic7-3 snapshot      0     0        0
{            
  "error" : {       
    "root_cause" : [
      {                                            
        "type" : "concurrent_snapshot_execution_exception",                                                                                      
        "reason" : "[elastic_backup:snapshot-2020.12.12]  a snapshot is already running"
      }
    ],                                         
    "type" : "concurrent_snapshot_execution_exception",                                                                                      
    "reason" : "[elastic_backup:snapshot-2020.12.12]  a snapshot is already running"
  },            
  "status" : 503
}
action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node
cluster:monitor/nodes/stats    JZHgYyCKRyiMESiaGlkITA:78016874 -                               transport 1607736001632 01:20:01  39.6ms       10.0.1.215 elastic7-1
cluster:monitor/nodes/stats[n] TJJ_eHLIRk6qKq_qRWmd3w:82228603 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001631 01:20:01  39.2ms       10.0.1.212 elastic7-3
cluster:monitor/nodes/stats[n] jllZ8mmTRQmsh8Sxm8eDYg:55806114 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01  39.5ms       10.0.1.218 elastic7-4
cluster:monitor/nodes/stats[n] cI-cn4V3RP65qvE3ZR8MXQ:63582204 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01  39.4ms       10.0.1.209 elastic7-2
cluster:monitor/nodes/stats[n] JZHgYyCKRyiMESiaGlkITA:78016875 JZHgYyCKRyiMESiaGlkITA:78016874 direct    1607736001632 01:20:01  39.5ms       10.0.1.215 elastic7-1
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:78016880 -                               transport 1607736001671 01:20:01  348.9micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016881 JZHgYyCKRyiMESiaGlkITA:78016880 direct    1607736001671 01:20:01  188.6micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228608 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001671 01:20:01  106.2micros  10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582209 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01  96.3micros   10.0.1.209 elastic7-2
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806120 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01  97.8micros   10.0.1.218 elastic7-4
node_name  name     active queue rejected
elastic7-2 snapshot      0     0        0
elastic7-4 snapshot      0     0        0
elastic7-1 snapshot      0     0        0
elastic7-3 snapshot      0     0        0
{
  "snapshot" : {
    "snapshot" : "snapshot-2020.12.12",
    "uuid" : "DgwuBxC7SWirjyVlFxBnng",
    "version_id" : 7040099,
    "version" : "7.4.0",
    "indices" : [
      "log-db-sbr-2020.06.18-000003",
      "log-db-other-2020.02.19-000002",
      "log-db-sbr-2019.10.25-000001",
      "log-db-trace-2020.11.23-000002",
      "log-db-trace-2019.10.25-000001",
      "log-db-sbr-2019.10.27-000002",
      "log-db-other-2019.10.25-000001"
    ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-12-12T01:20:02.544Z",
    "start_time_in_millis" : 1607736002544,
    "end_time" : "2020-12-12T01:20:27.776Z",
    "end_time_in_millis" : 1607736027776,
    "duration_in_millis" : 25232,
    "failures" : [ ],
    "shards" : {
      "total" : 28,
      "failed" : 0,
      "successful" : 28
    }
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}

Also, the cluster is currently green, the management queue is not backed up, and everything seems fine.
Furthermore, there is only one repository:

curl http://127.0.0.1:9201/_cat/repositories?v
id             type
elastic_backup   fs

agyaoht7 1#

Elasticsearch 7.4 supports only one snapshot operation at a time.
From the error it looks like a previously triggered snapshot was still running when you triggered the new one, which is why Elasticsearch raised concurrent_snapshot_execution_exception.
You can check the list of currently running snapshots with GET /_snapshot/elastic_backup/_current.
I would suggest first checking with that API whether any snapshot operation is running in the cluster, and only triggering a new snapshot when nothing is currently running.
P.S.: From Elasticsearch 7.7 onwards concurrent snapshots are also supported, so if you plan to perform concurrent snapshot operations in your cluster you should upgrade to ES 7.7 or later.
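
A minimal sketch of such a guard around the existing cron command (same repository name as above; the check simply greps the _current response for a "state" field, which is absent when nothing is running):

# Only create a new snapshot if the repository reports nothing in progress.
running=$(curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_current' | grep -c '"state"')
if [ "$running" -eq 0 ]; then
  curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
else
  echo "a snapshot is still running, skipping this run"
fi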

tuwxkamq 2#

So the cause of the problem turned out to be a recent upgrade to Docker 19.03.6 together with going from 1x Docker Swarm manager + 4x Docker Swarm workers to 5x Docker Swarm managers + 4x Docker Swarm workers; in both setups Elastic ran on the workers. With this upgrade/change we saw the number of network interfaces inside the containers change, and because of that we had to get things working again with "publish_host" in Elastic.
To fix this we had to stop publishing the Elastic port on the ingress network, so that the extra network interface disappeared. After that we could remove the "publish_host" setting, which already made things better. But to really fix our problem we had to change the Docker Swarm deploy endpoint_mode to dnsrr, so that traffic no longer goes through the Docker Swarm routing mesh.
We had always had "connection reset by peer" issues, but since that change they got worse and caused strange problems in Elasticsearch. I guess running Elasticsearch in Docker Swarm (or anything Kubernetes-like) can be a hard thing to debug.
Using tcpdump inside the containers and conntrack -S on the hosts we could see perfectly healthy connections being reset for no apparent reason. Another solution would be to have the kernel drop the mismatched packets (instead of sending resets), but avoiding DNAT/SNAT as much as possible, as in this case, also seems to have solved the problem.
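
For illustration only (the service name elasticsearch and the port numbers are placeholders, not taken from the setup above), the two changes described here map onto standard docker service update options roughly like this:

# Drop the ingress-mesh publish, re-publish in host mode if external access is
# still needed, and switch the service from the routing-mesh VIP to DNS
# round-robin so traffic no longer passes through ingress DNAT/SNAT.
docker service update \
  --publish-rm 9200 \
  --publish-add mode=host,target=9200,published=9201 \
  --endpoint-mode dnsrr \
  elasticsearch

In a stack file the equivalent would be endpoint_mode: dnsrr under deploy and mode: host on the port mapping.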
