Cassandra: K8ssandra pod is replaying a large commit log and is unresponsive

webghufk asked on 2023-02-08 in Cassandra

We have a 3-node Cassandra 4 cluster, and at some point (I don't know why) one of the nodes started logging:

CommitLog.java:173 - Replaying /opt/cassandra/data/commitlog/CommitLog-7-1674673652744.log

followed by a long stream of similar entries.
We can see from the metrics that disk throughput was about 17 GB during this period.
Meanwhile we saw the following on the other 2 nodes (the node doing the replay was unresponsive for almost 2m):

NoSpamLogger.java:98 - /20.9.1.45:7000->prod-k8ssandra-seed-service/20.9.0.242:7000-SMALL_MESSAGES-[no-channel] failed to connect
java.nio.channels.ClosedChannelException: null
    at org.apache.cassandra.net.OutboundConnectionInitiator$Handler.channelInactive(OutboundConnectionInitiator.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
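
For scale, you can estimate how much data a restart would have to replay by summing the CommitLog-* segments on disk. A minimal Python sketch, assuming you can kubectl exec into the Cassandra container (the directory comes from the log line above):

# Hedged diagnostic sketch: total size of commit log segments on disk.
# Run inside the Cassandra container; a large total here means a long
# replay (and an unresponsive node) on the next unclean restart.
from pathlib import Path

COMMITLOG_DIR = Path("/opt/cassandra/data/commitlog")  # path from the logs above

def commitlog_backlog_bytes(directory: Path = COMMITLOG_DIR) -> int:
    """Sum the sizes of all CommitLog-* segments awaiting replay."""
    return sum(f.stat().st_size for f in directory.glob("CommitLog-*"))

if __name__ == "__main__":
    total = commitlog_backlog_bytes()
    print(f"{total / 1024**3:.1f} GiB of commit log segments on disk")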

Questions:
1. What would cause the commit logs to be replayed?
2. Can we lower the risk of this kind of node outage?

Update:

The restart of the node looks like it was initiated by K8ssandra... that would explain the replay. But what is causing the HTTP 500? I can't seem to see it:

INFO  [nioEventLoopGroup-2-2] 2023-01-25 19:07:10,694 Cli.java:617 - address=/127.0.0.6:53027 url=/api/v0/probes/liveness status=200 OK
INFO  [nioEventLoopGroup-2-1] 2023-01-25 19:07:12,698 Cli.java:617 - address=http url=/api/v0/probes/readiness status=500 Internal Server Error
INFO  [epollEventLoopGroup-38-1] 2023-01-25 19:07:20,700 Clock.java:47 - Using native clock for microsecond precision
WARN  [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,701 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x919a5c8b]'
WARN  [epollEventLoopGroup-38-2] 2023-01-25 19:07:20,703 Loggers.java:39 - [s33] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=71aac1d0), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO  [nioEventLoopGroup-2-2] 2023-01-25 19:07:20,703 Cli.java:617 - address=/127.0.0.6:51773 url=/api/v0/probes/readiness status=500 Internal Server Error
INFO  [epollEventLoopGroup-39-1] 2023-01-25 19:07:25,393 Clock.java:47 - Using native clock for microsecond precision
WARN  [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,394 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x80b52436]'
WARN  [epollEventLoopGroup-39-2] 2023-01-25 19:07:25,395 Loggers.java:39 - [s34] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=cc8ec36), trying next node (AnnotatedConnectException: connect(..) failed: Connection refused: /tmp/cassandra.sock)
INFO  [pool-2-thread-1] 2023-01-25 19:07:25,602 LifecycleResources.java:186 - Started Cassandra
h6my8fg2 1#

When Cassandra is not shut down cleanly, it doesn't get a chance to persist the contents of its memtables to disk, so when it comes back online it replays the commit logs to repopulate those memtables.
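
The flip side (your question 2): if the memtables are flushed and the node drained before the process stops, there is nothing left to replay. nodetool drain is the standard Cassandra command for that; K8ssandra normally wires an equivalent preStop step up for you, so the Python below is just an illustrative sketch of the idea, not something you should need in production:

# Illustrative sketch: flush memtables and stop accepting writes before
# shutdown so no commit log segments need replaying on restart.
# `nodetool drain` is a real Cassandra command; this wrapper is hypothetical.
import subprocess

def drain_node(host: str = "127.0.0.1", port: int = 7199) -> None:
    """Invoke `nodetool drain` against the node's JMX port (7199 by default)."""
    subprocess.run(["nodetool", "-h", host, "-p", str(port), "drain"], check=True)

if __name__ == "__main__":
    drain_node()
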
It looks like you have cause and effect mixed up. The K8ssandra operator restarted the pod because it was unresponsive -- the restart is the effect, not the cause.
You will need to review the Cassandra logs on the pod for clues as to why it became unresponsive in the first place. Based on your description, a very large commit log was replayed at restart, so I suspect there was a lot of traffic to the cluster (a large commit log is the result of lots of writes), and an overloaded node would explain why it stopped responding. Again, review the logs to confirm the cause.
K8ssandra monitors pods with liveness and readiness probes (aka health checks), and the HTTP 500 errors are most likely the result of the node being unresponsive. That would have triggered the operator to restart the pod to automatically recover it. Cheers!
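
Incidentally, the probe endpoints in your logs are served by the management-api sidecar, so you can query them yourself while a node is replaying and watch it report live-but-not-ready. A small Python sketch, assuming the sidecar listens on localhost:8080 (adjust to your deployment):

# Hedged sketch: poll the same probe endpoints the operator's health
# checks hit (paths taken from the log lines in the question).
import urllib.error
import urllib.request

MGMT_API = "http://localhost:8080"  # assumed management-api address

def probe(name: str) -> int:
    """Return the HTTP status of /api/v0/probes/<name> ('liveness' or 'readiness')."""
    try:
        with urllib.request.urlopen(f"{MGMT_API}/api/v0/probes/{name}") as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 500 while the commit log is still replaying

if __name__ == "__main__":
    for name in ("liveness", "readiness"):
        print(name, probe(name))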
