Apache Storm: the whole job hangs after 2~3 days for an unknown reason

m4pnthwp, posted on 2021-06-21 in Storm

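I recently submitted a Storm (0.9.5) job written in Python (2.7.6) using the multilang protocol. The bolt class initially inherited from BasicBolt (which acks automatically), and I did not set max.spout.pending.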

class SnifferSpout(storm.Spout):
    def __init__(self):
        ...

class MonitorBolt(storm.BasicBolt):
    def __init__(self):
        ...

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sniffer", new SnifferSpout(), 1);
builder.setBolt("relation", new MonitorBolt(), 3).shuffleGrouping("sniffer");

conf.setDebug(false);
conf.setNumWorkers(4);
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());

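It works, but the whole job starts to hang after 2~3 days.
I don't know why. I noticed that during the first few hours the process latency is incredibly high compared with the execute latency.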

The process latency also sometimes climbs extremely high (for example, 36456 ms).
In addition, I found the following in one worker's log:

2015-10-31T12:14:30.784+0000 b.s.s.ShellSpout [ERROR] Halting process: ShellSpout died.
java.lang.RuntimeException: subprocess heartbeat timeout
    at backtype.storm.spout.ShellSpout$SpoutHeartbeatTimerTask.run(ShellSpout.java:261) [storm-core-0.9.5.jar:0.9.5]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_79]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_79]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_79]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
2015-10-31T12:14:30.795+0000 b.s.d.executor [ERROR] 
java.lang.RuntimeException: subprocess heartbeat timeout
    at backtype.storm.spout.ShellSpout$SpoutHeartbeatTimerTask.run(ShellSpout.java:261) [storm-core-0.9.5.jar:0.9.5]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_79]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_79]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_79]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
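
For readers unfamiliar with this error: the stack trace shows a periodic timer task (SpoutHeartbeatTimerTask) raising the exception. The sketch below is a toy model of that kind of check, not Storm's actual implementation. It assumes the parent process remembers when it last heard from the subprocess and gives up once the gap exceeds a timeout. The class name, method names, and the 30-second value are hypothetical and chosen only for illustration.

```python
class HeartbeatWatchdog(object):
    """Toy model of a parent process watching a child for signs of life."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.last_heartbeat = None  # time of the last message from the child

    def record_message(self, now):
        # Any message read from the subprocess counts as a heartbeat.
        self.last_heartbeat = now

    def check(self, now):
        # Called periodically by a timer; raises once the child is silent too long.
        if self.last_heartbeat is None:
            return
        if now - self.last_heartbeat > self.timeout_secs:
            raise RuntimeError("subprocess heartbeat timeout")


# Usage with explicit timestamps (seconds), so the example is deterministic.
watchdog = HeartbeatWatchdog(timeout_secs=30)  # hypothetical timeout value
watchdog.record_message(now=0)
watchdog.check(now=10)       # still within the timeout, no error
try:
    watchdog.check(now=31)   # child silent for 31 s, longer than the timeout
except RuntimeError as err:
    timeout_message = str(err)
```

Under this model, a subprocess that blocks for a long time without writing anything back to its parent would look the same as a dead one. Whether that is what happened here cannot be determined from the log alone.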

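I suspected the problem was caused by OOM, so I checked memory after the job hung and found 3 Java processes, each consuming about 22% of main memory. The Python processes consumed only 1.x% of memory each.
I could not be sure that memory was the problem, so I tried using Bolt instead of BasicBolt to remove the ack, and set max spout pending to 200.
Now something worse happens: the job consumes memory very quickly (available memory fell from 2.83 GB to 480 MB in about 10 minutes), and the executors restart roughly every 10 minutes because of OOM.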
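
As background for the max spout pending setting mentioned above: Storm documents it as a cap on the number of tuples a spout task has emitted that are not yet acked or failed, and notes that it has no effect on tuples emitted without a message id. The following is a toy simulation of that rule, not Storm code. The function name, tick model, and rates are assumptions made up for illustration, and it does not claim to reproduce the topology in this question.

```python
from collections import deque


def simulate(ticks, emit_per_tick, process_per_tick, max_pending, tracked):
    """Return the queue length after each tick of a toy spout/bolt pipeline.

    tracked=True:  tuples carry a message id, so the pending cap applies.
    tracked=False: tuples are untracked, so the cap is ignored.
    """
    queue = deque()
    pending = 0  # tracked tuples emitted but not yet acked
    history = []
    for _tick in range(ticks):
        # Spout side: emit tuples, stopping early if the cap is reached.
        for _i in range(emit_per_tick):
            if tracked and pending >= max_pending:
                break
            queue.append(object())
            if tracked:
                pending += 1
        # Bolt side: process (and, when tracked, ack) a limited number of tuples.
        for _j in range(min(process_per_tick, len(queue))):
            queue.popleft()
            if tracked:
                pending -= 1
        history.append(len(queue))
    return history


# The spout emits faster (10 per tick) than the bolt can process (3 per tick).
capped = simulate(100, 10, 3, max_pending=200, tracked=True)
uncapped = simulate(100, 10, 3, max_pending=200, tracked=False)
```

In this model the tracked run levels off below the cap of 200, while the untracked run keeps growing by 7 tuples per tick. It only illustrates why the setting depends on tuples being tracked and acked; it is not a diagnosis of the memory growth described here.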

Can anyone help find the root cause of all this?
