使用through()vs tostream()+to()重用流

pcww981p 于 2021-06-06 发布在 Kafka

关注(0)|答案(1)|浏览(286)

我想知道在.tostream（）+.to（）的流引用上使用.through（）重用流的区别
使用.through（） KStream<String, String> subStream = mainStream .groupByKey(..) .aggregate(..) .toStream(..); .through("aggregate-topic", ..); // Then use the (new) stream from .through() to create another topic 与使用.tostream（）+.to（）相比 KStream<String, String> subStream = mainStream .groupByKey(..) .aggregate(..) .toStream(..); subStream.to("aggregate-topic", ..); //reuse the existing subStream from toStream() to create another topic 我实现了一个使用后者的特性，因为在我学习through（）方法之前，这是有意义的。
我现在很好奇的是，两种选择都有内在的原因；选择一个选项而不是另一个选项有什么好处/缺点吗？

apache-kafka apache-kafka-streams

来源：https://stackoverflow.com/questions/53112384/kafka-streams-reusing-streams-using-through-vs-tostream-to

1条答案

按热度按时间

6ioyuze21#

是的，有区别和不同的权衡：
1）第一个版本使用 through() 将创建“线性平面”，并将拓扑拆分为两个子拓扑。请注意 through("topic") 确切地说 to("topic") 加 builder.stream("topic") .

mainStream -> grp -> agg -> toStream -> to -> TOPIC -> builder.stream -> subStream

第一个子拓扑将来自 mainStream 至 to() ; 这个 "aggregate-topic" 将其与第二个子拓扑分离，第二个子拓扑由 builder.stream() 并注入 subStream . 这意味着，当所有数据都写入 "aggregate-topic" 先读，然后再读。这将增加端到端处理延迟，并增加额外读取操作的代理负载。优点是，两个子拓扑可以独立地并行化。它们的并行性是独立的，由它们相应的输入主题分区的数量决定。这将创建更多的任务，从而允许更多的并行性，因为这两个子拓扑可以在不同的线程上执行。
2）第二个版本将创建一个“分支计划”，并作为一个子拓扑执行：

mainStream -> grp -> agg -> toStream -+-> to -> TOPIC
                                      |
                                      + -> subStream

之后 toStream() 数据在逻辑上广播到两个下游操作符。这意味着，没有往返通过 "aggregate-topic" 但记录会在内存中转发给 subStream . 这减少了端到端延迟，并且不需要从kafka集群读回数据。但是，您可以减少任务，从而降低最大并行性。

赞(0）回复(0）举报 2021-06-07

我来回答

使用through()vs tostream()+to()重用流

1条答案

相关问题

热门标签

最新问答