Does keyBy partition the DataStream across parallel tasks in Flink (Scala)?

628mspwn posted on 2021-06-21 in Flink


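I want to apply a ProcessFunction() to the input data stream in Flink, so that each incoming element is processed with a single cache object. My code looks something like this: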

object myJob extends FlinkJob {
  // Cache object used while processing incoming elements
  private val myCache = InMemoryCache()

  private def updateCache(myCache, someValue): Boolean = { /* some code */ }

  private def getValue(myCache, someKey): Boolean = { /* some code */ }

  def run(params, executionEnv): Unit = {
    val myStream = executionEnv.getStream()

    val processedStream = myStream.process(new ProcessFunction {
      def processElement(value, context, collector): Unit = {
        // Update cache
        // Collect updated event
      }
    })

    processedStream.write()
  }
}

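When I parallelize this job, I assume that each parallel instance of the job will have its own cacheObject, so a single cache key might be present in multiple cacheObjects. However, I want there to be a single cache entry for a particular key; that is, all records corresponding to a particular key must be processed by a single instance and a single cacheObject. Would using keyBy() on myStream ensure that all incoming events with the same key are processed by a single parallel task/instance of the Flink job, and hence also by a single cacheObject?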
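To make the property being asked about concrete, here is a minimal plain-Scala sketch (no Flink dependency) of key-based routing: if the target subtask is a deterministic function of the key alone, every key ends up in exactly one per-subtask cache. The names `KeyByRoutingSketch`, `subtaskFor`, and `route` are hypothetical, and the modulo-of-hashCode routing is a simplifying assumption; Flink's actual assignment goes through key groups with its own hashing, so this only models the behavior one would expect from `myStream.keyBy(...).process(...)`.

```scala
import scala.collection.mutable

object KeyByRoutingSketch {
  // Hypothetical stand-in for key routing: depends only on the key and the
  // parallelism, so the same key always maps to the same subtask index.
  // (Assumption: simplified; Flink's real assignment uses key groups.)
  def subtaskFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Routes (key, value) events to one cache per simulated subtask and
  // accumulates values per key inside the owning subtask's cache.
  def route(events: Seq[(String, Int)],
            parallelism: Int): Vector[mutable.Map[String, Int]] = {
    // Vector.fill evaluates its element by name, so each cache is distinct.
    val caches = Vector.fill(parallelism)(mutable.Map.empty[String, Int])
    for ((key, value) <- events) {
      val cache = caches(subtaskFor(key, parallelism))
      cache(key) = cache.getOrElse(key, 0) + value
    }
    caches
  }

  def main(args: Array[String]): Unit = {
    val events = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5)
    val caches = route(events, parallelism = 4)
    // Count how many caches hold an entry for each key.
    for (key <- Seq("a", "b", "c")) {
      val owners = caches.count(_.contains(key))
      println(s"key=$key owners=$owners")
    }
  }
}
```

In this model each key is owned by exactly one cache, which is the guarantee the question hopes keyBy() provides. Note that several different keys may still share the same subtask, so a per-instance cache would hold entries for many keys, just never the same key in two instances.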

scala parallel-processing guava apache-flink flink-streaming

Source: https://stackoverflow.com/questions/55522028/does-keyby-partition-the-datastream-across-parallel-tasks-in-flink-scala