spack[scala]：按键减少嵌套的元组值

rlcwz9us 于 2021-05-27 发布在 Spark

关注(0)|答案(1)|浏览(441)

假设我有一个spark scala程序，它的rdd名为 mention_rdd 具体如下：

(name, (filename, sum))
...
(Maria, (file0, 3))
(John, (file0, 1))
(Maria, (file1, 6))
(Maria, (file2, 1))
(John, (file2, 3))
...

其中有文件名和每个名称的出现次数。
我想减少并找到，为每个名称，与最大发生的文件名。例如：

(name, (filename, max(sum))
...
(Maria, (file1, 6))
(John, (file2, 3))
...

我试着进入 (filename,sum) rdd本身的元组，所以我可以从那里减少 name （由于一个错误说我不能从 mention_rdd 因为 (String,Int) 不是一个 TraversableOnce 类型）：

val output = mention_rdd.flatMap(file_counts => file_counts._2.map(file_counts._2._1, file_counts._2._2))   
        .reduceByKey((a, b) => if (a > b) a else b)

但是我得到一个错误，说值Map不是（string，int）的成员
这能在spark内实现吗？如果是，怎么做？我的方法从一开始就有缺陷吗？

mapreduce scala rdd apache-spark reduce

来源：https://stackoverflow.com/questions/62093635/spack-scala-reduce-a-nested-tuple-value-by-key

1条答案

按热度按时间

xuo3flqw1#

为什么不只是：

val output = mention_rdd.reduceByKey {
  case ((file1, sum1), (file2, sum2)) =>
    if (sum2 >= sum1) (file2, sum2)
    else (file1, sum1)
}

赞(0）回复(0）举报 2021-05-27

我来回答

spack[scala]：按键减少嵌套的元组值

1条答案

相关问题

热门标签

最新问答