Selecting the most frequent value with groupBy

h43kikqp posted on 2021-05-27 in Spark

Sample input:

Artist   Skill
Bono     Vocals
Bono     Vocals
Bono     Vocals
Bono     Guitar
Edge     Vocals
Edge     Guitar
Edge     Guitar
Edge     Guitar
Edge     Bass
Larry    Drum
Larry    Drum
Larry    Guitar
Clayton  Bass
Clayton  Bass
Clayton  Guitar

Expected output:

Artist   Most common skill
Bono     Vocals
Edge     Guitar
Larry    Drum
Clayton  Bass

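I have a DataFrame, and I want to write deterministic Scala code that produces a new DataFrame with exactly one row per distinct "Artist", containing that artist's most common "Skill".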
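For anyone who wants to reproduce this, the sample input could be built roughly as follows. This is only a sketch: the local SparkSession, the app name, and the variable names `spark` and `df` are assumptions for illustration (in spark-shell, `spark` already exists and the builder line can be skipped).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; not part of the original question.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("most-frequent-skill")
  .getOrCreate()
import spark.implicits._ // enables .toDF on a Seq of tuples

// The 15 (Artist, Skill) rows from the sample input.
val df = Seq(
  ("Bono", "Vocals"), ("Bono", "Vocals"), ("Bono", "Vocals"), ("Bono", "Guitar"),
  ("Edge", "Vocals"), ("Edge", "Guitar"), ("Edge", "Guitar"), ("Edge", "Guitar"),
  ("Edge", "Bass"),
  ("Larry", "Drum"), ("Larry", "Drum"), ("Larry", "Guitar"),
  ("Clayton", "Bass"), ("Clayton", "Bass"), ("Clayton", "Guitar")
).toDF("Artist", "Skill")
```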

zbsbpyhn1#

You can combine groupBy with a window function, as follows:

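import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, row_number}
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

// within each artist, order the skills by how often they occur
val window = Window.partitionBy("Artist").orderBy($"count".desc)
df.groupBy("Artist", "Skill")
  .agg(count("Skill").as("count")) // number of rows per (Artist, Skill) pair
  // number the rows within each artist and keep only the first one
  .withColumn("rn", row_number().over(window))
  .where($"rn" === 1)
  .drop("rn", "count")
  .show(false)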

Output:

+-------+------+
|Artist |Skill |
+-------+------+
|Clayton|Bass  |
|Larry  |Drum  |
|Edge   |Guitar|
|Bono   |Vocals|
+-------+------+
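One caveat regarding the "deterministic" requirement: when two skills of the same artist have equal counts, ordering the window by count alone leaves the choice between them unspecified. The sample data has no such ties, but a second sort key would make the result stable if they occur. The sketch below is one possible variation, not part of the answer above; the tied data, the names `tied`, `byCountThenSkill` and `mostFrequent`, and the choice of alphabetical order as the tie-breaker are all assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, row_number}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tie-break-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data with a tie: Larry has Drum and Guitar once each.
val tied = Seq(
  ("Bono", "Vocals"), ("Bono", "Vocals"), ("Bono", "Guitar"),
  ("Larry", "Drum"), ("Larry", "Guitar")
).toDF("Artist", "Skill")

// Order by count descending, then by Skill ascending as an explicit tie-breaker.
val byCountThenSkill = Window
  .partitionBy("Artist")
  .orderBy(col("count").desc, col("Skill").asc)

val mostFrequent = tied
  .groupBy("Artist", "Skill")
  .agg(count("Skill").as("count"))
  .withColumn("rn", row_number().over(byCountThenSkill))
  .where(col("rn") === 1)
  .select("Artist", "Skill")
  .orderBy("Artist") // fixed row order for display

mostFrequent.show(false)
```

With this ordering, Larry resolves to Drum because "Drum" sorts before "Guitar"; any other total order on Skill would work equally well as a tie-breaker.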
