How to print only the rows with the largest values

3pmvbmvn · posted 2021-05-19 in Spark

I have the following DataFrame df:

+----------+--------+---------+
|        ID| text   |    count|
+----------+--------+---------+
|         3|    word|      316|
|         3|    work|      385|
|         3|    want|      205|
|         3|     cat|      251|
|         1|  office|      343|
|         1|     sky|      643|
|         1|   going|      126|
|         2|    home|      124|
|         2|  school|       23|
|         2|   sleep|      103|
//and so on

Now, for each ID, I want to show only the two rows with the largest count and drop/hide the rest:

+----------+--------+---------+
|        ID| text   |    count|
+----------+--------+---------+
|         3|    word|      316|
|         3|    work|      385|
|         1|  office|      343|
|         1|     sky|      643|
|         2|    home|      124|
|         2|   sleep|      103|
//and so on

How can we achieve this most efficiently?


des4xlb01#

Check the following code.

scala> df.show(false)
+---+------+-----+
|ID |text  |count|
+---+------+-----+
|3  |word  |316  |
|3  |work  |385  |
|3  |want  |205  |
|3  |cat   |251  |
|1  |office|343  |
|1  |sky   |643  |
|1  |going |126  |
|2  |home  |124  |
|2  |school|23   |
|2  |sleep |103  |
+---+------+-----+
scala> import org.apache.spark.sql.expressions._

scala> import org.apache.spark.sql.functions._

scala> val windowSpec = Window.partitionBy($"ID").orderBy($"count".desc)

scala> df.withColumn("rno", row_number().over(windowSpec)).filter($"rno" <= 2).show(false)

+---+------+-----+---+
|ID |text  |count|rno|
+---+------+-----+---+
|3  |work  |385  |1  |
|3  |word  |316  |2  |
|1  |sky   |643  |1  |
|1  |office|343  |2  |
|2  |school|23   |1  |
|2  |home  |124  |2  |
+---+------+-----+---+
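A note on ties: row_number() picks an arbitrary winner when two texts within the same ID share the same count. If tied rows should all be kept, rank() can be swapped in. This is a minimal sketch assuming the same df as above (tieSpec is just an illustrative name):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// rank() assigns tied counts the same rank, so ties at position 2 are all retained
val tieSpec = Window.partitionBy(col("ID")).orderBy(col("count").desc)

df.withColumn("rnk", rank().over(tieSpec))
  .filter(col("rnk") <= 2)
  .drop("rnk")
  .show(false)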

w1jd8yoj2#

Use a window function in Spark: partitionBy on ID and orderBy on count descending, then keep the first two rows of each partition with row_number.
Example:

val df=Seq((3,"word",316),(3,"work",385),(3,"want",205),(3,"cat",251),(1,"office",343),(1,"sky",643),(1,"going",126),(2,"home",124),(2,"school",23),(2,"sleep",103)).toDF("ID","text","count")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val w=Window.partitionBy(col("ID")).orderBy(desc("count"))
df.withColumn("rn",row_number().over(w)).filter(col("rn") <=2).drop("rn").show()
//+---+------+-----+
//| ID|  text|count|
//+---+------+-----+
//|  1|   sky|  643|
//|  1|office|  343|
//|  3|  work|  385|
//|  3|  word|  316|
//|  2|  home|  124|
//|  2| sleep|  103|
//+---+------+-----+
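One small caveat: the rows come back grouped by partition, so the ID order in the output above is not guaranteed to match the question. If that display order matters, an explicit sort can be appended; a short sketch reusing the w defined above:

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") <= 2)
  .drop("rn")
  .orderBy(col("ID"), desc("count")) // explicit ordering; otherwise output order depends on partitioning
  .show()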
