(工资最高的员工比其他同事多拿多少)按部门排序结果,并说明与最高工资的差异。输入数据:
val inputDf = Seq(
(1, "Hunter Fields", "IT", 15),
(2, "Leonard Lewis", "Support", 81),
(3, "Jason Dawson", "Support", 90),
(4, "Andre Grant", "Support", 25),
(5, "Earl Walton", "IT", 40),
(6, "Alan Hanson", "IT", 24),
(7, "Clyde Matthews", "Support", 31),
(8, "Josephine Leonard", "Support", 1),
(9, "Owen Boone", "HR", 27),
(10, "Max McBride", "IT", 75)
).toDF("id", "name", "department", "salary")
println("Input:")
inputDf.show(false)
我的解决方案是:
inputDf
import org.apache.spark.sql.expressions.Window
val windowSpec = Window
.partitionBy($"department")
.orderBy($"salary")
inputDf
.withColumn("diff", first(col("salary")).over(windowSpec)-$"salary")
.withColumn("diff", last(col("salary")).over(windowSpec)-$"salary")
输入:
+---+-----------------+----------+------+
|id |name |department|salary|
+---+-----------------+----------+------+
|1 |Hunter Fields |IT |15 |
|2 |Leonard Lewis |Support |81 |
|3 |Jason Dawson |Support |90 |
|4 |Andre Grant |Support |25 |
|5 |Earl Walton |IT |40 |
|6 |Alan Hanson |IT |24 |
|7 |Clyde Matthews |Support |31 |
|8 |Josephine Leonard|Support |1 |
|9 |Owen Boone |HR |27 |
|10 |Max McBride |IT |75 |
+---+-----------------+----------+------+
预期:
+---+-----------------+----------+------+----+
|id |name |department|salary|diff|
+---+-----------------+----------+------+----+
|9 |Owen Boone |HR |27 |0 |
|1 |Hunter Fields |IT |15 |60 |
|6 |Alan Hanson |IT |24 |51 |
|5 |Earl Walton |IT |40 |35 |
|10 |Max McBride |IT |75 |0 |
|8 |Josephine Leonard|Support |1 |89 |
|4 |Andre Grant |Support |25 |65 |
|7 |Clyde Matthews |Support |31 |59 |
|2 |Leonard Lewis |Support |81 |9 |
|3 |Jason Dawson |Support |90 |0 |
+---+-----------------+----------+------+----+
但目前我的实际结果是不正确的,因为diff column是0
1条答案
按热度按时间avwztpqn1#
应将窗口顺序改为降序:
然后: