I am using spark-sql-2.4.1v with Java 8. I need to calculate percentiles such as the 25th, 75th, and 90th for some given data.
I tried `percentile_approx()` from Spark SQL, but its results do not match the fractional percentiles produced by `PERCENTILE.INC()` in an Excel sheet.
So I would like to know how to fix or adjust `percentile_approx()`. Is there a way to override it, or to write a custom function that calculates fractional percentiles correctly? How would I write or modify `percentile_approx()`?
Given dataset:
val df = Seq(
  (10, "1/15/2018", 0.010680705, 10, 0.619875458, "east"),
  (10, "1/15/2018", 0.006628853,  4, 0.16039063,  "west"),
  (10, "1/15/2018", 0.01378215,  20, 0.082049528, "east"),
  (10, "1/15/2018", 0.810680705,  6, 0.819875458, "west"),
  (10, "1/15/2018", 0.702228853, 30, 0.916039063, "east"))
  .toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "zone")

val percentiles = Seq(0.25, 0.75, 0.90)    // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2") // The columns to use
For each zone, I need to calculate the given percentiles of the given columns. How can this be done?
Expected result:
+---+---------+-----------+----+--------------+-------------+-------------+------------+
| id|     date|    revenue|zone|percentile_col|quantile_0.25|quantile_0.75|quantile_0.9|
+---+---------+-----------+----+--------------+-------------+-------------+------------+
| 10|1/15/2018|0.006628853|west|    con_dist_1|          4.5|          5.5|         5.8|
| 10|1/15/2018|0.010680705|west|    con_dist_1|          4.5|          5.5|         5.8|
| 10|1/15/2018|0.010680705|east|    con_dist_1|           15|           25|        28.0|
| 10|1/15/2018| 0.01378215|east|    con_dist_1|           15|           25|        28.0|
| 10|1/15/2018|0.006628853|east|    con_dist_1|           15|           25|        28.0|
| 10|1/15/2018|0.006628853|west|    con_dist_2|  0.325261837|  0.655004251|0.7539269752|
| 10|1/15/2018|0.010680705|west|    con_dist_2|  0.325261837|  0.655004251|0.7539269752|
| 10|1/15/2018|0.010680705|east|    con_dist_2|  0.350962493| 0.4990442955| 0.749241156|
| 10|1/15/2018| 0.01378215|east|    con_dist_2|  0.350962493| 0.4990442955| 0.749241156|
| 10|1/15/2018|0.006628853|east|    con_dist_2|  0.350962493| 0.4990442955| 0.749241156|
+---+---------+-----------+----+--------------+-------------+-------------+------------+
You can verify the results using "Definition 2" at this URL: https://www.translatorscafe.com/unit-converter/en-us/calculator/percentile/
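For reference, "Definition 2" at that link is the same linear-interpolation rule that Excel's `PERCENTILE.INC()` uses: the fractional rank is p × (n − 1) into the sorted data, and the result interpolates between the two nearest values. A minimal plain-Scala sketch (the `percentileInc` helper name is mine, not a Spark API):

```scala
// Hedged sketch of the PERCENTILE.INC / "Definition 2" rule (helper name is mine).
def percentileInc(values: Seq[Double], p: Double): Double = {
  val sorted = values.sorted
  val rank = p * (sorted.size - 1)                      // fractional rank into sorted data
  val (lo, hi) = (math.floor(rank).toInt, math.ceil(rank).toInt)
  sorted(lo) + (rank - lo) * (sorted(hi) - sorted(lo))  // interpolate between neighbours
}

// con_dist_1 values for zone "east" in the question's dataset: 10, 20, 30
val east = Seq(10.0, 20.0, 30.0)
println(Seq(0.25, 0.75, 0.90).map(p => percentileInc(east, p)))
// 15.0, 25.0, ~28.0 — matching the expected-result table for east/con_dist_1
```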
1 Answer

Answer by jei2mxaa1:
A straightforward way to solve this with Spark is to manually find the two values closest to the specified percentile; from those, the fractional part is easy to calculate.
In Scala, it can be done as follows.
First, we compute the rank of each row within its `zone` and divide it by the highest rank in each group. This gives:
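(The answer's intermediate output did not survive extraction.) The relative-rank step can be illustrated in plain Scala; in the actual DataFrame code it would presumably be `rank().over(Window.partitionBy("zone").orderBy(...))` divided by the group's maximum rank:

```scala
// Plain-Scala illustration of rank / max-rank within one zone ("east", con_dist_1).
// In Spark this corresponds to rank().over(Window.partitionBy("zone").orderBy($"con_dist_1"))
// divided by the group's highest rank.
val eastConDist1 = Seq(10.0, 20.0, 30.0)
val relRank = eastConDist1.sorted.zipWithIndex.map {
  case (v, i) => v -> (i + 1).toDouble / eastConDist1.size
}
println(relRank) // (10.0, 0.33…), (20.0, 0.66…), (30.0, 1.0)
```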
We then loop over the columns under consideration and, for each, perform a `foldLeft` over the percentiles, adding two columns per percentile: the closest value below it (`lower_val`) and the closest value above it (`upper_val`). We compute the fraction at the same time, and obtain the quantile value by adding the fractional part of the interpolation, i.e. fraction × (upper_val − lower_val), to the lower bound. Finally, since we looped over the columns, we use `reduce(_.union(_))` to combine everything back into a single DataFrame. The result:
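The answer's code block was also lost in extraction. Below is a self-contained plain-Scala sketch of the same logic (grouping per zone, locating the closest lower and upper values, computing the fraction, and folding over the percentiles); the actual DataFrame version would add the `lower_val`/`upper_val` columns inside the `foldLeft` and stack the per-column results with `reduce(_.union(_))`:

```scala
// Self-contained sketch of the described approach (plain Scala; the original
// answer applies the same arithmetic per zone on DataFrames and unions the
// per-column results back together).
val data = Seq( // (zone, con_dist_1, con_dist_2) from the question's dataset
  ("east", 10.0, 0.619875458), ("west", 4.0, 0.16039063),
  ("east", 20.0, 0.082049528), ("west", 6.0, 0.819875458),
  ("east", 30.0, 0.916039063))
val percentiles = Seq(0.25, 0.75, 0.90)

def quantile(values: Seq[Double], p: Double): Double = {
  val sorted = values.sorted
  val rank = p * (sorted.size - 1)     // position between the two closest ranks
  val lower = sorted(math.floor(rank).toInt)
  val upper = sorted(math.ceil(rank).toInt)
  val fraction = rank - math.floor(rank)
  lower + fraction * (upper - lower)   // add the fractional part to the lower value
}

// For each zone, foldLeft over the percentiles to accumulate the quantile values
// of con_dist_1 (the same fold would be repeated for con_dist_2).
val conDist1 = data.groupBy(_._1).map { case (zone, rows) =>
  zone -> percentiles.foldLeft(Map.empty[Double, Double]) { (acc, p) =>
    acc + (p -> quantile(rows.map(_._2), p))
  }
}
println(conDist1("east")) // 0.25 -> 15.0, 0.75 -> 25.0, 0.9 -> ~28.0
println(conDist1("west")) // 0.25 -> 4.5,  0.75 -> 5.5,  0.9 -> 5.8
```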