我正在使用spark-sql-2.4.1v和java8。我需要计算一些给定数据的百分位数,比如25,75,90。
给定源数据集:
val df = Seq(
(10, "1/15/2018", 0.010680705, 10, 0.619875458, 0.010680705, "east"),
(10, "1/15/2018", 0.006628853, 4, 0.16039063, 0.01378215, "west"),
(10, "1/15/2018", 0.01378215, 20, 0.082049528, 0.010680705, "east"),
(10, "1/15/2018", 0.810680705, 6, 0.819875458, 0.702228853, "west"),
(10, "1/15/2018", 0.702228853, 30, 0.916039063, 0.810680705, "east"),
(11, "1/15/2018", 0.010680705, 10, 0.619875458, 0.010680705, "east"),
(11, "1/15/2018", 0.006628853, 4, 0.16039063, 0.01378215, "west"),
(11, "1/15/2018", 0.01378215, 20, 0.082049528, 0.010680705, "east"),
(11, "1/15/2018", 0.810680705, 6, 0.819875458, 0.702228853, "west"),
(11, "1/15/2018", 0.702228853, 30, 0.916039063, 0.810680705, "east"))
.toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "con_dist_3", "zone")
val percentiles = Seq(0.25, 0.75,0.90) // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2") // The columns to use
+---+---------+-----------+----------+-----------+-----------+----+
| id| date| revenue|con_dist_1| con_dist_2| con_dist_3|zone|
+---+---------+-----------+----------+-----------+-----------+----+
| 10|1/15/2018|0.010680705| 10|0.619875458|0.010680705|east|
| 10|1/15/2018|0.006628853| 4| 0.16039063| 0.01378215|west|
| 10|1/15/2018| 0.01378215| 20|0.082049528|0.010680705|east|
| 10|1/15/2018|0.810680705| 6|0.819875458|0.702228853|west|
| 10|1/15/2018|0.702228853| 30|0.916039063|0.810680705|east|
| 11|1/15/2018|0.010680705| 10|0.619875458|0.010680705|east|
| 11|1/15/2018|0.006628853| 4| 0.16039063| 0.01378215|west|
| 11|1/15/2018| 0.01378215| 20|0.082049528|0.010680705|east|
| 11|1/15/2018|0.810680705| 6|0.819875458|0.702228853|west|
| 11|1/15/2018|0.702228853| 30|0.916039063|0.810680705|east|
+---+---------+-----------+----------+-----------+-----------+----+
我需要为给定的列计算每个区域的给定百分位数。如何做到这一点?
预期结果
+----+---------+-------------+-------------+------------+-------------+
| id| date| revenue| perctile_col| quantile_0 |quantile_10 |
+----+---------+-------------+-------------+------------+-------------+
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.01378215| con_dist_1 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.01378215| con_dist_2 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 11|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
+----+---------+-------------+-------------+------------+-------------+
注:这里的分位数0等是百分位0、50等。
暂无答案!
目前还没有任何答案,快来回答吧!