How can I write a custom function, based on the percentile_approx code, that gives the same result as PERCENTILE.INC in Excel?

Posted by rhfm7lfc on 2021-06-25 in Hive

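I am using spark-sql-2.4.1v with Java 8. I need to calculate percentiles such as the 25th, 75th, and 90th for some given data.
I first tried percentile_approx() from Spark SQL. However, the results of percentile_approx() do not match the fractional percentiles that Excel produces with PERCENTILE.INC().
So I would like to know how to fix or adjust percentile_approx(). Is there a way to override percentile_approx(), or to write a custom function, so that fractional percentiles are calculated correctly? How would I write or modify percentile_approx()?
Given dataset: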

val df = Seq(
    (10, "1/15/2018", 0.010680705, 10,0.619875458, "east"),
    (10, "1/15/2018", 0.006628853,  4,0.16039063, "west"),
    (10, "1/15/2018", 0.01378215,  20,0.082049528, "east"),
    (10, "1/15/2018", 0.810680705,  6,0.819875458, "west"),
    (10, "1/15/2018", 0.702228853, 30,0.916039063, "east"))     
  .toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "zone")

val percentiles = Seq(0.25, 0.75,0.90)  // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2")  // The columns to use

I need to calculate the given percentiles for the given columns, per zone. How can this be done?
Expected result:

+---+---------+-----------+----+------------+--------------+--------------+-------------+
| id|     date|    revenue|zone|perctile_col| quantile_0.25| quantile_0.75| quantile_0.9|
+---+---------+-----------+----+------------+--------------+--------------+-------------+
| 10|1/15/2018|0.006628853|west|  con_dist_1|           4.5|           5.5|          5.8|
| 10|1/15/2018|0.010680705|west|  con_dist_1|           4.5|           5.5|          5.8|
| 10|1/15/2018|0.010680705|east|  con_dist_1|            15|            25|         28.0|
| 10|1/15/2018| 0.01378215|east|  con_dist_1|            15|            25|         28.0|
| 10|1/15/2018|0.006628853|east|  con_dist_1|            15|            25|         28.0|
| 10|1/15/2018|0.006628853|west|  con_dist_2|   0.325261837|   0.655004251| 0.7539269752|
| 10|1/15/2018|0.010680705|west|  con_dist_2|   0.325261837|   0.655004251| 0.7539269752|
| 10|1/15/2018|0.010680705|east|  con_dist_2|   0.350962493|  0.4990442955|  0.749241156|
| 10|1/15/2018| 0.01378215|east|  con_dist_2|   0.350962493|  0.4990442955|  0.749241156|
| 10|1/15/2018|0.006628853|east|  con_dist_2|   0.350962493|  0.4990442955|  0.749241156|
+---+---------+-----------+----+------------+--------------+--------------+-------------+

You can verify the results using "Definition 2" at this URL: https://www.translatorscafe.com/unit-converter/en-us/calculator/percentile/

jei2mxaa #1

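A simple way to solve this with Spark is to manually find the two values closest to the specified percentile. The fractional part can then be computed easily.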
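For reference, the interpolation that PERCENTILE.INC is generally described as performing can be written as below. The symbols n (number of values), h, k, and x_(i) (the i-th smallest value) are notation introduced here for illustration and do not appear in the original post.

```latex
h = p\,(n-1), \qquad k = \lfloor h \rfloor, \qquad
Q(p) = x_{(k+1)} + (h-k)\,\bigl(x_{(k+2)} - x_{(k+1)}\bigr)
% When h = k the second term is zero, so Q(p) = x_{(k+1)}.

% Worked example: con_dist_1 in the west zone, sorted values (4, 6), p = 0.25
h = 0.25 \cdot (2-1) = 0.25, \quad k = 0, \quad
Q(0.25) = 4 + 0.25 \cdot (6 - 4) = 4.5
```

Here x_(k+1) and x_(k+2) are the two values closest to the percentile, and h - k is the fractional part.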
In Scala, it can be done as follows.
First, we get the rank of each row within its zone and divide it by the maximum rank of its group.

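import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 0-based row rank within each zone, divided by the largest rank in that zone
val w = Window.partitionBy($"zone").orderBy($"date")
val df_zone = df.withColumn("zone_rn", row_number().over(w) - 1)
  .withColumn("zone_rn", $"zone_rn" / max($"zone_rn").over(w))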

This gives:

+---+---------+-----------+----------+-----------+----+-------+
|id |date     |revenue    |con_dist_1|con_dist_2 |zone|zone_rn|
+---+---------+-----------+----------+-----------+----+-------+
|10 |1/15/2018|0.006628853|4         |0.16039063 |west|0.0    |
|10 |1/15/2018|0.810680705|6         |0.819875458|west|1.0    |
|10 |1/15/2018|0.010680705|10        |0.619875458|east|0.0    |
|10 |1/15/2018|0.01378215 |20        |0.082049528|east|0.5    |
|10 |1/15/2018|0.702228853|30        |0.916039063|east|1.0    |
+---+---------+-----------+----------+-----------+----+-------+
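The zone_rn values in the table can be reproduced outside Spark with a few lines of plain Scala. This is only an illustrative sketch: the object ScaledRank and the helper scaledRanks are hypothetical names, not part of the original answer or of Spark.

```scala
object ScaledRank {
  // Hypothetical helper: 0-based ranks 0..n-1 divided by the largest rank (n - 1),
  // so the first row maps to 0.0 and the last row to 1.0.
  def scaledRanks(n: Int): Seq[Double] =
    if (n <= 1) Seq.fill(n)(0.0) // a single row has no larger rank to divide by
    else (0 until n).map(i => i.toDouble / (n - 1))
}
```

For a group of three rows (east) this yields 0.0, 0.5, 1.0, and for a group of two rows (west) 0.0, 1.0, matching the zone_rn column.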

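We loop over all the columns to consider and do a foldLeft over the percentiles, adding the lower and upper values for each one (lower_val and upper_val). We compute the fraction at the same time, and then obtain the quantile value by adding the fraction of the gap between the two values to the lower value.
Finally, since we loop over the columns, we use reduce(_.union(_)) to bring everything back into a single DataFrame.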

val percentiles = Seq(0.25, 0.75, 0.90)     // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2")  // The columns to use

val df_percentiles = cols.map { c =>
  percentiles.foldLeft(df_zone) { case (df, p) =>
    df.withColumn("perctile_col", lit(c))
      // Scaled ranks closest to p from below and from above
      .withColumn("zone_lower", max(when($"zone_rn" <= p, $"zone_rn")).over(w))
      .withColumn("zone_upper", min(when($"zone_rn" >= p, $"zone_rn")).over(w))
      // Column values found at those two ranks
      .withColumn("lower_val", max(when($"zone_lower" === $"zone_rn", col(c))).over(w))
      .withColumn("upper_val", min(when($"zone_upper" === $"zone_rn", col(c))).over(w))
      // Position of p between the two ranks; 0 when p falls exactly on a rank (avoids 0 / 0)
      .withColumn("fraction", when($"zone_upper" === $"zone_lower", lit(0.0))
        .otherwise((lit(p) - $"zone_lower") / ($"zone_upper" - $"zone_lower")))
      // Linear interpolation between the two values
      .withColumn(s"quantile_$p", $"lower_val" + $"fraction" * ($"upper_val" - $"lower_val"))
  }
  .drop((cols ++ Seq("zone_rn", "zone_lower", "zone_upper", "lower_val", "upper_val", "fraction")): _*)
}.reduce(_.union(_))

Result:

+---+---------+-----------+----+------------+-------------+------------------+------------------+
| id|     date|    revenue|zone|perctile_col|quantile_0.25|     quantile_0.75|      quantile_0.9|
+---+---------+-----------+----+------------+-------------+------------------+------------------+
| 10|1/15/2018|0.006628853|west|  con_dist_1|          4.5|               5.5|               5.8|
| 10|1/15/2018|0.810680705|west|  con_dist_1|          4.5|               5.5|               5.8|
| 10|1/15/2018|0.010680705|east|  con_dist_1|         15.0|              25.0|              28.0|
| 10|1/15/2018| 0.01378215|east|  con_dist_1|         15.0|              25.0|              28.0|
| 10|1/15/2018|0.702228853|east|  con_dist_1|         15.0|              25.0|              28.0|
| 10|1/15/2018|0.006628853|west|  con_dist_2|  0.325261837|0.6550042509999999|      0.7539269752|
| 10|1/15/2018|0.810680705|west|  con_dist_2|  0.325261837|0.6550042509999999|      0.7539269752|
| 10|1/15/2018|0.010680705|east|  con_dist_2|  0.350962493|      0.4990442955|0.7492411560000001|
| 10|1/15/2018| 0.01378215|east|  con_dist_2|  0.350962493|      0.4990442955|0.7492411560000001|
| 10|1/15/2018|0.702228853|east|  con_dist_2|  0.350962493|      0.4990442955|0.7492411560000001|
+---+---------+-----------+----+------------+-------------+------------------+------------------+
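To cross-check individual values against Excel, a small reference implementation in plain Scala can help. The sketch below assumes the usual description of PERCENTILE.INC (linear interpolation at position p * (n - 1) in the sorted values); the object PercentileInc and the function percentileInc are hypothetical names introduced here, not part of Spark or of the original answer.

```scala
object PercentileInc {
  // Sketch of PERCENTILE.INC-style interpolation (assumed definition:
  // interpolate linearly at position p * (n - 1) in the sorted values).
  def percentileInc(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.sorted.toVector
    val pos = p * (sorted.size - 1)                   // fractional 0-based position
    val lower = math.floor(pos).toInt                 // index of the value just below
    val upper = math.min(lower + 1, sorted.size - 1)  // index of the value just above
    val fraction = pos - lower                        // how far pos lies between the two
    sorted(lower) + fraction * (sorted(upper) - sorted(lower))
  }
}
```

For example, PercentileInc.percentileInc(Seq(4.0, 6.0), 0.25) evaluates to 4.5, in line with the west rows for con_dist_1. Note that this sketch sorts each column by its own values, whereas the window w orders rows by date, so the two approaches agree only when that row order matches the sorted order of the column being processed. For the east values of con_dist_2, for instance, the sketch returns about 0.768 at p = 0.75.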
