I have the following recursive function, which uses the interquartile range (IQR) method to identify outliers:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
      inner(xs, filtered)
  }
  inner(df.columns.toList, df)
}
val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically I iterate over the columns recursively and eliminate the outliers. Strangely, this throws an ArrayIndexOutOfBoundsException and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
Is there something wrong with my approach?
2 Answers
Here is what I came up with, and it works well:
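The code for this answer is not reproduced here, so what follows is only a minimal sketch of one possible fix, not the answerer's original code. The exception most likely arises because the filter in the question keeps the rows outside the fences, so each pass retains only outliers and the DataFrame soon becomes empty; `approxQuantile` returns an empty array for an empty DataFrame, and `quantiles(0)` then fails. The IQR is also computed with the wrong sign (`q1 - q3` instead of `q3 - q1`). The names `IqrFilter`, `iqrBounds`, `removeOutliers`, and the parameters `columns` and `k` are hypothetical, introduced for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object IqrFilter {
  // Tukey fences: [q1 - k * IQR, q3 + k * IQR], where IQR = q3 - q1.
  def iqrBounds(q1: Double, q3: Double, k: Double = 1.5): (Double, Double) = {
    val iqr = q3 - q1
    (q1 - k * iqr, q3 + k * iqr)
  }

  // Keeps only the rows that lie inside the fences for every listed column.
  // `columns` is assumed to contain numeric column names only.
  def removeOutliers(df: DataFrame, columns: Seq[String], k: Double = 1.5): DataFrame =
    columns.foldLeft(df) { (acc, c) =>
      // As in the question, quantiles are computed on the already-filtered data.
      acc.stat.approxQuantile(c, Array(0.25, 0.75), 0.0) match {
        case Array(q1, q3) =>
          val (lower, upper) = iqrBounds(q1, q3, k)
          // between is inclusive; rows with a null in this column are dropped too.
          acc.filter(col(c).between(lower, upper))
        case _ =>
          // Empty result (empty DataFrame or all-null column): leave acc unchanged.
          acc
      }
    }
}
```

A call might look like `IqrFilter.removeOutliers(incomingDF, Seq("population", "total_bedrooms", "median_house_value"))`. Passing the columns explicitly is a deliberate choice: for 0/1 indicator columns such as `inland` or `island`, both quartiles are often equal, so the IQR is 0 and the fences would discard every row that differs from the majority value.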