Apache Spark StandardScaler returns NaN

Posted by vcirk6k6 on 2022-11-16 in Apache


Environment:

spark-1.6.0 with scala-2.10.4

Usage:

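import org.apache.spark.ml.feature.StandardScaler

// Each row of df: (String, String, Double, Vector) as (id1, id2, label, feature)
val df = sqlContext.read.parquet("data/Labeled.parquet")

// Fit a scaler that divides by the standard deviation without centering
val SC = new StandardScaler()
  .setInputCol("feature").setOutputCol("scaled")
  .setWithMean(false).setWithStd(true)
  .fit(df)

// Replace the original feature column with the scaled one
val scaled = SC.transform(df)
  .drop("feature").withColumnRenamed("scaled", "feature")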

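The code follows the example at http://spark.apache.org/docs/latest/ml-features.html#standardscaler
NaN values appear in scaled, SC.mean, and SC.std.
I don't understand why StandardScaler would produce NaN even in mean, and I don't know how to handle this case.
The parquet data is 1.6 GiB; let me know if anyone needs it.
Update:
After reading through the StandardScaler code, this is probably a Double precision problem that occurs when MultivariateOnlineSummarizer aggregates the values.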
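
To make the suspicion concrete, here is a minimal sketch in plain Scala (no Spark needed) of how a running mean can end up as NaN once extreme Double values are involved. The `runningMean` helper is hypothetical and only imitates the general shape of an online mean update; it is not a copy of MultivariateOnlineSummarizer.

```scala
object OverflowDemo {
  // Simplified running-mean update. This is an illustrative assumption,
  // not Spark's actual summarizer code.
  def runningMean(values: Seq[Double]): Double = {
    var mean = 0.0
    var count = 0L
    for (v <- values) {
      count += 1
      // For opposite-signed extreme values this difference overflows to +/-Infinity.
      val diff = v - mean
      mean += diff / count
    }
    mean
  }

  def main(args: Array[String]): Unit = {
    // Ordinary data behaves as expected.
    println(runningMean(Seq(1.0, 2.0, 3.0)))

    // Adding two values at the Double limit overflows to Infinity.
    println(Double.MaxValue + Double.MaxValue)

    // Once the mean is infinite, a later update computes (-Infinity) + Infinity, which is NaN.
    println(runningMean(Seq(Double.MaxValue, -Double.MaxValue, 1.0)))
  }
}
```

If something like this happens during aggregation, NaN or Infinity in the statistics would then propagate into every scaled value of that feature, which would match the symptoms described above.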

apache-spark

Source: https://stackoverflow.com/questions/35573681/standardscaler-returns-nan

2 Answers


llmtgqce #1

Some values are equal to Double.MaxValue, and when StandardScaler sums the column, the result overflows.
Simply casting those columns to scala.math.BigDecimal works.
See here:
http://www.scala-lang.org/api/current/index.html#scala.math.BigDecimal
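
As a rough sketch of the idea in this answer, the plain-Scala snippet below flags arrays that contain values at the Double limit and shows that summing in BigDecimal avoids the overflow. Both `hasExtreme` and `safeMean` are hypothetical helpers introduced only for illustration. Note that Spark's feature vectors store Double values, so this demonstrates the arithmetic only; how to apply it to the `feature` column (for example via `Vector.toArray`) is left as an assumption.

```scala
object ExtremeValueCheck {
  // Hypothetical helper: true if any entry is NaN, infinite, or at the Double range limit.
  def hasExtreme(xs: Array[Double]): Boolean =
    xs.exists(x => x.isNaN || x.isInfinity || math.abs(x) == Double.MaxValue)

  // Hypothetical helper: sum in BigDecimal so the running total cannot overflow,
  // then convert the mean back to Double. Assumes every input is finite,
  // because BigDecimal cannot represent NaN or Infinity.
  def safeMean(xs: Seq[Double]): Double =
    (xs.map(x => BigDecimal(x)).sum / xs.size).toDouble

  def main(args: Array[String]): Unit = {
    val values = Array(Double.MaxValue, Double.MaxValue, 1.0)

    println(hasExtreme(values))            // the array contains Double.MaxValue
    println(values.sum.isInfinity)         // the naive Double sum overflows
    println(safeMean(values).isInfinity)   // the BigDecimal-based mean stays finite
  }
}
```

In practice it may be simpler to filter out or clip such rows before fitting the scaler, since a value of Double.MaxValue is often a placeholder rather than a real measurement; that is a guess about the data, not something the question confirms.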

2022-11-16

pw9qyyiw #2

One thing I tried when I ran into the same problem was resetting the index on both of the DataFrames I was working with, after the standardization step:

df = df.reset_index()
df_norm = df_norm.reset_index()

2022-11-16
