The dataset is fairly large: 30 columns and 200,000 records. I am building a GLM model with SparkR, but fitting the model takes far too long and it also fails with an error. How can I reduce the model-building time in SparkR and fix the error shown below? Please give me some suggestions for improving the code.
R code:
# Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/spark/spark-2.0.0-bin-hadoop2.7")
# Set the library paths
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.7.0_71")
# Load the SparkR library
library(SparkR)
library(rJava)
sc <- sparkR.session(enableHiveSupport = FALSE, master = "local[*]",
                     appName = "SparkR-Modi",
                     sparkConfig = list(spark.sql.warehouse.dir = "file:///c:/tmp/spark-warehouse"))
sqlContext <- sparkRSQL.init(sc)
spdf <- read.df(sqlContext, "C:/Users/prasann/Desktop/V/bigdata11.csv",
                source = "com.databricks.spark.csv", header = "true")
showDF(spdf)
# GLM model
md <- glm(NP_OfferCurrentResponse ~., family = "binomial", data = spdf)
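For reference, here is a variant I am considering, in case the slowness comes from column types. As far as I understand, spark-csv reads every column as a string unless a schema is supplied or inferSchema is enabled, and SparkR's glm then treats each string predictor as categorical and one-hot encodes it, which can blow up the feature space. This is only a sketch and assumes the predictor columns are actually numeric; it also caches the data so the iterative fit does not re-read the CSV:

# Sketch only: assumes the predictor columns are numeric. Without inferSchema,
# spark-csv loads every column as a string, which glm then one-hot encodes.
spdf <- read.df(sqlContext, "C:/Users/prasann/Desktop/V/bigdata11.csv",
                source = "com.databricks.spark.csv",
                header = "true", inferSchema = "true")
cache(spdf)   # keep the data in memory across the solver's passes
md <- glm(NP_OfferCurrentResponse ~ ., family = "binomial", data = spdf)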
Error (model fitting is very slow, then fails):
> md <- glm(NP_OfferCurrentResponse ~., family = "binomial", data = spdf)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.AssertionError: assertion failed: lapack.dppsv returned 226.
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140)
at org.apache.spark.ml.regression.GeneralizedLinearRegression$FamilyAndLink.initialize(GeneralizedLinearRegression.scala:340)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:275)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:139)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:145)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.c
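As far as I understand it, the assertion "lapack.dppsv returned 226" means the Cholesky solve of the normal equations in WeightedLeastSquares failed because the Gram matrix is not positive definite, which usually points to a constant column, perfectly collinear columns, or categorical/string columns being expanded into a near-singular design matrix. A quick check I could run to find constant columns (just a sketch; it makes no assumption about the column names):

# Sketch: flag columns with fewer than two distinct values, a common cause of a
# singular normal-equations matrix when fitting with glm/WeightedLeastSquares.
for (cname in columns(spdf)) {
  n_distinct <- head(agg(spdf, countDistinct(spdf[[cname]])))[1, 1]
  if (n_distinct < 2) print(paste("constant column:", cname))
}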