I am trying to fit my data with org.apache.spark.ml.regression.LinearRegression. So I converted my original RDD into a DataFrame and tried to feed it into the linear regression model.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate

val parsedData = dataRDD.map { item =>
  val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
  val features = Vectors.dense(doubleArray)
  Row(item._4.toDouble, features)
}

val schema = List(
  StructField("label", DoubleType, true),
  StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  parsedData,
  StructType(schema)
)

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lr_model = lr.fit(df)
Here is what the DataFrame looks like:
+---------+-------------+
| label| features|
+---------+-------------+
| 5.0|[0.0,1.0,0.0]|
| 20.0|[0.0,1.0,0.0]|
| 689.0|[0.0,1.0,0.0]|
| 627.0|[0.0,1.0,0.0]|
| 127.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 76.0|[0.0,1.0,0.0]|
| 5.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 0.0|[0.0,1.0,0.0]|
| 2.0|[0.0,1.0,0.0]|
| 329.0|[0.0,1.0,0.0]|
|2354115.0|[0.0,1.0,0.0]|
| 5.0|[0.0,1.0,0.0]|
| 4303.0|[0.0,1.0,0.0]|
+---------+-------------+
But it throws the following error:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
The actual type in the message looks identical to the required type. Can anyone help?
1 Answer
You are using org.apache.spark.ml.regression.LinearRegression (from the newer spark.ml package) together with the old VectorUDT from the deprecated mllib package, and the two do not work together. Replace new org.apache.spark.mllib.linalg.VectorUDT with new org.apache.spark.ml.linalg.VectorUDT and it should work. Note that, to avoid declaring the schema at all, you can use toDF (after importing the Spark implicits) and let Spark infer the correct type (org.apache.spark.ml.linalg.VectorUDT) for you.
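A minimal sketch of that toDF approach, assuming the same four-field string tuples as in the question (the sample rows below are hypothetical, and dataRDD stands in for the asker's original RDD):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors            // ml, not the deprecated mllib
import org.apache.spark.ml.regression.LinearRegression

val spark = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._                             // enables .toDF on an RDD of tuples

// Hypothetical sample data shaped like the question's tuples.
val dataRDD = spark.sparkContext.parallelize(Seq(
  ("0", "1", "0", "5"),
  ("0", "1", "0", "20")
))

// Map to (label, features) pairs; toDF infers DoubleType for the label
// and the ml VectorUDT for the features, so no manual schema is needed.
val df = dataRDD.map { item =>
  val features = Vectors.dense(item._1.toDouble, item._2.toDouble, item._3.toDouble)
  (item._4.toDouble, features)
}.toDF("label", "features")

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val model = lr.fit(df)
```

Because the tuple already contains an org.apache.spark.ml.linalg.Vector, the inferred schema matches exactly what LinearRegression expects, and the VectorUDT mismatch disappears.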