I am trying to fit my data with org.apache.spark.ml.regression.LinearRegression. I converted the original RDD to a DataFrame and tried to feed it to the linear regression model.

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val parsedData = dataRDD.map{
  item =>
    val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
    val features = Vectors.dense(doubleArray)
    Row(item._4.toDouble, features)
}
val schema = List(
  StructField("label", DoubleType, true),
  StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
  parsedData,
  StructType(schema)
)
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lr_model = lr.fit(df)

Here is what the DataFrame looks like:

+---------+-------------+
|    label|     features|
+---------+-------------+
|      5.0|[0.0,1.0,0.0]|
|     20.0|[0.0,1.0,0.0]|
|    689.0|[0.0,1.0,0.0]|
|    627.0|[0.0,1.0,0.0]|
|    127.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|     76.0|[0.0,1.0,0.0]|
|      5.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      0.0|[0.0,1.0,0.0]|
|      2.0|[0.0,1.0,0.0]|
|    329.0|[0.0,1.0,0.0]|
|2354115.0|[0.0,1.0,0.0]|
|      5.0|[0.0,1.0,0.0]|
|   4303.0|[0.0,1.0,0.0]|
+---------+-------------+

But it fails with the following error:

java.lang.IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.

The actual data type in the message looks no different from the required one. Can anyone help?

1 Answer

z4iuyo4d1#

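You are using org.apache.spark.ml.regression.LinearRegression (Spark ML) with the old VectorUDT from the deprecated mllib package, and the two do not work together. The error message prints only the underlying SQL struct, which is identical for both VectorUDT classes, so the two types look the same even though they are not.
Replace new org.apache.spark.mllib.linalg.VectorUDT with new org.apache.spark.ml.linalg.VectorUDT and it should work.
Note that you can avoid declaring the schema altogether by using toDF (after importing Spark's implicits), which lets Spark infer the correct type (org.apache.spark.ml.linalg.VectorUDT) for you: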

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._
val df = dataRDD.map{ item =>
    val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
    val features = Vectors.dense(doubleArray)
    (item._4.toDouble, features)
}.toDF("label", "features")
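
If you would rather keep the explicit schema, here is a minimal sketch of that first option. It assumes Spark 2.x or later and makes one substitution that is mine, not the answer's: it uses org.apache.spark.ml.linalg.SQLDataTypes.VectorType, the public handle for the ml VectorUDT, because the class itself may not be directly constructible from user code in some Spark versions. The three-row dataRDD is a hypothetical stand-in for the real data.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate

// Hypothetical sample standing in for dataRDD: (f1, f2, f3, label) as strings.
val dataRDD = spark.sparkContext.parallelize(Seq(
  ("0", "1", "0", "5"),
  ("1", "0", "0", "20"),
  ("0", "0", "1", "689")
))

// Build rows with ml (not mllib) vectors so they match the schema below.
val parsedData = dataRDD.map { item =>
  val features = Vectors.dense(item._1.toDouble, item._2.toDouble, item._3.toDouble)
  Row(item._4.toDouble, features)
}

// Assumption: VectorType is used here in place of `new ...ml.linalg.VectorUDT`.
val schema = StructType(List(
  StructField("label", DoubleType, true),
  StructField("features", VectorType, true)
))

val df = spark.createDataFrame(parsedData, schema)

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
val lrModel = lr.fit(df)
```

The important detail is that both the schema type and the Vectors import come from org.apache.spark.ml.linalg; mixing either one with its mllib counterpart brings back a type mismatch.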

2021-05-25
