Spark IllegalArgumentException: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>

2j4z5cfb · asked 2021-05-24 · in Spark

I am trying to fit my data with org.apache.spark.ml.regression.LinearRegression. So I converted my original RDD into a DataFrame and tried to feed it into the linear regression model.

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate

    // Turn each tuple into a (label, features) row
    val parsedData = dataRDD.map { item =>
      val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
      val features = Vectors.dense(doubleArray)
      Row(item._4.toDouble, features)
    }

    val schema = List(
      StructField("label", DoubleType, true),
      StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
    )

    val df = spark.createDataFrame(parsedData, StructType(schema))

    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
    val lr_model = lr.fit(df)

Here is what the DataFrame looks like:

    +---------+-------------+
    |    label|     features|
    +---------+-------------+
    |      5.0|[0.0,1.0,0.0]|
    |     20.0|[0.0,1.0,0.0]|
    |    689.0|[0.0,1.0,0.0]|
    |    627.0|[0.0,1.0,0.0]|
    |    127.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |     76.0|[0.0,1.0,0.0]|
    |      5.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      0.0|[0.0,1.0,0.0]|
    |      2.0|[0.0,1.0,0.0]|
    |    329.0|[0.0,1.0,0.0]|
    |2354115.0|[0.0,1.0,0.0]|
    |      5.0|[0.0,1.0,0.0]|
    |   4303.0|[0.0,1.0,0.0]|
    +---------+-------------+

But it throws the following error:

    java.lang.IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.

The actual type does not look any different from the required type. Can anyone help?


z4iuyo4d #1

You are combining org.apache.spark.ml.regression.LinearRegression (Spark ML) with the old VectorUDT from the deprecated mllib package, and the two do not work together. Both UDTs happen to serialize to the same underlying struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, which is why the required and actual types in the error message look identical; they are nevertheless two distinct types, so the type check fails.
Replace new org.apache.spark.mllib.linalg.VectorUDT with new org.apache.spark.ml.linalg.VectorUDT and it should work.
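For illustration, here is a minimal sketch of the corrected schema declaration, assuming the same dataRDD as in the question. SQLDataTypes.VectorType is used as the public handle for the Spark ML vector type, since the ml VectorUDT constructor may not be accessible in every Spark version:

    import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}  // ml, not mllib
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    // Same row construction as in the question, but with ml.linalg vectors
    val parsedData = dataRDD.map { item =>
      val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
      Row(item._4.toDouble, Vectors.dense(doubleArray))
    }

    // Declare the features column with the Spark ML vector type
    val schema = StructType(List(
      StructField("label", DoubleType, true),
      StructField("features", SQLDataTypes.VectorType, true)
    ))

    val df = spark.createDataFrame(parsedData, schema)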
Note that to avoid declaring the schema at all, you can use toDF (after importing spark's implicits) and let Spark infer the correct type (org.apache.spark.ml.linalg.VectorUDT) for you:

    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._

    val df = dataRDD.map { item =>
      val doubleArray = Array(item._1.toDouble, item._2.toDouble, item._3.toDouble)
      val features = Vectors.dense(doubleArray)
      (item._4.toDouble, features)
    }.toDF("label", "features")
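As a quick sanity check (the schema output below is what I would expect, assuming the DataFrame built above), the features column should now print as the ML vector type and fitting should succeed:

    import org.apache.spark.ml.regression.LinearRegression

    df.printSchema()
    // root
    //  |-- label: double (nullable = false)
    //  |-- features: vector (nullable = true)

    val lr_model = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .fit(df)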
