Extracted features column produces a (number_of_features, array[non_zero_feature_indexes], array[non_zero_feature_values]) instead of an array of the column values

mmvthczy · Posted 2021-05-27 in Spark

I am using Spark MLlib with Scala to load CSV files and assemble their columns into a feature vector for training some models; to do this I use the following code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
import spark.implicits._   // for the $"features" column syntax

// Loading the data
val rawData = spark.read.option("header", "true").csv(data)      // id, feat0, feat1, feat2,...
val rawLabels = spark.read.option("header", "true").csv(labels)  // id, label
val rawTrainingDataSet = rawData.join(rawLabels, "id")

// Set the feature columns (drop the id column)
val featureCols = rawTrainingDataSet.columns.drop(1)

// The csv columns are read as String, so cast them to Double
val exprs = featureCols.map(c => col(c).cast("Double"))

// Assembler taking a sample of just 6 columns; in the real case it should use "featureCols" as the value for "setInputCols"
val assembler = new VectorAssembler()
  .setInputCols(Array("feat0", "feat1", "feat2", "feat3", "feat4", "feat5"))
  .setOutputCol("features")

// Select all the column values and assemble them into the "features" column
val result = assembler.transform(rawTrainingDataSet.select(exprs: _*)).select($"features")
result.show(5, false)

This works, but for the features column I am not getting the result I expected from the documentation at https://spark.apache.org/docs/2.4.4/ml-features.html#vectorassembler; instead, what I get is:

feat0|feat1|feat2|feat3|feat4|feat5| features
39.0 |0.0  |  1.0|  0.0|  0.0|  1.0| [39.0,0.0,1.0,1.0,0.0,0.0]
29.0 |0.0  |  1.0|  0.0|  0.0|  0.0| (6,[0,2],[29.0,1.0])
53.0 |1.0  |  0.0|  0.0|  0.0|  0.0| (6,[0,1],[53.0,1.0])
31.0 |0.0  |  1.0|  0.0|  0.0|  1.0| (6,[0,2,5],[31.0,1.0,1.0])
37.0 |0.0  |  1.0|  0.0|  0.0|  0.0| (6,[0,2],[37.0,1.0])

As you can see, for the features column I am getting a (number_of_features, [indexes_of_non_zero_features], [values_of_non_zero_features]) representation, yet only for the first row do I get the value I expect and want for every dataframe row: an array containing all the column values, whether they are zero or not. Can you give me a hint about what I am doing wrong?
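(For reference, the (size, [indices], [values]) form shown above is just Spark's SparseVector notation; it carries exactly the same values as the dense array. A minimal sketch, using only the values from the second output row above, illustrates the equivalence:

import org.apache.spark.ml.linalg.Vectors

// (6, [0, 2], [29.0, 1.0]) is a sparse encoding of the same six values
val sparse = Vectors.sparse(6, Array(0, 2), Array(29.0, 1.0))
val dense  = Vectors.dense(29.0, 0.0, 1.0, 0.0, 0.0, 0.0)

println(sparse == dense)   // true: equality compares the logical contents
println(sparse.toDense)    // [29.0,0.0,1.0,0.0,0.0,0.0]
)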
Thank you!!

rsl1atfo · Answer 1

Convert the sparse vector to a dense vector as follows -

import org.apache.spark.sql.functions.{col, udf}

// UDF that converts an ml SparseVector to a DenseVector
val sparseToDense =
  udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)

val denseResult = result.withColumn("features_dense", sparseToDense(col("features")))
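If the goal is literally an array of all the column values (rather than an ml Vector), a small follow-up sketch under the same assumptions: the column name features_array is just an illustrative choice, and on Spark 3.0+ the built-in org.apache.spark.ml.functions.vector_to_array can replace the hand-written UDF.

import org.apache.spark.sql.functions.{col, udf}

// UDF that unpacks an ml Vector (sparse or dense) into a plain Array[Double] column
val vectorToArray = udf((v: org.apache.spark.ml.linalg.Vector) => v.toArray)

val arrayResult = result.withColumn("features_array", vectorToArray(col("features")))
arrayResult.show(5, false)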
