I am using Spark MLlib with Scala to load a CSV file and assemble its columns into a feature vector for training some models; to do this I use the following code:
// Loading the data
val rawData = spark.read.option("header", "true").csv(data) // id, feat0, feat1, feat2,...
val rawLabels = spark.read.option("header", "true").csv(labels) // id, label
val rawDataSet = rawData.join(rawLabels,"id")
// Set features columns
// Set features columns, keeping only the feature columns
val featureCols = rawDataSet.columns.filter(c => c != "id" && c != "label")
// The CSV columns are read as StringType, so cast them to Double
val exprs = featureCols.map(c => col(c).cast("Double"))
// Assembler taking a sample of just six columns; it should use "featureCols" as the value for "setInputCols" in the real case
val assembler = new VectorAssembler()
.setInputCols(Array("feat0", "feat1", "feat2", "feat3", "feat4", "feat5"))
.setOutputCol("features")
// Select all the column values to create the "features" column with them
val result = assembler.transform(rawDataSet.select(exprs: _*)).select($"features")
result.show(5,false)
This works, but the features column does not look like what the documentation shows at https://spark.apache.org/docs/2.4.4/ml-features.html#vectorassembler; instead, this is what I get:
feat0|feat1|feat2|feat3|feat4|feat5| features
39.0 |0.0 | 1.0| 0.0| 0.0| 1.0| [39.0,0.0,1.0,1.0,0.0,0.0]
29.0 |0.0 | 1.0| 0.0| 0.0| 0.0| (6,[0,2],[29.0,1.0])
53.0 |1.0 | 0.0| 0.0| 0.0| 0.0| (6,[0,1],[53.0,1.0])
31.0 |0.0 | 1.0| 0.0| 0.0| 1.0| (6,[0,2,5],[31.0,1.0,1.0])
37.0 |0.0 | 1.0| 0.0| 0.0| 0.0| (6,[0,2],[37.0,1.0])
As you can see, for the features column I am getting (number_of_features, [indices_of_non_zero_features], [values_of_non_zero_features]), except for the first row, where I get what I expected for every row: an array containing all the column values, whether they are zero or not. Can you give me a hint about what I am doing wrong?
Thank you!!
1 Answer
This is not an error: both notations encode the same values. VectorAssembler picks, row by row, whichever representation (sparse or dense) takes less storage, so rows with many zeros come out as sparse vectors. If you need every row in dense form, convert the sparse vectors to dense vectors, like this -
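A minimal sketch of the conversion with a UDF; `result` and the `features` column refer to the names used in the question's code:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// toDense keeps the same values but stores every entry explicitly,
// zeros included; the result is annotated as Vector so Spark's
// VectorUDT is used for the UDF's return type.
val toDense = udf((v: Vector) => v.toDense.asInstanceOf[Vector])

val denseResult = result.withColumn("features", toDense($"features"))
denseResult.show(5, false)
```

After this, every row of `features` displays as an array of all six values, like the first row in the output above.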