How to shuffle a sparse vector in Spark using Scala

0vvn1miw  posted on 2021-05-27  in  Spark

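I have a sparse vector in Spark and I want to randomly shuffle (reorder) its contents. The vector is actually a TF-IDF vector, and I want to reorder it so that the features appear in a different order in my new dataset. Is there a way to do this in Scala? This is the code that generates my TF-IDF vectors: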

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(data).cache()
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(wordsData)
val featurizedData = cvModel.transform(wordsData).cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData).cache()

Answer 1 (dgiusagp)

Perhaps this is useful:

Load the test data

import org.apache.spark.ml.linalg.Vectors

// Three 5-element feature vectors: one sparse, two dense
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
df.show(false)
df.printSchema()

    /**
      * +---------------------+
      * |features             |
      * +---------------------+
      * |(5,[1,3],[1.0,7.0])  |
      * |[2.0,0.0,3.0,4.0,5.0]|
      * |[4.0,0.0,0.0,6.0,7.0]|
      * +---------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      */

Shuffle the vector

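import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the $"..." column syntax

// Shuffle the dense representation of each row's vector.
// Every row gets its own random order, and the result is a dense vector.
val shuffleVector = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(vector.toArray.toSeq).toArray)
)

val p = df.withColumn("shuffled_vector", shuffleVector($"features"))
p.show(false)
p.printSchema()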

    /**
      * +---------------------+---------------------+
      * |features             |shuffled_vector      |
      * +---------------------+---------------------+
      * |(5,[1,3],[1.0,7.0])  |[1.0,0.0,0.0,0.0,7.0]|
      * |[2.0,0.0,3.0,4.0,5.0]|[0.0,3.0,2.0,5.0,4.0]|
      * |[4.0,0.0,0.0,6.0,7.0]|[4.0,7.0,6.0,0.0,0.0]|
      * +---------------------+---------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      * |-- shuffled_vector: vector (nullable = true)
      */
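
Note that the UDF above draws a fresh random order for every row, so position i no longer refers to the same feature from one row to the next. If the goal is a dataset whose features are reordered consistently, one option is to draw a single permutation once and apply it to every row. The following is a minimal sketch of that idea, not part of the original answer: the names permuteVector, randomPermutation and addPermutedColumn are hypothetical, and it assumes every vector in the column has the same size (for the TF-IDF case in the question that would be the vocabulary size). It only moves the non-zero entries, so sparse input stays sparse.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import scala.util.Random

// Hypothetical helper: draw one random permutation of 0 until n.
// A fixed seed makes the reordering reproducible.
def randomPermutation(n: Int, seed: Long): Array[Int] =
  new Random(seed).shuffle((0 until n).toList).toArray

// Hypothetical helper: move the value at old index i to new index perm(i).
// Only non-zero entries are visited, so the vector is never densified.
def permuteVector(v: Vector, perm: Array[Int]): Vector = {
  require(perm.length == v.size, "permutation length must equal vector size")
  val sv = v.toSparse
  val moved = sv.indices.map(i => perm(i)).zip(sv.values)
  // Vectors.sparse sorts the (index, value) pairs by index
  Vectors.sparse(v.size, moved.toSeq)
}

// Hypothetical helper: apply the same permutation to every row of a vector column.
def addPermutedColumn(df: DataFrame, inputCol: String, outputCol: String,
                      perm: Array[Int]): DataFrame = {
  val permute = udf((v: Vector) => permuteVector(v, perm))
  df.withColumn(outputCol, permute(col(inputCol)))
}
```

With the test DataFrame above, this could be called as addPermutedColumn(df, "features", "permuted_vector", randomPermutation(5, 42L)). Keeping the perm array around also makes it possible to map the reordered positions back to the original vocabulary.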

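You can also wrap the UDF above in a Transformer and add it to a pipeline.
Make sure to use import org.apache.spark.ml.linalg._ (not the older org.apache.spark.mllib.linalg types).

Update 1: convert the shuffled vector to a sparse vector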

// Same per-row shuffle as before, then convert the result back to a sparse vector
// (toSparse drops the explicit zeros)
val shuffleVectorToSparse = udf((vector: Vector) =>
  Vectors.dense(scala.util.Random.shuffle(vector.toArray.toSeq).toArray).toSparse
)

val p1 = df.withColumn("shuffled_vector", shuffleVectorToSparse($"features"))
p1.show(false)
p1.printSchema()

    /**
      * +---------------------+-------------------------------+
      * |features             |shuffled_vector                |
      * +---------------------+-------------------------------+
      * |(5,[1,3],[1.0,7.0])  |(5,[0,3],[1.0,7.0])            |
      * |[2.0,0.0,3.0,4.0,5.0]|(5,[1,2,3,4],[5.0,3.0,2.0,4.0])|
      * |[4.0,0.0,0.0,6.0,7.0]|(5,[1,3,4],[7.0,4.0,6.0])      |
      * +---------------------+-------------------------------+
      *
      * root
      * |-- features: vector (nullable = true)
      * |-- shuffled_vector: vector (nullable = true)
      */
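
The answer mentions wrapping the UDF in a Transformer so it can be used as a pipeline stage. A rough sketch of what that might look like follows. The class name VectorShuffler and its constructor arguments are hypothetical (nothing with this name ships with Spark), the column names are passed as plain constructor arguments instead of Params to keep it short, and asNondeterministic() assumes Spark 2.3 or later.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical pipeline stage that shuffles each row's vector independently.
class VectorShuffler(override val uid: String, inputCol: String, outputCol: String)
    extends Transformer {

  def this(inputCol: String, outputCol: String) =
    this(Identifiable.randomUID("vectorShuffler"), inputCol, outputCol)

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    // Same logic as the sparse UDF in the answer
    val shuffleFn: Vector => Vector = v =>
      Vectors.dense(scala.util.Random.shuffle(v.toArray.toSeq).toArray).toSparse
    // Mark the UDF as nondeterministic so the optimizer does not assume
    // repeated evaluations return the same value (assumes Spark 2.3+)
    val shuffle = udf(shuffleFn).asNondeterministic()
    dataset.withColumn(outputCol, shuffle(col(inputCol)))
  }

  // Declares the extra vector column that transform() appends
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains(inputCol), s"missing input column $inputCol")
    require(!schema.fieldNames.contains(outputCol), s"column $outputCol already exists")
    StructType(schema.fields :+ StructField(outputCol, VectorType, nullable = true))
  }

  override def copy(extra: ParamMap): VectorShuffler =
    new VectorShuffler(uid, inputCol, outputCol)
}
```

A stage built this way could be used directly, for example new VectorShuffler("features", "shuffled_vector").transform(df), or passed to Pipeline.setStages alongside the Tokenizer, CountVectorizer and IDF stages from the question.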
