将word2vec输出转换为数据集[\ux]

ryoqjall 于 2021-05-22 发布在 Spark

关注(0)|答案(2)|浏览(555)

我认为case类类型应该与dataframe匹配。但是，我不知道我的案例类应该是什么类型的 text 列？
我的代码如下：

case class vectorData(value: Array[String], vectors: Array[Float])
def main(args: Array[String]) {
    val word2vec = new Word2Vec()
        .setInputCol("value").setOutputCol("vectors")
        .setVectorSize(5).setMinCount(0).setWindowSize(5)
    val dataset = spark.createDataset(data)

    val model = word2vec.fit(dataset)

    val encoder = org.apache.spark.sql.Encoders.product[vectorData]
    val result = model.transform(dataset)

    result.foreach(row => println(row.get(0)))
    println("###################################")
    result.foreach(row => println(row.get(1)))

    val output  = result.as(encoder)
}

如图所示，当我打印第一列时，我得到：

WrappedArray(@marykatherine_q, know!, I, heard, afternoon, wondered, thing., Moscow, times)
WrappedArray(laying, bed, voice..)
WrappedArray(I'm, sooo, sad!!!, killed, Kutner, House, whyyyyyyyy)

当我打印第二列时，我得到：

[-0.0495405454809467,0.03403271486361821,0.011959535030958552,-0.008446224654714266,0.0014322120696306229]
[-0.06924172700382769,0.02562551060691476,0.01857258938252926,-0.0269106051127892,-0.011274430900812149]
[-0.06266747579416808,0.007715661790879334,0.047578315007472956,-0.02747830021989477,-0.015755867421188775]

我得到的错误是：

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`text`' given input columns: [result, value];

很明显，我的case类的类型与实际结果不匹配。正确的应该是什么？我想要 val output 成为 DataSet[_] .
谢谢您
编辑：
我修改了case类列名，使之与word2vec输出相同。现在我得到一个错误：

Exception in thread "main" org.apache.spark.sql.AnalysisException: need an array field but got struct<type:tinyint,size:int,indices:array<int>,values:array<double>>;

scala apache-spark word2vec

来源：https://stackoverflow.com/questions/64313063/scala-spark-convert-word2vec-output-to-dataset

2条答案

按热度按时间

r3i60tvu1#

在我看来，这只是一个属性命名的问题。spark告诉你的是它找不到属性 text 在Dataframe中 result .
你没有说你是如何创建 data 但它必须有一个属性 value 自 Word2vec 设法找到它。 model.transform 只是添加了一个 result 列，并将其转换为以下类型的Dataframe：

root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- vector: array (nullable = true)
 |    |-- element: float (containsNull = false)
 |-- result: vector (nullable = true)

所以当你试图把它变成一个数据集，spark试图找到一个 text 列并引发该异常。只需重命名 value 列，它将工作：

val output = result.withColumnRenamed("value", "text").as(encoder)

赞(0）回复(0）举报 2021-05-23

kiayqfof2#

在检查了源代码之后 word2vec ，我设法意识到转换的输出实际上不是数组[float]，而是 Vector （来自o.a.s.ml.linalg）。
它的工作原理是将case类更改如下：

case class vectorData(value: Array[String], vectors: Vector)

赞(0）回复(0）举报 2021-05-23

我来回答

将word2vec输出转换为数据集[\ux]

2条答案

相关问题

热门标签

最新问答