PySpark TypeError: Cannot convert type into Vector

Posted by rsl1atfo on 2022-12-17 in Spark

I have a DataFrame with multiple rows that look like the one below. df.head() gives:

Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))

Now I want to compute columnSimilarities() on the DataFrame, so I do the following:

rdd2 = df.rdd
mat = RowMatrix(rdd2)
sims = mat.columnSimilarities()

However, I get the following error:

File "/opt/apache-spark/spark-3.2.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 67, in _convert_to_vector
    raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

Can anyone help me with this? Thanks!

efzxgjgh (answer #1)

Each element of your current RDD is a Row wrapping the vector:

Row(features=DenseVector([1.02, 4.23, 4.534, 0.342]))

Going by the example in the official documentation, it works if each element is instead a plain list of this form:

[DenseVector([1.02, 4.23, 4.534, 0.342])]

So construct the RowMatrix like this:

RowMatrix(df.rdd.map(list))
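To illustrate why `map(list)` makes a difference, here is a minimal sketch that does not need Spark. It rests on two assumptions: `pyspark.sql.Row` is a tuple subclass, and, as far as I can tell from the PySpark source, `_convert_to_vector` compares the exact type of its argument against `list`, `tuple` and a few others rather than using `isinstance`. The namedtuple `Row` and the helper `accepts_exact_type` below are hypothetical stand-ins, not PySpark code.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row (assumed to be a tuple subclass too).
Row = namedtuple("Row", ["features"])


def accepts_exact_type(value):
    # Simplified assumption about the check inside _convert_to_vector:
    # the exact type is compared, so subclasses of tuple are rejected.
    return type(value) in (list, tuple)


row = Row(features=[1.02, 4.23, 4.534, 0.342])

print(isinstance(row, tuple))         # True: a Row-like object is a tuple
print(accepts_exact_type(row))        # False: but its exact type is not tuple
print(accepts_exact_type(list(row)))  # True: a plain list passes the check
print(list(row))                      # [[1.02, 4.23, 4.534, 0.342]]
```

Under those assumptions, `df.rdd.map(list)` simply unwraps each Row into a one-element list holding the vector, which is the shape shown above.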

Here is a complete example that reproduces and fixes your problem:

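from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.linalg.distributed import RowMatrix

# Reuse the active session (already available as `spark` in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data=[([1.02, 4.23, 4.534, 0.342],)], schema=["features"])

# Convert the array column to a vector column so df matches the one in the question.
@udf(returnType=VectorUDT())
def arrayToVector(arrCol):
    return Vectors.dense(arrCol)

df = df.withColumn("features", arrayToVector("features"))
# print(df.head())
# df.printSchema()

# mat = RowMatrix(df.rdd)  # Causes TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
mat = RowMatrix(df.rdd.map(list))  # turn each Row into a plain list first
sims = mat.columnSimilarities()
print(sims.entries.collect())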

[Out]:
[MatrixEntry(2, 3, 1.0), MatrixEntry(0, 1, 1.0), MatrixEntry(1, 2, 1.0), MatrixEntry(0, 3, 1.0), MatrixEntry(1, 3, 1.0), MatrixEntry(0, 2, 1.0)]
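Every similarity in that output is 1.0 because the example DataFrame has only one row. columnSimilarities() computes the cosine similarity between columns, and with a single row each column is a one-element vector, so any two positive columns point in the same direction. The pure-Python sketch below shows this without Spark; `column_cosine_similarities` is a hypothetical helper written for illustration, and the second row in `two_rows` is made-up data.

```python
import math
from itertools import combinations


def column_cosine_similarities(rows):
    """Return {(i, j): cosine similarity} for every column pair with i < j."""
    columns = list(zip(*rows))  # transpose rows into columns
    norms = [math.sqrt(sum(x * x for x in col)) for col in columns]
    sims = {}
    for i, j in combinations(range(len(columns)), 2):
        dot = sum(a * b for a, b in zip(columns[i], columns[j]))
        sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


one_row = [[1.02, 4.23, 4.534, 0.342]]
two_rows = one_row + [[4.0, 1.0, 0.5, 2.0]]  # made-up second row

# One row: all six pairs come out at 1.0, up to floating-point rounding.
print(column_cosine_similarities(one_row))

# Two rows: the columns are no longer parallel, so the values drop below 1.
print(column_cosine_similarities(two_rows))
```

So the result above is expected for this toy input. On a DataFrame with many rows, the entries should spread out between 0 and 1 for non-negative features.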
