How to get the mean vector of a vector column over a window in a PySpark DataFrame?

sqxo8psd · Published 2021-05-18 in Spark

I have a PySpark DataFrame with a vector column "features". I want to add a column containing the mean of the feature vectors over a window.
I know how to compute the mean vector over the entire dataset:

from pyspark.sql.functions import col
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors

dataset = [[Vectors.dense([1, 0, 0, -2])],
  [Vectors.dense([4, 5, 0, 3])],
  [Vectors.dense([6, 7, 0, 8])],
  [Vectors.dense([9, 0, 0, 1])]]
v_df = spark.createDataFrame(dataset, ['features'])

df_mean = v_df.select(Summarizer.mean(col('features'))) \
  .collect()

print(df_mean)

I also know how to do this for a numeric column over a window:

from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df = spark.createDataFrame([(1, 'a', 5), (2, 'a', 8), (3, 'b', 10), (1, 'b', 15), (2, 'a', 18), (3, 'c', 10)], ['order', 'category', 'signal'])

window = Window.partitionBy('category').orderBy("order").rowsBetween(Window.currentRow, 2)
mean_df = df.select(
        mean(col('signal')).over(window)
    )

mean_df.show()

But how can I do this for a vector column over a window? I'd like to do something like:
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors

dataset = [[1, 'a', Vectors.dense([1, 0, 0, -2])],
  [2, 'a', Vectors.dense([4, 5, 0, 3])],
  [2, 'b', Vectors.dense([6, 7, 0, 8])],
  [1, 'b', Vectors.dense([9, 0, 0, 1])]]
v_df = spark.createDataFrame(dataset, ['order', 'category', 'features'])

window = Window.partitionBy('category').orderBy("order").rowsBetween(Window.currentRow, 2)

df_mean = v_df.select(Summarizer.mean(col('features')).over(window)) \
  .collect()

print(df_mean)

But this doesn't work.
Is there an elegant way to achieve this?

No answers yet.
