I have a PySpark DataFrame with a vector column "features". I want to add a column containing the mean of the feature vectors over a window.
I know how to get the mean vector of the whole dataset:
from pyspark.sql.functions import col, mean
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors

dataset = [[Vectors.dense([1, 0, 0, -2])],
           [Vectors.dense([4, 5, 0, 3])],
           [Vectors.dense([6, 7, 0, 8])],
           [Vectors.dense([9, 0, 0, 1])]]
v_df = spark.createDataFrame(dataset, ['features'])

df_mean = v_df.select(Summarizer.mean(col('features'))) \
    .collect()
print(df_mean)
And I know how to compute the mean of a numeric column over a window:
from pyspark.sql import Window
from pyspark.sql import functions as func

df = spark.createDataFrame(
    [(1, 'a', 5), (2, 'a', 8), (3, 'b', 10),
     (1, 'b', 15), (2, 'a', 18), (3, 'c', 10)],
    ['order', 'category', 'signal'])

window = Window.partitionBy('category').orderBy('order') \
    .rowsBetween(Window.currentRow, 2)

mean_df = df.select(
    func.mean(func.col('signal')).over(window)
)
mean_df.show()
But how can I do the same for the vector column over a window? I would like to do something like this:
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.ml.stat import Summarizer
from pyspark.ml.linalg import Vectors

dataset = [[1, 'a', Vectors.dense([1, 0, 0, -2])],
           [2, 'a', Vectors.dense([4, 5, 0, 3])],
           [2, 'b', Vectors.dense([6, 7, 0, 8])],
           [1, 'b', Vectors.dense([9, 0, 0, 1])]]
v_df = spark.createDataFrame(dataset, ['order', 'category', 'features'])

window = Window.partitionBy('category').orderBy('order') \
    .rowsBetween(Window.currentRow, 2)

df_mean = v_df.select(Summarizer.mean(col('features')).over(window)) \
    .collect()
print(df_mean)
But this does not work.
Is there an elegant way to achieve this?
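A possible workaround, sketched below and untested against the original data: as far as I know Summarizer.mean is an aggregate without window support, so one option is to convert the vector into an array with pyspark.ml.functions.vector_to_array (Spark >= 3.0), average each component over the window, and reassemble the result with array_to_vector (Spark >= 3.1). The column names arr and mean_features and the hard-coded n_dims are placeholders of my own, and the vector length is assumed to be known up front.

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array, array_to_vector
from pyspark.ml.linalg import Vectors

dataset = [[1, 'a', Vectors.dense([1, 0, 0, -2])],
           [2, 'a', Vectors.dense([4, 5, 0, 3])],
           [2, 'b', Vectors.dense([6, 7, 0, 8])],
           [1, 'b', Vectors.dense([9, 0, 0, 1])]]
v_df = spark.createDataFrame(dataset, ['order', 'category', 'features'])

window = Window.partitionBy('category').orderBy('order') \
    .rowsBetween(Window.currentRow, 2)

n_dims = 4  # assumption: the vector length is known in advance
arr_df = v_df.withColumn('arr', vector_to_array('features'))

# element-wise mean over the window, one expression per vector component
component_means = [F.mean(F.col('arr')[i]).over(window) for i in range(n_dims)]

# rebuild a dense vector column from the per-component window means
result = arr_df.withColumn('mean_features',
                           array_to_vector(F.array(*component_means)))
result.show(truncate=False)

The obvious drawback of this sketch is that the number of dimensions has to be known in advance.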