I want to use PySpark to compute a group-by and rolling average on a huge dataset. I'm not used to PySpark and I'm having trouble spotting my mistake. Why doesn't this work?
import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()

data = pd.DataFrame({'group': ['A'] * 5 + ['B'] * 5,
                     'order': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                     'value': [23, 54, 65, 64, 78, 98, 78, 76, 77, 57]})
spark_df = spark.createDataFrame(data)

window_spec = Window.partitionBy("group").orderBy("order").rowsBetween(-1, 0)
# Rolling average of "value" over the previous row and the current row
rolling_avg = avg(col("value")).over(window_spec).alias("value_rolling_avg")
# Group by "group" and compute the rolling average of "value"
spark_df.groupby("group").agg(rolling_avg).show()
AnalysisException: [COLUMN_NOT_IN_GROUP_BY_CLAUSE] The expression "order" is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in `first()` (or `first_value()`) if you don't care which value you get.;
1 Answer
You are mixing a window function with an aggregation. groupBy().agg() collapses each group to a single row, while a window expression like rolling_avg is evaluated once per input row, which is why Spark complains that order is neither in the GROUP BY clause nor inside an aggregate function.
You can simply select rolling_avg to get your result:
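The answer's code snippet did not survive, so the following is a minimal sketch of what that select might look like, reusing spark_df, window_spec, and rolling_avg as defined in the question; the withColumn variant is an assumed equivalent alternative, not part of the original answer.

# Window functions are computed per row, so apply them with select
# instead of groupBy().agg():
spark_df.select("group", "order", "value", rolling_avg).show()

# Assumed alternative: attach the rolling average as an extra column.
spark_df.withColumn("value_rolling_avg",
                    avg(col("value")).over(window_spec)).show()

Both keep one row per input row, with value_rolling_avg holding the mean of the current and previous value within each group.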