pyspark: Groupby, Window and rolling average in Spark

kokeuurv · posted 2023-04-11 · in Spark

I want to run a groupby with a rolling average over a huge dataset using pyspark. I'm new to pyspark and can't see my mistake. Why doesn't this work?

import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()

data = pd.DataFrame({'group': ['A'] * 5 + ['B'] * 5,
                     'order': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                     'value': [23, 54, 65, 64, 78, 98, 78, 76, 77, 57]})

spark_df = spark.createDataFrame(data)

# Window over the current row and the previous one, per group, ordered by "order"
window_spec = Window.partitionBy("group").orderBy("order").rowsBetween(-1, 0)

# Calculate the rolling average of "value"
rolling_avg = avg(col("value")).over(window_spec).alias("value_rolling_avg")

# Group by "group" and calculate the rolling average -- this raises the error below
spark_df.groupby("group").agg(rolling_avg).show()
AnalysisException: [COLUMN_NOT_IN_GROUP_BY_CLAUSE] The expression "order" is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in `first()` (or `first_value()`) if you don't care which value you get.;

zpgglvta · answer 1

You are mixing window functions with aggregate functions. avg(col("value")).over(window_spec) is a window function: it is evaluated once per row, so it cannot be passed to agg() after a groupby. The partitionBy("group") in the window spec already gives you the per-group behavior.
You can simply select rolling_avg to get your result:

spark_df.select(rolling_avg).show()

+-----------------+
|value_rolling_avg|
+-----------------+
|             23.0|
|             38.5|
|             59.5|
|             64.5|
|             71.0|
|             98.0|
|             88.0|
|             77.0|
|             76.5|
|             67.0|
+-----------------+
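To keep the original columns alongside the rolling average, you can add them to the select (a small variant not shown in the original answer; passing "*" together with a Column is standard pyspark):

spark_df.select("*", rolling_avg).show()

Or use withColumn, which gives the same result: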

spark_df.withColumn("value_rolling_avg", rolling_avg).show()

+-----+-----+-----+-----------------+
|group|order|value|value_rolling_avg|
+-----+-----+-----+-----------------+
|    A|    1|   23|             23.0|
|    A|    2|   54|             38.5|
|    A|    3|   65|             59.5|
|    A|    4|   64|             64.5|
|    A|    5|   78|             71.0|
|    B|    1|   98|             98.0|
|    B|    2|   78|             88.0|
|    B|    3|   76|             77.0|
|    B|    4|   77|             76.5|
|    B|    5|   57|             67.0|
+-----+-----+-----+-----------------+
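If you actually wanted a grouped aggregate on top of the rolling average (for example, the mean of each group's rolling averages; this is an assumption about the intent, the question does not specify it), compute the window column first, then aggregate it in a second step:

# Sketch under the assumption above; "mean_rolling_avg" is an
# illustrative name, not from the original post.
(spark_df
 .withColumn("value_rolling_avg", rolling_avg)
 .groupby("group")
 .agg(avg("value_rolling_avg").alias("mean_rolling_avg"))
 .show())

This works because the window function is evaluated in the withColumn step, so agg() only ever sees an ordinary column.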
