Suppose I have two different windows with the same partitioning:
```python
window1 = Window.partitionBy("id")
window2 = Window.partitionBy("id").orderBy("date")
```
And then I call several consecutive window functions using them:
```python
df.withColumn("col1", F.sum("x").over(window1)) \
  .withColumn("col2", F.first("x").over(window2))
```
And suppose df is not partitioned by id. Will the computation of col2 cause another shuffle, or will it reuse the same partitioning?
Does adding df.repartition("id") before the computation bring any performance improvement?
1 Answer
TL;DR: only one shuffle occurs; repartition is useless here.

That's actually quite easy to verify.
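For reference, here is a minimal, self-contained setup that can be used to reproduce the check. The sample rows are made up; only the column names (id, date, x) come from the question:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with the columns used in the question (the values are made up).
df = spark.createDataFrame(
    [(1, "2023-01-01", 10.0), (1, "2023-01-02", 20.0), (2, "2023-01-01", 5.0)],
    ["id", "date", "x"],
)

window1 = Window.partitionBy("id")
window2 = Window.partitionBy("id").orderBy("date")
```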
Now, let us use explain on your code. As you can see, there is only one Exchange (= shuffle). Adding repartition yields the exact same execution plan, no change at all.

The takeaway here is that, regardless of how it does it, Spark remembers that it has already partitioned the data and how it partitioned it, so it does not have to do it again.
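The plan output itself is not reproduced here, but the check would look roughly like this, building on the setup sketched above. Since both windows are keyed on id, a single Exchange hashpartitioning(id, ...) node can feed both Window operators:

```python
q = (
    df.withColumn("col1", F.sum("x").over(window1))
      .withColumn("col2", F.first("x").over(window2))
)
q.explain()  # expect a single Exchange feeding both Window operators

# Adding an explicit repartition on the same key changes nothing:
# the plan should still contain exactly one Exchange.
q_repart = (
    df.repartition("id")
      .withColumn("col1", F.sum("x").over(window1))
      .withColumn("col2", F.first("x").over(window2))
)
q_repart.explain()
```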
Finally, notice that with window3, which needs a different partitioning, we get two shuffles.
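window3 is not defined in the question; as an assumed illustration, any window keyed on a different column will do, for example:

```python
# Hypothetical third window on a different key (assumed for illustration).
window3 = Window.partitionBy("date")

(
    df.withColumn("col1", F.sum("x").over(window1))
      .withColumn("col2", F.first("x").over(window2))
      .withColumn("col3", F.sum("x").over(window3))
).explain()
# The plan now needs both hashpartitioning(id, ...) and hashpartitioning(date, ...),
# i.e. two Exchange nodes instead of one.
```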