spark窗口函数,这是spark中的错误吗?

sd2nnvve  于 2021-05-29  发布在  Spark
关注(0)|答案(3)|浏览(548)

有人能验证这是否是spark中的错误吗?或者我用pyspark窗口函数做了什么可怕的错误:
以下是Dataframe:

下面是我正在运行的代码,用于替换post\ evar8列中的空值:

win_mid_desc_ts = Window.partitionBy('post_visid_high_low').orderBy(desc('hit_time_gmt'))
step3win = step3win.withColumn("post_evar8", last(col('post_evar8'), ignorenulls=True).over(win_mid_desc_ts))
step3win.orderBy("visit_page_num").show(100, truncate=False)

运行上述代码后,我得到以下结果:

如您所见,window函数更新了post\u evar8列中的空值,但也将184545857替换为32526519(请访问第26页和第27页)。不知道为什么要替换184545857值。
这里是json中相同的Dataframe(可以复制并粘贴到文件中)

{"post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524187","visit_page_num":1}
{"post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524197","visit_page_num":2}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524197","visit_page_num":3}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524568","visit_page_num":14}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524568","visit_page_num":15}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":18}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":19}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":20}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":21}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":26}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":27}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":28}

<

nqwrtyyt

nqwrtyyt1#

不,不是虫子。
因为申请 partitionBypost_visid_high_low 在数据框中具有相同值的列将把数据框中的整个数据视为一个分区,您将按该分区应用排序 hit_time_gmt 降序,最终结果将按如下顺序排列。

>>> df.orderBy(F.desc("hit_time_gmt")).show(truncate=False)
+------------+----------+---------------------------------------+--------------+
|hit_time_gmt|post_evar8|post_visid_high_low                    |visit_page_num|
+------------+----------+---------------------------------------+--------------+
|1590525921  |184545857 |3283497750620215155_4391202461688050070|26            |
|1590525921  |32526519  |3283497750620215155_4391202461688050070|28            |
|1590525921  |184545857 |3283497750620215155_4391202461688050070|27            |
|1590524568  |32526519  |3283497750620215155_4391202461688050070|15            |
|1590524568  |32526519  |3283497750620215155_4391202461688050070|14            |
|1590524197  |null      |3283497750620215155_4391202461688050070|2             |
|1590524197  |32526519  |3283497750620215155_4391202461688050070|3             |
|1590524187  |null      |3283497750620215155_4391202461688050070|1             |
+------------+----------+---------------------------------------+--------------+

根据上述结果 post_evar832526519 相同的值将替换为中的其他值 post_evar8 列。
在这个函数中添加更多不同的值 post_visid_high_low 列尝试运行相同的代码并检查。

os8fio9y

os8fio9y2#

-更新更多示例:

4bbkushb

4bbkushb3#

->
下面是一个唯一命中时间gmt(黄色框)值的示例,后评估8看起来正确(红色框)。

下面是修改命中时间gmt(黄色框)的示例,因此有2个相似的(1590525922)和1个唯一的(1590525921)。中间的post_evar8值从184545857更新为32526519(红色框)。这是错误的。

在这个窗口函数中,我只想更新post\ evar8中的空值(不是已经填充的值)。在所有情况下,这看起来都是正确的(绿框)。hit\u time\u gmt只是提供命令,为什么hit\u time\u gmt中的值会更改post\u evar8的值(在红色框中)?

相关问题