有人能验证这是否是spark中的错误吗?或者我用pyspark窗口函数做了什么可怕的错误:
以下是Dataframe:
下面是我正在运行的代码,用于替换post\ evar8列中的空值:
win_mid_desc_ts = Window.partitionBy('post_visid_high_low').orderBy(desc('hit_time_gmt'))
step3win = step3win.withColumn("post_evar8", last(col('post_evar8'), ignorenulls=True).over(win_mid_desc_ts))
step3win.orderBy("visit_page_num").show(100, truncate=False)
运行上述代码后,我得到以下结果:
如您所见,window函数更新了post\u evar8列中的空值,但也将184545857替换为32526519(请访问第26页和第27页)。不知道为什么要替换184545857值。
这里是json中相同的Dataframe(可以复制并粘贴到文件中)
{"post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524187","visit_page_num":1}
{"post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524197","visit_page_num":2}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524197","visit_page_num":3}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524568","visit_page_num":14}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524568","visit_page_num":15}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":18}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":19}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":20}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":21}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":26}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":27}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":28}
<
3条答案
按热度按时间nqwrtyyt1#
不,不是虫子。
因为申请
partitionBy
在post_visid_high_low
在数据框中具有相同值的列将把数据框中的整个数据视为一个分区,您将按该分区应用排序hit_time_gmt
降序,最终结果将按如下顺序排列。根据上述结果
post_evar8
是32526519
相同的值将替换为中的其他值post_evar8
列。在这个函数中添加更多不同的值
post_visid_high_low
列尝试运行相同的代码并检查。os8fio9y2#
-更新更多示例:
4bbkushb3#
->
下面是一个唯一命中时间gmt(黄色框)值的示例,后评估8看起来正确(红色框)。
下面是修改命中时间gmt(黄色框)的示例,因此有2个相似的(1590525922)和1个唯一的(1590525921)。中间的post_evar8值从184545857更新为32526519(红色框)。这是错误的。
在这个窗口函数中,我只想更新post\ evar8中的空值(不是已经填充的值)。在所有情况下,这看起来都是正确的(绿框)。hit\u time\u gmt只是提供命令,为什么hit\u time\u gmt中的值会更改post\u evar8的值(在红色框中)?