PySpark SQL: replace elements with null

yhived7q  posted on 2021-08-09  in Java

I'm trying to write a SQL query to run in PySpark that scrubs information from a PySpark DataFrame. The DataFrame I want to modify looks like this:

hashed_customer     firstname    lastname    email   order_id    status          timestamp
      eater 1_uuid  1_firstname  1_lastname  1_email    12345    OPTED_IN     2020-05-14 20:45:15
      eater 2_uuid  2_firstname  2_lastname  2_email    23456    OPTED_IN     2020-05-14 20:29:22
      eater 3_uuid  3_firstname  3_lastname  3_email    34567    OPTED_IN     2020-05-14 19:31:55
      eater 4_uuid  4_firstname  4_lastname  4_email    45678    OPTED_IN     2020-05-14 17:49:27

I have another PySpark DataFrame listing the customers whose data I need to remove from the customer_temp_tb table. It looks like this:

hashed_customer    eaterstatus
   eater 1_uuid      OPTED_OUT
   eater 3_uuid      OPTED_OUT

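I'm trying to write a SQL query to use in PySpark that clears firstname, lastname and email in the first table whenever the customer appears in the second table. Something like: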

UPDATE customer_temp_tb
SET firstname="", lastname="", email=""
WHERE hashed_eater_uuid IN
(SELECT hashed_eater_uuid FROM opt_out_temp_tb)

So the final result would look like this:

hashed_customer     firstname    lastname    email   order_id    status          timestamp
   eater 1_uuid           NaN         NaN      NaN    12345    OPTED_IN     2020-05-14 20:45:15
   eater 2_uuid   2_firstname  2_lastname  2_email    23456    OPTED_IN     2020-05-14 20:29:22
   eater 3_uuid           NaN         NaN      NaN    34567    OPTED_IN     2020-05-14 19:31:55
   eater 4_uuid   4_firstname  4_lastname  4_email    45678    OPTED_IN     2020-05-14 17:49:27

My problem is that PySpark doesn't support UPDATE. Is there an alternative?

ubbxdtey1#

I think that, instead of deleting, you can set the columns to null or to an empty string "".
