I'm trying to write a SQL query to use in PySpark to scrub information from a PySpark DataFrame. The df I want to modify looks like this:
hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     1_firstname  1_lastname  1_email  12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     3_firstname  3_lastname  3_email  34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27
The customers whose details need to be removed from customer_temp_tb are listed in a second PySpark DataFrame, which looks like this:
hashed_customer  eaterstatus
eater 1_uuid     OPTED_OUT
eater 3_uuid     OPTED_OUT
I'm trying to write a SQL query to use in PySpark that would clear firstname, lastname, and email in the first table whenever the customer appears in the second table. Something like:
UPDATE customer_temp_tb
SET firstname="", lastname="", email=""
WHERE hashed_eater_uuid IN
(SELECT hashed_eater_uuid FROM opt_out_temp_tb)
So the final result would look like this:
hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     NaN          NaN         NaN      12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     NaN          NaN         NaN      34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27
My problem is that PySpark does not support UPDATE statements. Is there an alternative way to do this?
1 Answer
I think that instead of deleting anything, you can update the columns to null (or to an empty string "") for the affected customers.
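A minimal sketch of that idea, assuming the two DataFrames are named customer_df and opt_out_df (placeholder names, not from the question): flag the opted-out customers with a left join on hashed_customer, then use when/otherwise to null out the personal columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's two tables.
customer_df = spark.createDataFrame(
    [("eater 1_uuid", "1_firstname", "1_lastname", "1_email", "12345", "OPTED_IN", "2020-05-14 20:45:15"),
     ("eater 2_uuid", "2_firstname", "2_lastname", "2_email", "23456", "OPTED_IN", "2020-05-14 20:29:22")],
    ["hashed_customer", "firstname", "lastname", "email", "order_id", "status", "timestamp"],
)
opt_out_df = spark.createDataFrame(
    [("eater 1_uuid", "OPTED_OUT")],
    ["hashed_customer", "eaterstatus"],
)

# Mark every customer that appears in the opt-out table ...
flagged = customer_df.join(
    opt_out_df.select("hashed_customer").withColumn("_opted_out", F.lit(True)),
    on="hashed_customer",
    how="left",
)

# ... and blank the personal columns for the marked rows.
for c in ["firstname", "lastname", "email"]:
    flagged = flagged.withColumn(
        c,
        F.when(F.col("_opted_out").isNotNull(), F.lit(None).cast("string")).otherwise(F.col(c)),
    )

scrubbed = flagged.drop("_opted_out")
scrubbed.show(truncate=False)

If you prefer to stay in SQL, the same result can be expressed by registering both DataFrames as temp views and writing a SELECT with a LEFT JOIN and CASE WHEN, since Spark SQL supports SELECT over DataFrames but not UPDATE.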