I have two files: orders_renamed.csv and customers.csv. I join them with a full outer join and then drop the duplicate column (customer_id). I want to replace null values in the "order_id" column with -1.
I tried this:
from pyspark.sql.functions import regexp_extract, monotonically_increasing_id, unix_timestamp, from_unixtime, coalesce
from pyspark.sql.types import IntegerType, StructField, StructType, StringType
ordersDf = spark.read.format("csv").option("header", True).option("inferSchema", True).option("path", "C:/Users/Lenovo/Desktop/week12/week 12 dataset/orders_renamed.csv").load()
customersDf = spark.read.format("csv").option("header", True).option("inferSchema", True).option("path", "C:/Users/Lenovo/Desktop/week12/week 12 dataset/customers.csv").load()
joinCondition1 = ordersDf.customer_id == customersDf.customer_id
joinType1 = "outer"
joinenullreplace = ordersDf.join(customersDf, joinCondition1, joinType1).drop(ordersDf.customer_id).select("order_id", "customer_id", "customer_fname").sort("order_id").withColumn("order_id",coalesce("order_id",-1))
joinenullreplace.show(50)
As you can see in the last line, I have used coalesce, but it gives me an error. I have tried multiple approaches, such as treating coalesce as an expression and applying `expr`, but it did not work. I have also used `lit`, but to no avail. Please reply with a solution.