I have two files: orders_renamed.csv and customers.csv. I join them with a full outer join and then drop the duplicate column (customer_id). I want to replace null values in the "order_id" column with -1.
I tried this:
from pyspark.sql.functions import regexp_extract, monotonically_increasing_id, unix_timestamp, from_unixtime, coalesce
from pyspark.sql.types import IntegerType, StructField, StructType, StringType
ordersDf = spark.read.format("csv").option("header", True).option("inferSchema", True).option("path", "C:/Users/Lenovo/Desktop/week12/week 12 dataset/orders_renamed.csv").load()
customersDf = spark.read.format("csv").option("header", True).option("inferSchema", True).option("path", "C:/Users/Lenovo/Desktop/week12/week 12 dataset/customers.csv").load()
joinCondition1 = ordersDf.customer_id == customersDf.customer_id
joinType1 = "outer"
joinenullreplace = ordersDf.join(customersDf, joinCondition1, joinType1).drop(ordersDf.customer_id).select("order_id", "customer_id", "customer_fname").sort("order_id").withColumn("order_id",coalesce("order_id",-1))
joinenullreplace.show(50)
As you can see in the last line, I have used coalesce, but it gives me an error. I have tried multiple approaches, such as treating coalesce as an expression and applying `expr`, but it did not work. I have also used `lit`, but to no avail. Please reply with a solution.