Can we provide default values for a column of a custom schema in PySpark?

Posted by csbfibhn on 2023-01-05 in Apache


I'm new to PySpark; please let me know if you have a solution to this problem.
I created a custom schema in PySpark as shown below:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# All three columns are nullable strings
structureSchema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
    StructField('col3', StringType(), True),
])

I have a text file containing multiple JSON records, like this:

{'col1':'abc','col2':'abc1','col3':'qwe'}
{'col1':'abc','col2':'abc1'}
{'col1':'abc''col3':'qwe'} .

When I load this file with the custom schema, the missing column entries are filled with null.

df=spark.read.schema(structureSchema).json(fpath)   

col1      col2     col3
abc       abc1     qwe
abc       abc1     null
abc       null     null

Is there a way to fill them with a default value, "NoValueReceived" instead of "null", as shown below?

col1      col2     col3
abc       abc1     qwe
abc       abc1     NoValueReceived
abc       NoValueReceived  NoValueReceived

apache-spark

Source: https://stackoverflow.com/questions/75003785/can-we-provide-default-values-to-a-column-of-a-custom-schema-in-pyspark

2 Answers


1tuwyuhd1#

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() replaces NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any constant literal value.
Reference: https://sparkbyexamples.com/pyspark/pyspark-fillna-fill-replace-null-values/

df = spark.read.schema(structureSchema).json(fpath)
df = df.na.fill(value="NoValueReceived",subset=["col1", "col2", "col3"])
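
If different columns need different defaults, df.na.fill also accepts a dict that maps column names to replacement values. The following is a minimal sketch of that variant, not part of the original answer: the helper names build_fill_map and fill_missing and the overrides parameter are hypothetical, and it assumes string columns like those in the question's schema.

```python
def build_fill_map(columns, default="NoValueReceived", overrides=None):
    """Return {column: replacement} for every column, honoring per-column overrides."""
    overrides = overrides or {}
    unknown = set(overrides) - set(columns)
    if unknown:
        raise ValueError(f"overrides reference unknown columns: {sorted(unknown)}")
    return {name: overrides.get(name, default) for name in columns}


def fill_missing(df, default="NoValueReceived", overrides=None):
    """Replace nulls in every column of a PySpark DataFrame (hypothetical helper)."""
    # df.na.fill accepts a dict mapping column name -> replacement value.
    return df.na.fill(build_fill_map(df.columns, default, overrides))
```

With the DataFrame from the question, fill_missing(df) would fill all three columns with "NoValueReceived", while fill_missing(df, overrides={"col3": "unknown"}) would use a different value for col3 only.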

Answered 2023-01-05

rekjcdws2#

In this case, I think you can check whether the column is present in the incoming data and fall back to a default, for example:

from pyspark.sql.functions import coalesce, lit

# Read in the JSON file with a pre-defined schema
df = spark.read.json("path/to/file.json", schema=schema)

# Fill in missing values with a default value
df = df.withColumn("column_name", coalesce(df["column_name"], lit("default_value")))
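
The snippet above handles one column at a time. To apply the same coalesce idea to every column in one pass, one option is to generate Spark SQL expressions and pass them to df.selectExpr. This is only a sketch that swaps the Column API for SQL expression strings; the helper names coalesce_exprs and with_defaults are hypothetical, and it assumes string columns whose names contain no backticks.

```python
def coalesce_exprs(columns, default="NoValueReceived"):
    """Build one Spark SQL 'coalesce(col, default) AS col' expression per column."""
    # This sketch does not escape SQL string literals, so reject risky defaults.
    if "'" in default or "\\" in default:
        raise ValueError("default must not contain quotes or backslashes in this sketch")
    return [f"coalesce(`{name}`, '{default}') AS `{name}`" for name in columns]


def with_defaults(df, default="NoValueReceived"):
    """Return a DataFrame whose null cells are replaced by the default (hypothetical helper)."""
    # selectExpr evaluates each SQL expression string and keeps the original column order.
    return df.selectExpr(*coalesce_exprs(df.columns, default))
```

For the question's DataFrame, with_defaults(df) would be called right after spark.read.schema(structureSchema).json(fpath).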

Answered 2023-01-05
