PySpark: replacing null values on subsets of rows

Posted by yv5phkfx on 2022-11-21 in Spark


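I have a PySpark DataFrame containing null values that I want to replace. However, the replacement value differs from group to group.
My data looks like this (apologies, I could not paste it as text):

For group A, I want to replace nulls with -999; for group B, I want to replace nulls with 0.
Currently, I split the data into parts and then run df = df.fillna(-999).
Is there a more efficient way? In pseudocode, I am thinking of something along the lines of df = df.where(col('group') == A).fillna(lit(-999)).where(col('group') == B).fillna(lit(0)), but of course this does not work.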

pyspark

Source: https://stackoverflow.com/questions/74456021/pyspark-replacing-null-value-on-subsets-of-rows

2 Answers


egdjgwm81#

You can use when:

from pyspark.sql import functions as F

# Loop over all the columns you want to fill
for col in ('Col1', 'Col2', 'Col3'):
    # compute here conditions to fill using a value or another
    fill_a = F.col(col).isNull() & (F.col('Group') == 'A')
    fill_b = F.col(col).isNull() & (F.col('Group') == 'B')

    # Fill the column based on the different conditions 
    # using nested `when` - `otherwise`.
    #
    # Do not forget to add the last `otherwise` with the original 
    # values if none of the previous conditions have been met
    filled_col = (
        F.when(fill_a, -999)
        .otherwise(
            F.when(fill_b, 0)
            .otherwise(F.col(col))
        )
    )

    # 'overwrite' the original column with the filled column
    df = df.withColumn(col, filled_col)

2022-11-21

eimct9ow2#

Another option is to use coalesce on each column, with a "filler" column holding the replacement value:

import pyspark.sql.functions as F

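for c in ['Col1', 'Col2', 'Col3']:
    # Replacement value for the row's group; null for any other group,
    # so coalesce leaves those rows unchanged.
    filler = (F.when(F.col('group') == 'A', -999)
               .when(F.col('group') == 'B', 0))
    df = df.withColumn(c, F.coalesce(F.col(c), filler))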

2022-11-21
