PySpark: remove nulls from an array after merging double-type columns

hmtdttj4 · posted 2023-05-16 · Spark

PySpark df

+---------+----+----+----+----+----+----+----+----+----+                        
|partition|   1|   2|   3|   4|   5|   6|   7|   8|   9|
+---------+----+----+----+----+----+----+----+----+----+
|        7|null|null|null|null|null|null| 0.7|null|null|
|        1| 0.2| 0.1| 0.3|null|null|null|null|null|null|
|        8|null|null|null|null|null|null|null| 0.8|null|
|        4|null|null|null| 0.4| 0.5| 0.6|null|null| 0.9|
+---------+----+----+----+----+----+----+----+----+----+

I merged the nine value columns on the right into a single array column:

+---------+--------------------+                                                
|partition|            vec_comb|
+---------+--------------------+
|        7|      [,,,,,,,, 0.7]|
|        1|[,,,,,, 0.1, 0.2,...|
|        8|      [,,,,,,,, 0.8]|
|        4|[,,,,, 0.4, 0.5, ...|
+---------+--------------------+
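
For reference, a minimal sketch of how such a merge could be reproduced (an assumption on my part: combining the nine value columns with F.array and then sorting with sort_array, which places nulls first as shown above; this may not match how vec_comb was actually built):

from pyspark.sql import functions as F

value_cols = [str(i) for i in range(1, 10)]
# Collect the nine value columns into one array column; sort_array (ascending)
# puts nulls first, matching the layout shown above
df = df.withColumn('vec_comb', F.sort_array(F.array(*[F.col(c) for c in value_cols])))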

How do I remove the nulls (NullType values) from the arrays in the vec_comb column?

Expected output:

+---------+--------------------+                                                
|partition|            vec_comb|
+---------+--------------------+
|        7|               [0.7]|
|        1|     [0.1, 0.2, 0.3]|
|        8|               [0.8]|
|        4|[0.4, 0.5, 0.6, 0.9]|
+---------+--------------------+

I tried this (it's obviously wrong, but I can't figure out why):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

def clean_vec(array):
    new_Array = []
    for element in array:
        if type(element) == FloatType():
            new_Array.append(element)
    return new_Array

udf_clean_vec = F.udf(f=(lambda c: clean_vec(c)), returnType=ArrayType(FloatType()))
df = df.withColumn('vec_comb_cleaned', udf_clean_vec('vec_comb'))

n3schb8v1#

Spark 3.4+

F.array_compact('vec_comb')

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([None, None, None, 0.7],),
     ([None, None, 0.1, 0.2, 0.3, None],),
     ([None, None, None, 0.8, None],),
     ([None, 0.4, 0.5, 0.6, None, 0.9],)],
    ['vec_comb'])
df.show(truncate=0)
# +---------------------------------+
# |vec_comb                         |
# +---------------------------------+
# |[null, null, null, 0.7]          |
# |[null, null, 0.1, 0.2, 0.3, null]|
# |[null, null, null, 0.8, null]    |
# |[null, 0.4, 0.5, 0.6, null, 0.9] |
# +---------------------------------+

df = df.withColumn('vec_comb', F.array_compact('vec_comb'))

df.show()
# +--------------------+
# |            vec_comb|
# +--------------------+
# |               [0.7]|
# |     [0.1, 0.2, 0.3]|
# |               [0.8]|
# |[0.4, 0.5, 0.6, 0.9]|
# +--------------------+

omtl5h9j2#

You can use the higher-order function filter to remove the null elements:

import pyspark.sql.functions as F

df2 = df.withColumn('vec_comb_cleaned', F.expr('filter(vec_comb, x -> x is not null)'))

df2.show()
+---------+--------------------+--------------------+
|partition|            vec_comb|    vec_comb_cleaned|
+---------+--------------------+--------------------+
|        7|      [,,,,,, 0.7,,]|               [0.7]|
|        1|[0.2, 0.1, 0.3,,,...|     [0.2, 0.1, 0.3]|
|        8|      [,,,,,,, 0.8,]|               [0.8]|
|        4|[,,, 0.4, 0.5, 0....|[0.4, 0.5, 0.6, 0.9]|
+---------+--------------------+--------------------+
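
On Spark 3.1+ the same higher-order function is also exposed directly in the Python API as F.filter, so the expr() string can be avoided (a sketch of the equivalent call):

from pyspark.sql import functions as F

# Equivalent to the filter() expression above; requires Spark 3.1+
df2 = df.withColumn('vec_comb_cleaned', F.filter('vec_comb', lambda x: x.isNotNull()))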

You could also use a UDF, but it will be slower, e.g.:

udf_clean_vec = F.udf(lambda x: [i for i in x if i is not None], 'array<float>')
df2 = df.withColumn('vec_comb_cleaned', udf_clean_vec('vec_comb'))
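
Note that this lambda will raise an error if vec_comb itself is NULL (rather than just containing nulls); a guarded variant of the UDF, as a sketch:

udf_clean_vec = F.udf(
    lambda x: [i for i in x if i is not None] if x is not None else None,
    'array<float>')
df2 = df.withColumn('vec_comb_cleaned', udf_clean_vec('vec_comb'))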

zpgglvta3#

Without using PySpark-specific features, you can also build the list by filtering out the NaN values (here using pandas):

import pandas as pd  # the DataFrame here is a pandas DataFrame

df['vec_comb'] = df.iloc[:, 1:10].apply(lambda r: list(filter(pd.notna, r)), axis=1)
df

# Output:
   partition     1     2     3     4     5     6     7     8     9              vec_comb
0          7   NaN   NaN   NaN   NaN   NaN   NaN   0.7   NaN   NaN                 [0.7]
1          1   0.2   0.1   0.3   NaN   NaN   NaN   NaN   NaN   NaN       [0.2, 0.1, 0.3]
2          8   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.8   NaN                 [0.8]
3          4   NaN   NaN   NaN   0.4   0.5   0.6   NaN   NaN   0.9  [0.4, 0.5, 0.6, 0.9]

Then drop the old columns by selecting only the two columns you need:

df = df[['partition', 'vec_comb']]
df

# Output:
   partition              vec_comb
0          7                 [0.7]
1          1       [0.2, 0.1, 0.3]
2          8                 [0.8]
3          4  [0.4, 0.5, 0.6, 0.9]
