Unable to replace a particular value from an array column in PySpark

Posted by vecaoik1 on 2022-11-01 in Spark

One column in my DataFrame has the following data type:

testcolumn: array
-- element: struct
----- id: integer
----- configName: string
----- desc: string
----- configparam: array
-------- element: map
------------- key: string
------------- value: string

testcolumn

Row 1:

[{"id":1,"configName":"test1","desc":"Ram1","configparam":[{"removeit":"[]"}]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[{"removeit":"[]"}]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]

Row 2:

[{"id":11,"configName":"test11","desc":"Ram11","configparam":[{"removeit":"[]"}]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[{"removeit":"[]"}]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]

I want to replace **"configparam":[{"removeit":"[]"}]** with **"configparam":[]**.

Expected output:

outputcolumn

Row 1:

[{"id":1,"configName":"test1","desc":"Ram1","configparam":[]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]

Row 2:

[{"id":11,"configName":"test11","desc":"Ram11","configparam":[]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]

I have tried this code, but it does not give me the output:

test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,':[{"removeit":"[]"}]','[]')))

It would be great if someone could help me.

pyspark

Source: https://stackoverflow.com/questions/74196479/unable-to-replace-a-particular-value-from-an-array-column-in-pyspark

2 Answers

qv7cva1a1#

You need a chain of explode, filter, and groupBy operations to achieve this.
First, explode the array/struct/map column to reach the nested columns:

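from pyspark.sql import functions as F
# Pull the scalar struct fields out into top-level columns.
df = df.withColumn("id", F.col("testcolumn")["id"])
df = df.withColumn("configName", F.col("testcolumn")["configName"])
df = df.withColumn("desc", F.col("testcolumn")["desc"])
# Explode the configparam array: one row per map.
df = df.withColumn("configparam_exploded", F.explode(F.col("testcolumn")["configparam"]))
# Explode each map: one row per entry, in new columns "key" and "value".
df = df.select(df.columns + [F.explode(F.col("configparam_exploded"))])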

+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|testcolumn                                           |id |configName|desc|configparam_exploded             |key       |value|
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|{1, test1, Ram1, [{removeit -> []}]}                 |1  |test1     |Ram1|{removeit -> []}                 |removeit  |[]   |
|{2, test2, Ram2, [{removeit -> []}]}                 |2  |test2     |Ram2|{removeit -> []}                 |removeit  |[]   |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3  |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramId   |4    |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3  |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200  |
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+

Then filter the data as required:

df = df.filter((F.col("key") != "removeit") | (F.col("value") != "[]"))

+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|testcolumn                                           |id |configName|desc|configparam_exploded             |key       |value|
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3  |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramId   |4    |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3  |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200  |
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+

Finally, use groupBy to assemble the individual columns back into the original structure:

df = df.withColumn("configparam_map", F.map_from_entries(F.array(F.struct("key", "value"))))
df = df.groupBy(["id", "configName", "desc"]).agg(F.collect_list("configparam_map").alias("configparam"))
df = df.withColumn("testcolumn", F.struct("id", "configName", "desc", "configparam"))
df = df.drop("id", "configName", "desc", "configparam")

+-------------------------------------------------------+
|testcolumn                                             |
+-------------------------------------------------------+
|{3, test3, Ram1, [{paramId -> 4}, {paramvalue -> 200}]}|
+-------------------------------------------------------+
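To see why the final table holds only id 3, and why its single map comes back as two one-key maps, here is a plain-Python model of the explode, filter, and groupBy steps. It is only a sketch of the semantics, not Spark code: the function name explode_filter_regroup and the two sample rows are made up for illustration, and Spark's collect_list does not guarantee the ordering that this model happens to produce.

```python
from collections import defaultdict

# Hypothetical sample rows: one struct per row, as in this answer's dataset.
rows = [
    {"id": 1, "configName": "test1", "desc": "Ram1",
     "configparam": [{"removeit": "[]"}]},
    {"id": 3, "configName": "test3", "desc": "Ram1",
     "configparam": [{"paramId": "4", "paramvalue": "200"}]},
]

def explode_filter_regroup(rows):
    # "Explode": one record per (struct, map entry) pair.
    exploded = [
        (r["id"], r["configName"], r["desc"], key, value)
        for r in rows
        for m in r["configparam"]
        for key, value in m.items()
    ]
    # "Filter": drop the removeit -> "[]" entries.
    kept = [e for e in exploded if e[3] != "removeit" or e[4] != "[]"]
    # "groupBy + collect_list": every surviving entry becomes its own one-key map.
    groups = defaultdict(list)
    for id_, name, desc, key, value in kept:
        groups[(id_, name, desc)].append({key: value})
    return [
        {"id": i, "configName": n, "desc": d, "configparam": maps}
        for (i, n, d), maps in groups.items()
    ]

result = explode_filter_regroup(rows)
print(result)
```

In this model the struct with id 1 disappears because none of its exploded records survive the filter, so there is nothing left to group. Keeping such structs with an empty configparam would need an extra step that this answer does not show.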

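Sample dataset used to reproduce the problem (here each row's testcolumn is a single struct rather than an array of structs):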

from pyspark.sql import Row
from pyspark.sql.types import (ArrayType, IntegerType, MapType, StringType,
                               StructField, StructType)

schema = StructType([StructField('testcolumn', StructType([StructField('id', IntegerType(), True), StructField('configName', StringType(), True), StructField('desc', StringType(), True), StructField('configparam', ArrayType(MapType(StringType(), StringType(), True), True), True)]), True)])

data = [
  Row(Row(1, "test1", "Ram1", [{"removeit":"[]"}])),
  Row(Row(2, "test2", "Ram2", [{"removeit":"[]"}])),
  Row(Row(3, "test3", "Ram1", [{"paramId":"4","paramvalue":"200"}]))    
]

df = spark.createDataFrame(data = data, schema = schema)

Answered 2022-11-01

798qvoo82#

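Your testcolumn is an array of structs, so you cannot apply string operations to it as is.
You can do something like the following. It drops any map in configparam that *contains* the key "removeit", even if that map also holds other keys; when configparam has a single map, as in your data, this empties it entirely.
Example: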

"configparam":[{"removeit":[], "otherparam": "value"}] -> "configparam": []

Spark 3.1.0+

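from pyspark.sql import functions as F

# Despite its name, this is True for maps WITHOUT the key "removeit", i.e. the ones to keep.
array_has_remove = lambda y: ~F.array_contains(F.map_keys(y), 'removeit')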

df = (df.withColumn('outputcolumn', 
          F.transform('testcolumn', 
              lambda x: x.withField('configparam', 
                  F.filter(x['configparam'], array_has_remove)
              )
          )
     ))

References: withField, filter, array_contains, map_keys
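The transform/filter combination above can be hard to picture, so here is a plain-Python model of what it does to one row's testcolumn value. This is a sketch of the semantics only, not Spark code; the function name drop_removeit_maps and the "mixed" sample struct are invented for illustration.

```python
def drop_removeit_maps(testcolumn):
    """Model of transform(testcolumn, x -> x.withField('configparam', filter(...)))."""
    return [
        # Copy every field of the struct, replacing only configparam with the
        # maps that do not contain the key "removeit".
        {**struct,
         "configparam": [m for m in struct["configparam"] if "removeit" not in m]}
        for struct in testcolumn
    ]

# Row 1 from the question.
row1 = [
    {"id": 1, "configName": "test1", "desc": "Ram1",
     "configparam": [{"removeit": "[]"}]},
    {"id": 2, "configName": "test2", "desc": "Ram2",
     "configparam": [{"removeit": "[]"}]},
    {"id": 3, "configName": "test3", "desc": "Ram1",
     "configparam": [{"paramId": "4", "paramvalue": "200"}]},
]
output = drop_removeit_maps(row1)

# Hypothetical struct with two maps: only the map holding "removeit" is dropped.
mixed = [{"id": 9, "configName": "x", "desc": "y",
          "configparam": [{"removeit": "[]", "otherparam": "value"},
                          {"paramId": "7"}]}]
mixed_output = drop_removeit_maps(mixed)

for struct in output:
    print(struct["id"], struct["configparam"])
```

Because the filter runs per map, a configparam with several maps keeps the ones that lack "removeit", which matters only if your real data has more than one map per struct.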
Spark < 3.1.0
I tried to do this without explode, but it gets complicated. If you dislike the complexity, you can use explode and aggregation instead.

df = (# extract configparam to a column for easier access.
      df.withColumn('configparam', F.expr('transform(testcolumn, x -> x.configparam)'))
        # Return empty array if there is a "removeit" otherwise return the original object.
        # The SQL spans several lines, so the Python string must be triple-quoted.
        .withColumn('configparam', 
            F.expr('''transform(configparam, x -> 
                        case when array_contains(map_keys(x[0]), "removeit") then array() 
                        else x end)'''))
        # Patch the transformed configparam with the rest of testcolumn
        .withColumn('outputcolumn', 
            F.expr('transform(testcolumn, (x, i) -> struct(x.id, x.configName, x.desc, configparam[i] as configparam))'))
        .drop('configparam'))

Test result

Row(testcolumn=[Row(id=1, configName='test1', desc='Ram1', configparam=[{'removeit': '[]'}]), Row(id=2, configName='test2', desc='Ram2', configparam=[{'removeit': '[]'}]), Row(id=3, configName='test3', desc='Ram1', configparam=[{'paramId': '4', 'paramvalue': '200'}])], 
  outputcolumn=[Row(id=1, configName='test1', desc='Ram1', configparam=[]), Row(id=2, configName='test2', desc='Ram2', configparam=[]), Row(id=3, configName='test3', desc='Ram1', configparam=[{'paramId': '4', 'paramvalue': '200'}])])
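The pre-3.1 variant inspects only the first map of each configparam (x[0]) and then rebuilds every struct by index. A plain-Python model of those three steps may make that easier to follow. It is a sketch of the semantics, not Spark code; the function name empty_if_first_map_has_removeit and the sample structs are made up, and the handling of an empty configparam assumes Spark's default non-ANSI behaviour, where x[0] on an empty array yields null.

```python
def empty_if_first_map_has_removeit(testcolumn):
    # Step 1: pull configparam out of every struct.
    configparams = [s["configparam"] for s in testcolumn]
    # Step 2: use [] when the FIRST map has the key "removeit", else keep as is.
    patched = [
        [] if (cp and "removeit" in cp[0]) else cp
        for cp in configparams
    ]
    # Step 3: rebuild each struct, picking the patched configparam by index.
    return [
        {"id": s["id"], "configName": s["configName"], "desc": s["desc"],
         "configparam": patched[i]}
        for i, s in enumerate(testcolumn)
    ]

# Hypothetical structs that show the position dependence.
sample = [
    {"id": 1, "configName": "a", "desc": "d",
     "configparam": [{"removeit": "[]"}, {"paramId": "9"}]},
    {"id": 2, "configName": "b", "desc": "d",
     "configparam": [{"paramId": "9"}, {"removeit": "[]"}]},
    {"id": 3, "configName": "c", "desc": "d", "configparam": []},
]
patched_sample = empty_if_first_map_has_removeit(sample)
for struct in patched_sample:
    print(struct["id"], struct["configparam"])
```

For data like the question's, where each configparam holds exactly one map, this gives the same result as the Spark 3.1.0+ version; the two differ only when a configparam contains several maps.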

Answered 2022-11-01
