我的DF中有一列的数据类型为:
testcolumn:array
--element: struct
-----id:integer
-----configName: string
-----desc:string
-----configparam:array
--------element:map
-------------key:string
-------------value:string
测试柱
行1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[{"removeit":"[]"}]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[{"removeit":"[]"}]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
第二行:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[{"removeit":"[]"}]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[{"removeit":"[]"}]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
我要移除的配置参数为**“配置参数”:[{“移除”:“[]"}]至“配置参数”:[]**
预期输出:
输出列
行1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
第二行:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
我已经尝试了这段代码,但它没有给我输出:
test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,':[{"removeit":"[]"}]','[]')))
如果有人能帮助我,那就太好了。
2条答案
按热度按时间qv7cva1a1#
您必须执行一连串的爆炸、筛选和groupBy作业才能达成此目的。
首先,展开数组/结构/Map列以到达嵌套列:
然后,根据需要筛选数据:
最后,groupBy将所有单独的列恢复为原始填充:
用于重现问题的示例数据集:
798qvoo82#
您的
testcolumn
是一个struct数组,因此无法按原样执行字符串操作。你可以这样做。当
configparam
* 包含 * 一个键“removeit”时,这将完全清空configparam
。示例:
Spark3.1.0+
参考编号:withField、filter、array_contains、map_keys
〈Spark3.1.0
我试过不使用
explode
,但是这很复杂。如果你不喜欢这个复杂,你可以尝试使用explode
和聚合。测试结果