I have a DataFrame with the following columns:
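| Column A | Column B |
| --- | --- |
| 123 | [{"test" -> "manual", "percentage" -> "50"}, {"test" -> "automatic", "percentage" -> "80"}, {"test" -> "manual", "percentage" -> "50"}, {"test" -> "manual", "percentage" -> "50"}] |
| 456 | [{"test" -> "manual", "percentage" -> "50"}, {"test" -> "automatic", "percentage" -> "25"}, {"test" -> "manual", "percentage" -> "50"}] |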
Is there any way to remove the duplicates in Column B so that the resulting column looks like this:
| Column A | Column B |
| --- | --- |
| 123 | [{"test" -> "manual", "percentage" -> "50"}, {"test" -> "automatic", "percentage" -> "80"}] |
| 456 | [{"test" -> "manual", "percentage" -> "50"}, {"test" -> "automatic", "percentage" -> "25"}] |
I have tried `distinct()`, a UDF, and `array_distinct`, without success. Can you help me?
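To make the expected result concrete, here is a plain-Python sketch of the deduplication I am after, applied to a single row's value. This is not Spark code: the helper name `dedupe_maps` and the variable names are made up for illustration, and it assumes that the first occurrence of each map should be kept and the original order preserved.

```python
def dedupe_maps(maps):
    """Return the list with duplicate dicts removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for m in maps:
        # dicts are not hashable, so use a sorted tuple of (key, value) pairs as the identity
        key = tuple(sorted(m.items()))
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique


# Value of Column B for the row where Column A is 123
row_123 = [
    {"test": "manual", "percentage": "50"},
    {"test": "automatic", "percentage": "80"},
    {"test": "manual", "percentage": "50"},
    {"test": "manual", "percentage": "50"},
]

result_123 = dedupe_maps(row_123)
print(result_123)
```

The goal is to get the same effect on the whole column in PySpark.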
1 Answer
Because PySpark cannot compare `MapType` values for uniqueness, we need to perform the following steps in order: