How to remove duplicates in a PySpark column of type ArrayType(MapType(StringType(), StringType()))

e5nszbig  asked on 2023-03-01 in Spark

I have columns in a DataFrame as below:
| Column A | Column B |
| -------- | -------- |
| 123 | [{"Test":"Manual","percentage":"50"},{"Test":"Automate","percentage":"80"},{"Test":"Manual","percentage":"50"},{"Test":"Manual","percentage":"50"}] |
| 456 | [{"Test":"Manual","percentage":"50"},{"Test":"Automate","percentage":"25"},{"Test":"Manual","percentage":"50"}] |
Is there any way to remove the duplicates in Column B, so that the resulting column looks like this:
| Column A | Column B |
| -------- | -------- |
| 123 | [{Test -> Manual, percentage -> 50}, {Test -> Automate, percentage -> 80}] |
| 456 | [{Test -> Manual, percentage -> 50}, {Test -> Automate, percentage -> 25}] |
I have tried distinct(), a UDF, and array_distinct. Could you please help me with this?


ruarlubt1#

Since PySpark cannot compare MapType values for uniqueness, we need to deduplicate on a comparable representation: convert each map to its entry array with map_entries (structs are comparable), apply array_distinct, and then rebuild the maps with map_from_entries:

import pyspark.sql.functions as F

df.select(
    df['Column A'],
    F.transform(
        F.array_distinct(F.transform(df['Column B'], F.map_entries)),
        F.map_from_entries,
    ).alias('Column B'),
).show(truncate=False)
+--------+--------------------------------------------------------------------------+
|Column A|Column B                                                                  |
+--------+--------------------------------------------------------------------------+
|123     |[{Test -> Manual, percentage -> 50}, {Test -> Automate, percentage -> 80}]|
|456     |[{Test -> Manual, percentage -> 50}, {Test -> Automate, percentage -> 25}]|
+--------+--------------------------------------------------------------------------+
