I have a Spark DataFrame, spdf, with data that looks like this:
player_name  team_history
John         [{Rangers, Center, Active}, {Blackhawks, Center, Former}, {Kings, Center, Former}]
Bob          [{Devils, Defense, Active}, {Maple Leafs, Defense, Former}, {Canadiens, Defense, Former}]
The schema is:
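from pyspark.sql.types import ArrayType, StringType, StructField, StructType

hockey_schema = StructType([
    StructField("player_name", StringType(), True),
    # team_history is an array of structs, one struct per team the player was on
    StructField("team_history", ArrayType(
        StructType([
            StructField("team", StringType(), True),
            StructField("position", StringType(), True),
            StructField("status", StringType(), True),
        ])), True)
])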
The JSON looks like this:
[{ "player_name" : "John", "team_history" : [ { "team" : "Rangers", "position" : "Center", "status" : "Active" }, { "team" : "Blackhawks", "position" : "Center", "status" : "Former"}, { "team" : "Kings", "position" : "Center", "status" : "Former"} ] },
{ "player_name" : "Bob", "team_history" : [ { "team" : "Devils", "position" : "Defense", "status" : "Active" }, { "team" : "Maple Leafs", "position" : "Defense", "status" : "Former"}, { "team" : "Canadiens", "position" : "Defense", "status" : "Former"} ] }]
I want to "explode" the contents of the team_history column to create a new DataFrame named df_exploded whose only columns are team and status, like this:
team         status
Rangers      Active
Blackhawks   Former
Kings        Former
Devils       Active
Maple Leafs  Former
Canadiens    Former
How can I use the explode() function in PySpark to create the desired df_exploded DataFrame?
Thank you.
1 Answer
enyaitl3 #1
Using explode and then pulling the values you are interested in out of the struct seems to do the trick: