How to use "explode" in PySpark to extract selected elements from an array

sg3maiej  posted on 2023-10-15  in  Spark

I have a Spark DataFrame, spdf, with data that looks like this:

player_name    team_history

John           [{Rangers, Center, Active}, {Blackhawks, Center, Former}, {Kings, Center, Former}],
Bob            [{Devils, Defense, Active}, {Maple Leafs, Defense, Former}, {Canadiens, Defense, Former}]

The schema is:

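from pyspark.sql.types import ArrayType, StringType, StructField, StructType

hockey_schema = StructType([
    StructField("player_name", StringType(), True),
    # Each element of team_history is a struct with three string fields.
    StructField("team_history", ArrayType(
        StructType([
            StructField("team", StringType(), True),
            StructField("position", StringType(), True),
            StructField("status", StringType(), True),
        ])), True),
])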

The JSON looks like this:

[{ "player_name" : "John", "team_history" : [ { "team" : "Rangers", "position" : "Center", "status" : "Active" }, { "team" : "Blackhawks", "position" : "Center", "status" : "Former"}, { "team" : "Kings", "position" : "Center", "status" : "Former"} ] },

{ "player_name" : "Bob", "team_history" : [ { "team" : "Devils", "position" : "Defense", "status" : "Active" }, { "team" : "Maple Leafs", "position" : "Defense", "status" : "Former"}, { "team" : "Canadiens", "position" : "Defense", "status" : "Former"} ] }]

I want to "explode" the contents of the team_history column to create a new DataFrame named df_exploded that contains only the team and status columns, like this:

team          status
Rangers       Active
Blackhawks    Former
Kings         Former
Devils        Active
Maple Leafs   Former
Canadiens     Former

How can I use the explode() function in PySpark to create the desired df_exploded DataFrame?
Thank you.
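As a reference point, the reshaping the question asks for can be written in plain Python without Spark: one output row per element of team_history, keeping only two fields. This is only a sketch for checking the expected result on small data. The variable names are illustrative and the sample string is a shortened copy of the JSON above.

```python
import json

# Shortened copy of the sample data from the question.
raw = """
[
  {"player_name": "John", "team_history": [
     {"team": "Rangers", "position": "Center", "status": "Active"},
     {"team": "Blackhawks", "position": "Center", "status": "Former"},
     {"team": "Kings", "position": "Center", "status": "Former"}]},
  {"player_name": "Bob", "team_history": [
     {"team": "Devils", "position": "Defense", "status": "Active"},
     {"team": "Maple Leafs", "position": "Defense", "status": "Former"},
     {"team": "Canadiens", "position": "Defense", "status": "Former"}]}
]
"""

players = json.loads(raw)

# "Explode": flatten every player's array into one (team, status) pair per element.
pairs = [
    (stint["team"], stint["status"])
    for player in players
    for stint in player["team_history"]
]

for team, status in pairs:
    print(f"{team:<14}{status}")
```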

enyaitl3  #1

Using explode and then extracting the values you are interested in from the struct seems to do the trick:

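from pyspark.sql import functions as F

# explode yields one row per array element; then pick the struct fields needed.
df_exploded = (
    spdf
    .select(F.explode("team_history").alias("s"))
    .select("s.team", "s.status")
)
df_exploded.show(truncate=False)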
+-----------+------+
|team       |status|
+-----------+------+
|Rangers    |Active|
|Blackhawks |Former|
|Kings      |Former|
|Devils     |Active|
|Maple Leafs|Former|
|Canadiens  |Former|
+-----------+------+
