Python: moving a deeply nested field up one level in a PySpark DataFrame

Posted by vsaztqbk on 2023-01-16 in Python

I have a PySpark DataFrame created from XML. Because of how the XML is structured, the DataFrame's schema contains an extra, unnecessary level of nesting.
Schema of the current DataFrame:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: struct (nullable = true)
|    |    |    |-- movie: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)

I am trying to replace the movies struct with the movie array nested inside it, like this:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)

The closest I have come is:

from pyspark.sql import functions as F

df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))

This produces the following schema:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: struct (nullable = true)
|    |    |    |-- movie: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)
|    |    |-- movies_new: array (nullable = true)
|    |    |    |-- element: array (containsNull = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)

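I understand why this happens, but I thought that if I never extracted the nested array out of "a", it might not turn into an array of arrays.
Any suggestions?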

n9vozmp4 #1

The logic is:

  • Explode the array "a".
  • Rebuild each element as a new struct of (movies.movie, f, g).
  • Collect the structs back into the array "a".

df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))

Complete working code:

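import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if one exists (for example in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

# Sample data with the same nesting as the question (all fields are strings here).
df = spark.createDataFrame(data=[
    [[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")

df.printSchema()
# df.show(truncate=False)

# Explode "a", rebuild each element as (movies.movie, f, g), then collect back into an array.
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))

df.printSchema()
# df.show(truncate=False)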

Schema before:

root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- movies: struct (nullable = true)
 |    |    |    |-- movie: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |    |-- c: string (nullable = true)
 |    |    |    |    |    |-- d: string (nullable = true)
 |    |    |    |    |    |-- e: string (nullable = true)
 |    |    |-- f: string (nullable = true)
 |    |    |-- g: string (nullable = true)

Schema after:

root
 |-- a: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- movies: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |-- c: string (nullable = true)
 |    |    |    |    |-- d: string (nullable = true)
 |    |    |    |    |-- e: string (nullable = true)
 |    |    |-- f: string (nullable = true)
 |    |    |-- g: string (nullable = true)