Python: Move deeply nested fields one level up in a PySpark DataFrame

Posted by vsaztqbk on 2023-01-16 in Python

I have a PySpark DataFrame created from XML. Because of the way the XML is structured, the DataFrame's schema has an extra, unnecessary level of nesting.
Schema of the current DataFrame:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: struct (nullable = true)
|    |    |    |-- movie: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)

I am trying to replace the movies struct with the movie array nested beneath it, like this:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)

The closest I have come is:

from pyspark.sql import functions as F

df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))

This results in the following schema:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: struct (nullable = true)
|    |    |    |-- movie: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)
|    |    |-- movies_new: array (nullable = true)
|    |    |    |-- element: array (containsNull = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)

I understand why this happens, but I thought that if I never extracted the nested array out of "a", it might not become an array of arrays.
Any suggestions?

Source: https://stackoverflow.com/questions/75121999/move-deeply-nested-fields-one-level-up-in-pyspark-dataframe

1 Answer

n9vozmp41#

The logic is:

Explode the array "a".
Rebuild the struct as (movies.movie, f, g).
Collect "a" back into an array.

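# Step 1: produce one row per element of "a"
df = df.withColumn("a", F.explode("a"))
# Step 2: rebuild the struct, promoting movies.movie to movies
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
# Step 3: collect the structs back into one array.
# Note: this aggregates over the whole DataFrame, so all rows end up in a single array.
df = df.select(F.collect_list("a").alias("a"))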

Full working code:

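from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create (or reuse) a session; in a notebook or the pyspark shell, `spark` already exists.
spark = SparkSession.builder.getOrCreate()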

df = spark.createDataFrame(data=[
    [[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")

df.printSchema()
# df.show(truncate=False)

df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))

df.printSchema()
# df.show(truncate=False)

Schema before:

root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- movies: struct (nullable = true)
 |    |    |    |-- movie: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |    |-- c: string (nullable = true)
 |    |    |    |    |    |-- d: string (nullable = true)
 |    |    |    |    |    |-- e: string (nullable = true)
 |    |    |-- f: string (nullable = true)
 |    |    |-- g: string (nullable = true)

Schema after:

root
 |-- a: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- movies: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |-- c: string (nullable = true)
 |    |    |    |    |-- d: string (nullable = true)
 |    |    |    |    |-- e: string (nullable = true)
 |    |    |-- f: string (nullable = true)
 |    |    |-- g: string (nullable = true)

Answered 2023-01-16
