Python: flatten a nested array in a Spark DataFrame

Posted by kg7wmglp on 2021-07-12 in Spark


I am reading some JSON of the following form:

{"a": [{"b": {"c": 1, "d": 2}}]}

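That is, the array items are unnecessarily nested. Because this happens inside an array, the answers given in "How to flatten a struct in a Spark DataFrame?" do not apply directly.
This is how the DataFrame looks when parsed: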

root
|-- a: array
|    |-- element: struct
|    |    |-- b: struct
|    |    |    |-- c: integer
|    |    |    |-- d: integer

I would like to transform the DataFrame into:

root
|-- a: array
|    |-- element: struct
|    |    |-- b_c: integer
|    |    |-- b_d: integer

How can I alias the columns inside the array so as to effectively un-nest them?

python apache-spark pyspark

Source: https://stackoverflow.com/questions/66476940/flatten-nested-array-in-spark-dataframe

2 answers


krugob8w #1

You can use the transform higher-order function (available in Spark SQL since 2.4):

df2 = df.selectExpr("transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a")
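
For intuition, the lambda in that expression can be modelled in plain Python on the sample record from the question. This is only a sketch of what transform computes for each row, not Spark code; the function name flatten_record is made up for this illustration.

```python
import json


def flatten_record(record):
    """Plain-Python model of transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d))."""
    return {
        "a": [
            # One output struct per array element, with the fields of b hoisted up.
            {"b_c": x["b"]["c"], "b_d": x["b"]["d"]}
            for x in record["a"]
        ]
    }


record = json.loads('{"a": [{"b": {"c": 1, "d": 2}}]}')
print(flatten_record(record))  # {'a': [{'b_c': 1, 'b_d': 2}]}
```

On PySpark 3.1 or later, the same transformation should also be expressible through the DataFrame API with pyspark.sql.functions.transform and a Python lambda instead of a SQL string.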

Posted 2021-07-12

xfb7svmp #2

Using the approach from the accepted answer, I wrote a function that recursively un-nests a DataFrame (it also recurses into nested arrays):

from pyspark.sql.types import ArrayType, StructType

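def flatten(df, sentinel="x"):
    """Flatten nested structs, including structs inside arrays of structs.

    New field names are built by joining the field path with "_".
    `sentinel` names both the lambda variable used in transform(...) and
    the temporary wrapper struct column.
    """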
    def _gen_flatten_expr(schema, indent, parents, last, transform=False):
        def handle(field, last):
            path = parents + (field.name,)
            alias = (
                " as "
                + "_".join(path[1:] if transform else path)
                + ("," if not last else "")
            )
            if isinstance(field.dataType, StructType):
                yield from _gen_flatten_expr(
                    field.dataType, indent, path, last, transform
                )
            elif (
                isinstance(field.dataType, ArrayType) and
                isinstance(field.dataType.elementType, StructType)
            ):
                yield indent, "transform("
                yield indent + 1, ".".join(path) + ","
                yield indent + 1, sentinel + " -> struct("
                yield from _gen_flatten_expr(
                    field.dataType.elementType, 
                    indent + 2, 
                    (sentinel,), 
                    True, 
                    True
                )
                yield indent + 1, ")"
                yield indent, ")" + alias
            else:
                yield (indent, ".".join(path) + alias)

        try:
            *fields, last_field = schema.fields
        except ValueError:
            pass
        else:
            for field in fields:
                yield from handle(field, False)
            yield from handle(last_field, last)

    lines = []
    for indent, line in _gen_flatten_expr(df.schema, 0, (), True):
        spaces = " " * 4 * indent
        lines.append(spaces + line)

    expr = "struct(" + "\n".join(lines) + ") as " + sentinel
    return df.selectExpr(expr).select(sentinel + ".*")
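
To inspect the kind of SQL this recursion produces without a Spark session, here is a dependency-free sketch of the same idea. It is not the function above: the toy schema encoding (a dict for a struct, a one-element list holding a dict for an array of structs, anything else for a leaf) and the name flatten_exprs are assumptions made for this illustration, and it returns one expression per top-level column instead of wrapping everything in a single struct.

```python
def flatten_exprs(schema, parents=(), in_lambda=False, var="x"):
    """Return a list of Spark SQL select expressions for a toy schema."""
    exprs = []
    for name, dtype in schema.items():
        path = parents + (name,)
        # Inside a lambda the first path component is the lambda variable,
        # so it is dropped from the alias.
        alias = "_".join(path[1:] if in_lambda else path)
        if isinstance(dtype, dict):
            # Struct: recurse and let the children carry the longer path.
            exprs.extend(flatten_exprs(dtype, path, in_lambda, var))
        elif isinstance(dtype, list) and dtype and isinstance(dtype[0], dict):
            # Array of structs: rebuild each element with transform(...).
            inner = ", ".join(flatten_exprs(dtype[0], (var,), True, var))
            source = ".".join(path)
            exprs.append(f"transform({source}, {var} -> struct({inner})) as {alias}")
        else:
            # Leaf column: select it under its flattened name.
            exprs.append(".".join(path) + " as " + alias)
    return exprs


toy_schema = {"a": [{"b": {"c": "int", "d": "int"}}]}
print(flatten_exprs(toy_schema))
# ['transform(a, x -> struct(x.b.c as b_c, x.b.d as b_d)) as a']
```

For the schema in the question this reproduces the expression from the first answer. With a real DataFrame, the resulting list could be passed on as df.selectExpr(*exprs).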

Posted 2021-07-12
