Pyspark -将具有数组结构的列扩展为新列

cbjzeqam  于 2023-01-29  发布在  Spark
关注(0)|答案(2)|浏览(104)

我有一个DataFrame,它有一个结构类型的单列,包含一个数组。

users_tp_df.printSchema()
root
 |-- x: struct (nullable = true)
 |    |-- ActiveDirectoryName: string (nullable = true)
 |    |-- AvailableFrom: string (nullable = true)
 |    |-- AvailableFutureAllocation: long (nullable = true)
 |    |-- AvailableFutureHours: double (nullable = true)
 |    |-- CreateDate: string (nullable = true)
 |    |-- CurrentAllocation: long (nullable = true)
 |    |-- CurrentAvailableHours: double (nullable = true)
 |    |-- CustomFields: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)

我正在尝试将CustomFields数组列转换为3三列:
1.国家;
1.外部;
1.服务。
例如,我有这些值:

并且该行除外的最终 Dataframe 输出将是:

有谁能帮我实现这个目标吗?
谢谢大家!

j5fpnvbx

j5fpnvbx1#

这是可行的:

initial_expansion= df.withColumn("id", F.monotonically_increasing_id()).select("id","x.*");

final_df = initial_expansion\
     .join(initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))\
           .select("*", "CustomFields.*")\
           .groupBy("id").pivot("Name").agg(F.first("Value")), \
        "id").drop("CustomFields")

样品输入:儿子-{'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}

输入结构:

root
 |-- x: struct (nullable = true)
 |    |-- CurrentAvailableHours: integer (nullable = true)
 |    |-- CustomFields: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)

输出:

输出结构(可以删除ID):

root
 |-- id: long (nullable = false)
 |-- CurrentAvailableHours: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Service: string (nullable = true)
yi0zb3m4

yi0zb3m42#

考虑下面的模型结构,类似于你的例子中的模型结构,你可以使用内联函数以sql的方式完成它:

with alpha as (
select named_struct("alpha", "abc", "beta", 2.5, "gamma", 3, "delta"
                    , array(  named_struct("a", "x", "b", "y", "c", "z")
                            , named_struct("a", "xx", "b", "yy", "c","zz"))
                   ) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c) 
from alpha

结果:

实体模型结构:

相关问题