How to append the parent value to exploded values in a PySpark DataFrame

brjng4g3  asked on 2021-07-14  in Spark

The data is:

data = [{"_id":"Inst001","Type":"AAAA", "Model001":[{"_id":"Mod001", "Name": "FFFF"},
                                                    {"_id":"Mod0011", "Name": "FFFF4"}]},
        {"_id":"Inst002", "Type":"BBBB", "Model001":[{"_id":"Mod002", "Name": "DDD"}]}]

The DataFrame needs to be built like this:

pid      _id      Name
Inst001  Mod001   FFFF
Inst001  Mod0011  FFFF4
Inst002  Mod002   DDD
My approach is:
explode the "Model001" column,
then append the parent _id to this exploded DataFrame. But how do I do this append in PySpark?
Is there a built-in method in PySpark that solves this?


fnx2tebb1#

Create a DataFrame with the proper schema, then `inline` the Model001 column:

df = spark.createDataFrame(
    data, 
    '_id string, Type string, Model001 array<struct<_id:string, Name:string>>'
).selectExpr('_id as pid', 'inline(Model001)')

df.show(truncate=False)
+-------+-------+-----+
|pid    |_id    |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD  |
+-------+-------+-----+
