Apache Spark: transform JSON into multiple DataFrames

Posted by mwg9r5ms on 2023-02-19 in Apache

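I have multiple JSON files with the structure below. Association can contain one or more objects, and Characteristic does not always have the same number of key-value pairs: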

{
"vl:VNETList": {
    "Template": {
        "ID": "SomeId",
        "Object": [
            {
                "ID": "my_first_id",
                "Context": {
                    "ID": "Avngate"
                },
                "Name": "Model Description",
                "ClassID": "PID",
                "Association": [
                    {
                        "Object": {
                            "ID": "test.svg",
                            "Context": {
                                "ID": "Avngate"
                            }
                        },
                        "@type": "is fulfilled by"
                    },
                    {
                        "Object": {
                            "ID": "Project Description",
                            "Context": {
                                "ID": "Avngate"
                            }
                        },
                        "@type": "is an element of"
                    }
                ],
                "Characteristic": [
                    {
                        "Name": "InfoType",
                        "Value": "image/svg+xml"
                    },
                    {
                        "Name": "LOCK",
                        "Value": false
                    },
                    {
                        "Name": "EXFI",
                        "Value": 10000
                    }
                ]
            },
            {
                "ID": "my_second_id",
                "Context": {
                    "ID": "Avngate2"
                },
                "Name": "Model Description2",
                "ClassID": "PID2",
                "Association": [
                    {
                        "Object": {
                            "ID": "test2.svg",
                            "Context": {
                                "ID": "Avngate"
                            }
                        },
                        "@type": "is fulfilled by"
                    }
                ],
                "Characteristic": [
                    {
                        "Name": "Dbtencoding",
                        "Value": "unicode"
                    }
                ]
            }
        ]
    }
}
}

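I want to build two DataFrames, like this:

The second DataFrame looks like this:

What is the best way to do this? If it is too complicated, I could also store the characteristics in a separate table that references the object ID, the same way the associations do.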

dsekswqp1#

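Read the JSON and explode the Object array into one row per object. For the first DataFrame, explode Characteristic, then groupBy the object fields and pivot on the characteristic name. For the second, explode Association and select the fields you need.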

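from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# multiLine=True: the file is one pretty-printed JSON document, not JSON Lines.
df1 = spark.read.json('test.json', multiLine=True)
# One row per element of the Object array, with its fields as top-level columns.
df2 = df1.select(f.explode('vl:VNETList.Template.Object').alias('value')) \
  .select('value.*')

# First DataFrame: one row per object, one column per Characteristic name.
# Column names are matched case-insensitively by default, so 'ClassId' finds ClassID.
df_f1 = df2.withColumn('Characteristic', f.explode('Characteristic')) \
   .groupBy('ID', 'Name', 'ClassId') \
   .pivot('Characteristic.Name') \
   .agg(f.first('Characteristic.Value'))

# Second DataFrame: one row per (object, association) pair.
df_f2 = df2.withColumn('Association', f.explode('Association')) \
   .select('ID', 'Association.Object.ID', 'Association.@type') \
   .toDF('ID', 'AssociationId', 'AssociationType')

df_f1.show()
df_f2.show()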

+------------+------------------+-------+-----------+-----+-------------+-----+
|          ID|              Name|ClassId|Dbtencoding| EXFI|     InfoType| LOCK|
+------------+------------------+-------+-----------+-----+-------------+-----+
| my_first_id| Model Description|    PID|       null|10000|image/svg+xml|false|
|my_second_id|Model Description2|   PID2|    unicode| null|         null| null|
+------------+------------------+-------+-----------+-----+-------------+-----+

+------------+-------------------+----------------+
|          ID|      AssociationId| AssociationType|
+------------+-------------------+----------------+
| my_first_id|           test.svg| is fulfilled by|
| my_first_id|Project Description|is an element of|
|my_second_id|          test2.svg| is fulfilled by|
+------------+-------------------+----------------+
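
To check what the explode and pivot steps should produce on a small sample, the same reshaping can be sketched in plain Python without Spark. This is only an illustration, not part of the answer's code: the `flatten` helper and the `SAMPLE` constant are hypothetical names, and the sample is a trimmed copy of the question's JSON with the Context fields left out.

```python
import json

# Trimmed copy of the sample document from the question (Context fields omitted).
SAMPLE = json.loads("""
{"vl:VNETList": {"Template": {"ID": "SomeId", "Object": [
  {"ID": "my_first_id", "Name": "Model Description", "ClassID": "PID",
   "Association": [
     {"Object": {"ID": "test.svg"}, "@type": "is fulfilled by"},
     {"Object": {"ID": "Project Description"}, "@type": "is an element of"}],
   "Characteristic": [
     {"Name": "InfoType", "Value": "image/svg+xml"},
     {"Name": "LOCK", "Value": false},
     {"Name": "EXFI", "Value": 10000}]},
  {"ID": "my_second_id", "Name": "Model Description2", "ClassID": "PID2",
   "Association": [
     {"Object": {"ID": "test2.svg"}, "@type": "is fulfilled by"}],
   "Characteristic": [
     {"Name": "Dbtencoding", "Value": "unicode"}]}
]}}}
""")


def flatten(doc):
    """Return (wide_rows, association_rows) for one parsed document."""
    objects = doc["vl:VNETList"]["Template"]["Object"]

    # "Pivot": every Characteristic name seen in any object becomes a column.
    names = sorted({c["Name"] for o in objects for c in o["Characteristic"]})

    wide_rows, association_rows = [], []
    for obj in objects:
        row = {"ID": obj["ID"], "Name": obj["Name"], "ClassID": obj["ClassID"]}
        # Characteristics an object lacks stay None, like the nulls after pivot.
        row.update({name: None for name in names})
        for c in obj["Characteristic"]:
            row[c["Name"]] = c["Value"]
        wide_rows.append(row)

        # "Explode": one output row per element of the Association array.
        for assoc in obj["Association"]:
            association_rows.append({
                "ID": obj["ID"],
                "AssociationId": assoc["Object"]["ID"],
                "AssociationType": assoc["@type"],
            })
    return wide_rows, association_rows


wide_rows, association_rows = flatten(SAMPLE)
for row in wide_rows + association_rows:
    print(row)
```

One difference from the Spark version: here each Value keeps its JSON type (boolean, number, string), whereas Spark has to settle on a single type for the Value column, and mixed types are typically inferred as string.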
