Derived schema in PySpark raises a ValueError

dm7nw8vv  posted on 2022-12-11  in  Spark

I have a JSON response from a REST API call:

{
    "metadata": {
        "count": 2
    },
    "payload": [
        {
            "id": "12",
            "id1": "90",
            "id2": "2",
            "year": " 2025"
        },
        {
            "id": "13",
            "id1": "100",
            "id2": "3",
            "year": " 2023"
        }
    ]
}

I want to write a schema to pass to a UDF.
It looks like this:

schema = StructType([
    StructField("metadata", StringType(), True),
    StructField("payload", ArrayType(
    StructType([
        StructField("id", IntegerType()),
        StructField("id1", IntegerType())
    ])
    ))
])

When I pass this schema to the UDF and call it, I get the following error:

ValueError: Unexpected tuple with StructType

I tried to generate the schema from the REST API response and expected it to return a JSON type.

h4cxqtbf  #1

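You left out the nested struct that holds "count", and the types of "id", "id1", and so on do not match: the JSON values are strings such as "12", while the schema declares them as IntegerType.
Also, the data passed to createDataFrame() must be an iterable of rows, such as an array or a list. A single dict or JSON object cannot be passed directly, so wrap it in a list.
The correct schema looks like this: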

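from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Reuse the active session, or create one when running outside a notebook or shell.
spark = SparkSession.builder.getOrCreate()

# The parsed API response as a Python dict.
data = {"metadata":{"count":2},"payload":[{"id":"12","id1":"90","id2":"2","year":"2025"},{"id":"13","id1":"100","id2":"3","year":"2023"}]}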

schema = StructType([
    StructField(
        "metadata", 
        StructType([
            StructField("count", IntegerType())
        ])
    ),
    StructField(
        "payload",
        ArrayType(
            StructType([
                StructField("id", StringType()),
                StructField("id1", StringType()),
                StructField("id2", StringType()),
                StructField("year", StringType()),
            ])
        )
    )
])

df = spark.createDataFrame(data=[data], schema=schema)

df.show(truncate=False)

+--------+---------------------------------------+
|metadata|payload                                |
+--------+---------------------------------------+
|{2}     |[{12, 90, 2, 2025}, {13, 100, 3, 2023}]|
+--------+---------------------------------------+
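
If integer columns are preferred over the string columns in the schema above, one option is to cast the values in plain Python before building the DataFrame. The following is only a minimal sketch under stated assumptions: the helper name `to_typed_rows` and the `INT_FIELDS` tuple are hypothetical, and it assumes every listed field always holds a string that `int()` can parse.

```python
import json

# Sample response copied from the question.
raw = '''{"metadata": {"count": 2},
 "payload": [{"id": "12", "id1": "90", "id2": "2", "year": " 2025"},
             {"id": "13", "id1": "100", "id2": "3", "year": " 2023"}]}'''

# Assumption: these four payload fields should all become integers.
INT_FIELDS = ("id", "id1", "id2", "year")


def to_typed_rows(response_text):
    """Parse the response text and return a one-element list of typed records."""
    response = json.loads(response_text)
    payload = [
        # int() ignores surrounding whitespace, so " 2025" becomes 2025.
        {key: int(value) if key in INT_FIELDS else value
         for key, value in item.items()}
        for item in response["payload"]
    ]
    record = {
        "metadata": {"count": int(response["metadata"]["count"])},
        "payload": payload,
    }
    # Wrap the single record in a list: one element per DataFrame row.
    return [record]


rows = to_typed_rows(raw)
print(rows[0]["payload"][0])  # {'id': 12, 'id1': 90, 'id2': 2, 'year': 2025}
```

The resulting `rows` list should then pair with a schema that declares those four fields as IntegerType instead of StringType, and it can be passed as the `data` argument in place of `[data]`.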
