Flatten nested JSON struct with dynamic keys in PySpark

Posted by 5n0oy7gb on 2023-05-06 in Spark


I am trying to process JSON files with PySpark that contain a struct column with dynamic keys.
The schema of the struct column looks like this:

{
  "UUID_KEY": {
     "time": STRING,
     "amount": INTEGER
  }
}

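The data looks like this:

| id | json_column |
| --- | --- |
| 1 | "{1: {amount: 1, time: 2}, 2: {amount: 10, time: 5}}" |
| 2 | "{3: {amount: 1, time: 2}, 4: {amount: 10, time: 5}}" |

Currently I keep the struct column as a string, because loading the JSON by specifying or inferring a schema does not work: the first-level keys are randomly generated and there is too much data. The second level is always the same; it contains amount and time.
Is there a way to flatten this JSON string into amount and time columns without knowing the first-level keys?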

pyspark

Source: https://stackoverflow.com/questions/76180934/flatten-nested-json-struct-with-dynamic-keys-in-pyspark

1 Answer

xmjla07d1#

This will work:

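from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, StructField, StructType

# Map of dynamic key -> fixed struct, so the first-level keys need not be known in advance.
map_schema = MapType(StringType(), StructType([
    StructField("amount", StringType(), True),
    StructField("time", StringType(), True),
]))

(
    df
    # Parse the JSON string into a map; the option accepts the unquoted field names in the data.
    .withColumn("json_column", F.from_json(F.col("json_column"), map_schema, {"allowUnquotedFieldNames": "true"}))
    # explode on a map column yields one row per entry, as key and value columns.
    .select("*", F.explode("json_column").alias("key", "value"))
    .select("id", "value.*")
    .show(truncate=False)
)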

Input:

+---+---------------------------------------------------+
|id |json_column                                        |
+---+---------------------------------------------------+
|1  |{1: {amount: 1, time: 2}, 2: {amount: 10, time: 5}}|
|2  |{3: {amount: 1, time: 2}, 4: {amount: 10, time: 5}}|
+---+---------------------------------------------------+

root
 |-- id: long (nullable = true)
 |-- json_column: string (nullable = true)

Output:

+---+------+----+
|id |amount|time|
+---+------+----+
|1  |1     |2   |
|1  |10    |5   |
|2  |1     |2   |
|2  |10    |5   |
+---+------+----+

Answered 2023-05-06
