PySpark - expand a JSON string column into multiple columns without specifying the schema

jm81lzqq · posted 12 months ago in Spark

I have the following JSON string as a column in a PySpark DataFrame:

{
   "result":{
      "version":"1.2",
      "timeStamp":"2023-08-14 14:00:12",
      "description":"",
      "data":{
         "DateTime_Received":"2023-08-14T14:01:10.4516457+01:00",
         "DateTime_Actual":"2023-08-14T14:00:12",
         "OtherInfo":null,
         "main":[
            {
               "Status":0,
               "ID":111,
               "details":null
            }
         ]
      },
      "tn":"aaa"
   }
}

I want to expand the above into multiple columns without hardcoding the schema.
I tried using schema_of_json to generate the schema from the JSON string:

df_decoded = df_decoded.withColumn("json_column", F.when(F.col("value").isNotNull(), F.col("value")).otherwise("{}"))

# Infer the schema using schema_of_json
json_schema = df_decoded.select(F.schema_of_json(F.col("json_column"))).collect()[0][0]

Here df_decoded is my DataFrame and value is the name of my JSON string column.
But it gives me the following error:

AnalysisException: cannot resolve 'schema_of_json(json_column)' due to data type mismatch: The input json should be a foldable string expression and not null; however, got json_column.;
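
The error says it outright: schema_of_json must be given a foldable (literal) string, not a column reference, because Spark resolves the schema at analysis time rather than per row. A minimal sketch of the usual workaround, sampling one non-null row and wrapping it in F.lit (DataFrame and column names taken from the question; this assumes every row shares the same JSON shape):

import pyspark.sql.functions as F

# grab one non-null sample value; schema_of_json only accepts a literal
sample_json = df_decoded.filter(F.col("value").isNotNull()).first()["value"]

# infer a DDL schema string from the sample, then parse every row with it
json_schema = df_decoded.select(F.schema_of_json(F.lit(sample_json))).collect()[0][0]
df_parsed = df_decoded.withColumn("parsed", F.from_json("value", json_schema))
df_parsed.select("parsed.result.*").show()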

My expected output: each of the nested JSON fields as its own column.

8ehkhllq 1#

Does this get you started?

import json
import pandas as pd

j = '''{
   "result":{
      "version":"1.2",
      "timeStamp":"2023-08-14 14:00:12",
      "description":"",
      "data":{
         "DateTime_Received":"2023-08-14T14:01:10.4516457+01:00",
         "DateTime_Actual":"2023-08-14T14:00:12",
         "OtherInfo":null,
         "main":[
            {
               "Status":0,
               "ID":111,
               "details":null
            }
         ]
      },
      "tn":"aaa"
   }
}'''

# parse the JSON string into a Python dict
text_json = json.loads(j)
result = text_json.get("result", {})
print(result.get("version", ""))

# pick out the fields of interest and pivot them into a single row
results = [result["version"], result["timeStamp"], result["description"], result["data"], result["tn"]]
df = pd.DataFrame(results).transpose()
print(df)

I don't have a real environment to play with; the .transpose() is the twist that pivots the fields into a single row.
This answer on grouping and filtering is more involved and may also help: https://stackoverflow.com/a/77263073/22187484
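
Staying with pandas, pd.json_normalize can flatten the nested dict without hand-picking keys, which matches the "no hardcoded schema" goal. A sketch reusing the j string defined above (the sep argument and the exact column names shown are just how json_normalize behaves, not anything from the original answer):

import json
import pandas as pd

# flatten nested dicts into underscore-joined column names without
# hardcoding the schema; lists such as "main" stay as object columns
# and would need a further explode
text_json = json.loads(j)
df_flat = pd.json_normalize(text_json["result"], sep="_")
print(df_flat.columns.tolist())
# e.g. ['version', 'timeStamp', 'description', 'tn',
#       'data_DateTime_Received', 'data_DateTime_Actual', ...]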

k5hmc34c 2#

Use Spark's schema inference to get the schema of the JSON column, convert the JSON column into a struct with from_json, and then use a select expression to expand the struct fields into columns:

# infer the schema by feeding the raw JSON strings to spark.read.json,
# then parse the column and expand the struct's fields with select
schema = spark.read.json(df.rdd.map(lambda r: r['value'])).schema
result = df.withColumn('value', F.from_json('value', schema)).select('*', 'value.result.*')
result.show()

+--------------------+--------------------+-----------+-------------------+---+-------+
|               value|                data|description|          timeStamp| tn|version|
+--------------------+--------------------+-----------+-------------------+---+-------+
|{{{2023-08-14T14:...|{2023-08-14T14:00...|           |2023-08-14 14:00:12|aaa|    1.2|
+--------------------+--------------------+-----------+-------------------+---+-------+
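
Note that value.result.* only expands one level, so data is still a struct and main is still an array. A sketch of a generic flattener that keeps expanding struct columns until none remain (flatten is a helper written for this post, not a Spark built-in; arrays like main would still need F.explode):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def flatten(df):
    # repeatedly promote struct fields to top-level columns,
    # prefixing with the parent name to avoid collisions
    while True:
        struct_cols = [f.name for f in df.schema.fields
                       if isinstance(f.dataType, StructType)]
        if not struct_cols:
            return df
        flat_cols = [c for c in df.columns if c not in struct_cols]
        df = df.select(
            *flat_cols,
            *[F.col(f"{sc}.{nf.name}").alias(f"{sc}_{nf.name}")
              for sc in struct_cols
              for nf in df.schema[sc].dataType.fields])

flat = flatten(result.drop('value'))
flat.show()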
