Converting a JSON string object to a DataFrame in PySpark

b5buobof · asked 2023-08-02 · in Spark
Follow (0) | Answers (1) | Views (109)

I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying column names, because I have a large number of different tables, so it has to be dynamic. I managed to do it with sc.parallelize, but since I work in Databricks and we are migrating to Unity Catalog, I have to create shared access clusters, where sc.parallelize and some other libraries do not work.
I have a JSON string stored in a variable that looks like this (the original has many more values):

value_json = [{'id': '00043b01-c002-4df6-b453-8d7cd043e1a1', 'classification': None, 'createdDateTime': '2018-08-02T17:04:48Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': ['ExchangeProvisioningFlags:481']}, {'id': '00086d95-a5ac-4ad7-b81c-4c1561c49cb1', 'classification': None, 'createdDateTime': '2018-06-18T15:27:24Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': []}]

As I already mentioned, I managed to do it like this, but I need a different solution, ideally in plain PySpark:

import json

# Distribute the records as an RDD of JSON strings, then let Spark infer the schema
df = sc.parallelize(value_json).map(lambda x: json.dumps(x))
df2 = spark.read.json(df)
display(df2)


Test result:

| createdDateTime | id |
| -- | -- |
| 2018-06-18T15:27:24Z | 00086d95-a5ac-4ad7-b81c-4c1561c49cb1 |
| 2018-08-02T17:04:48Z | 00043b01-c002-4df6-b453-8d7cd043e1a1 |


vlf7wbxs1#

Here is one way to do it: use pandas to create the PySpark DataFrame:

import pandas as pd

value_json = [{'id': '00043b01-c002-4df6-b453-8d7cd043e1a1', 'classification': None, 'createdDateTime': '2018-08-02T17:04:48Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': ['ExchangeProvisioningFlags:481']}, {'id': '00086d95-a5ac-4ad7-b81c-4c1561c49cb1', 'classification': None, 'createdDateTime': '2018-06-18T15:27:24Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': []}]

# pandas builds the frame from the list of dicts, then Spark converts it
dt = pd.DataFrame(value_json)

sp_df = spark.createDataFrame(dt)
display(sp_df)

As you said, you got the error `ValueError: Some of types cannot be determined after inferring`.
You can also try the approach below, which supplies the schema explicitly:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

value_json = [{'id': '00043b01-c002-4df6-b453-8d7cd043e1a1', 'classification': None, 'createdDateTime': '2018-08-02T17:04:48Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': ['ExchangeProvisioningFlags:481']}, {'id': '00086d95-a5ac-4ad7-b81c-4c1561c49cb1', 'classification': None, 'createdDateTime': '2018-06-18T15:27:24Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': []}]

schema = StructType([
    StructField("id", StringType(), True),
    StructField("classification", StringType(), True),
    StructField("createdDateTime", StringType(), True),
    StructField("proxyAddresses", ArrayType(StringType()), True),
    StructField("creationOptions", ArrayType(StringType()), True)
])

dt = pd.DataFrame(value_json)

sp_df = spark.createDataFrame(dt, schema=schema)

sp_df.show()

