JSON string object to DataFrame in PySpark

Posted by b5buobof on 2023-08-02 in Spark


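I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying column names. I have a large number of different tables, so the conversion has to be dynamic. I managed to do it with sc.parallelize, but I am working in Databricks and we are migrating to Unity Catalog, so I had to create a shared access cluster, where sc.parallelize and some other libraries do not work.
I have a JSON string stored in a variable that looks like this, although the original has many more values: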

value_json = [{'id': '00043b01-c002-4df6-b453-8d7cd043e1a1', 'classification': None, 'createdDateTime': '2018-08-02T17:04:48Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': ['ExchangeProvisioningFlags:481']}, {'id': '00086d95-a5ac-4ad7-b81c-4c1561c49cb1', 'classification': None, 'createdDateTime': '2018-06-18T15:27:24Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': []}]

As I mentioned, I managed to do it as follows, but I need another solution, possibly with PySpark:

import json
df = sc.parallelize(value_json).map(lambda x: json.dumps(x))
df2 = spark.read.json(df)
display(df2)

Result:
| createdDateTime | id |
| -- | -- |
| 2018-06-18T15:27:24Z | 00086d95-a5ac-4ad7-b81c-4c1561c49cb1 |
| 2018-08-02T17:04:48Z | 00043b01-c002-4df6-b453-8d7cd043e1a1 |


Source: https://stackoverflow.com/questions/76624246/json-string-object-to-dataframe-in-pyspark

1 Answer


vlf7wbxs1#

Here is one way to do it. Build the PySpark DataFrame from a pandas DataFrame:

import pandas as pd

value_json = [{'id': '00043b01-c002-4df6-b453-8d7cd043e1a1', 'classification': None, 'createdDateTime': '2018-08-02T17:04:48Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': ['ExchangeProvisioningFlags:481']}, {'id': '00086d95-a5ac-4ad7-b81c-4c1561c49cb1', 'classification': None, 'createdDateTime': '2018-06-18T15:27:24Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': []}]

dt = pd.DataFrame(value_json)

sp_df = spark.createDataFrame(dt)
display(sp_df)

As you mentioned, this raises the error: ValueError: Some of types cannot be determined after inferring
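To see which columns are likely behind that error, it can help to look for columns that never hold a typed value. The sketch below is plain Python and does not call Spark. The helper name `undetermined_columns` is hypothetical, and the assumption is that inference fails for columns whose values are all `None` or only empty lists. The records are a trimmed copy of the sample from the question.

```python
# Trimmed copy of the question's sample records (lists shortened for brevity).
value_json = [
    {"id": "00043b01-c002-4df6-b453-8d7cd043e1a1", "classification": None,
     "createdDateTime": "2018-08-02T17:04:48Z",
     "proxyAddresses": ["SMTP:Atividades789@softwareone.onmicrosoft.com"],
     "creationOptions": ["ExchangeProvisioningFlags:481"]},
    {"id": "00086d95-a5ac-4ad7-b81c-4c1561c49cb1", "classification": None,
     "createdDateTime": "2018-06-18T15:27:24Z",
     "proxyAddresses": ["SMTP:Atividades789@softwareone.onmicrosoft.com"],
     "creationOptions": []},
]


def undetermined_columns(records):
    """Hypothetical helper: list columns in which no record reveals a type."""

    def reveals_type(value):
        # None carries no type; a list only helps if some element does.
        if value is None:
            return False
        if isinstance(value, list):
            return any(reveals_type(item) for item in value)
        return True

    # Collect column names in first-seen order across all records.
    names = []
    for record in records:
        for name in record:
            if name not in names:
                names.append(name)

    return [
        name for name in names
        if not any(reveals_type(record.get(name)) for record in records)
    ]


print(undetermined_columns(value_json))  # ['classification']
```

Here only `classification` is flagged, because it is `None` in every record. `creationOptions` is not flagged, since the first record supplies a string element.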
You can also try the following approach, which passes an explicit schema:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

value_json = [{'id': '00043b01-c002-4df6-b453-8d7cd043e1a1', 'classification': None, 'createdDateTime': '2018-08-02T17:04:48Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': ['ExchangeProvisioningFlags:481']}, {'id': '00086d95-a5ac-4ad7-b81c-4c1561c49cb1', 'classification': None, 'createdDateTime': '2018-06-18T15:27:24Z', 'proxyAddresses': ['SMTP:Atividades789@softwareone.onmicrosoft.com', 'SPO:SPO_4e3b75d6-716f-40c3-8b59-3b474c59a9f8@SPO_[REDACTED]'], 'creationOptions': []}]

schema = StructType([
    StructField("id", StringType(), True),
    StructField("classification", StringType(), True),
    StructField("createdDateTime", StringType(), True),
    StructField("proxyAddresses", ArrayType(StringType()), True),
    StructField("creationOptions", ArrayType(StringType()), True)
])

dt = pd.DataFrame(value_json)

sp_df = spark.createDataFrame(dt, schema=schema)

sp_df.show()
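The schema above is written by hand, while the question asks for something dynamic across many tables. One possible direction, shown here only as a sketch, is to derive a DDL schema string from the records themselves and fall back to `string` for columns that contain only nulls. The helper names `spark_type_name` and `build_ddl_schema` are hypothetical, the type mapping covers only flat values and lists, and column names are assumed to be plain identifiers that need no quoting. Passing the resulting string to `spark.createDataFrame` has not been verified on a shared access cluster.

```python
# Trimmed copy of the question's sample records (lists shortened for brevity).
value_json = [
    {"id": "00043b01-c002-4df6-b453-8d7cd043e1a1", "classification": None,
     "createdDateTime": "2018-08-02T17:04:48Z",
     "proxyAddresses": ["SMTP:Atividades789@softwareone.onmicrosoft.com"],
     "creationOptions": ["ExchangeProvisioningFlags:481"]},
    {"id": "00086d95-a5ac-4ad7-b81c-4c1561c49cb1", "classification": None,
     "createdDateTime": "2018-06-18T15:27:24Z",
     "proxyAddresses": ["SMTP:Atividades789@softwareone.onmicrosoft.com"],
     "creationOptions": []},
]


def spark_type_name(values):
    """Hypothetical helper: map a column's Python values to a Spark SQL type name."""
    seen = [v for v in values if v is not None]
    if not seen:
        return "string"  # assumption: all-null (or empty) columns become string
    if all(isinstance(v, bool) for v in seen):
        return "boolean"
    if all(type(v) is int for v in seen):
        return "bigint"
    if all(isinstance(v, float) for v in seen):
        return "double"
    if all(isinstance(v, str) for v in seen):
        return "string"
    if all(isinstance(v, list) for v in seen):
        # Infer the element type from every element of every list.
        elements = [item for v in seen for item in v]
        return "array<" + spark_type_name(elements) + ">"
    raise TypeError("mixed or unsupported value types: " + repr(seen[:3]))


def build_ddl_schema(records):
    """Hypothetical helper: build a DDL schema string such as 'id string, ...'."""
    names = []
    for record in records:
        for name in record:
            if name not in names:
                names.append(name)
    return ", ".join(
        name + " " + spark_type_name([r.get(name) for r in records])
        for name in names
    )


ddl = build_ddl_schema(value_json)
print(ddl)

# Assumed usage with an existing SparkSession named `spark` (not run here):
# sp_df = spark.createDataFrame(value_json, schema=ddl)
```

For the sample records this produces the same five fields as the hand-written schema, without going through pandas. Nested dictionaries and columns with mixed types raise a `TypeError` in this sketch and would need their own handling.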


2023-08-02
