Splitting a complex string in a PySpark DataFrame column

k7fdbhmy  posted on 2022-11-21  in Spark

I have a PySpark DataFrame column that holds multiple addresses per row, in the following format:

id       addresses
1       [{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]

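I would like to transform it into the following:

| id | city | state | street | postalCode | country |
| -- | -- | -- | -- | -- | -- |
| 1 | null | null | 123, ABC St, ABC  Square | 11111 | USA |
| 1 | Dallas | TX | 456, DEF Plaza, Test St | 99999 | USA |

Any input on how to achieve this with PySpark? The dataset is large (several TB), so I would like to do this efficiently.
I tried splitting the address string on commas, but because the addresses themselves contain commas, the output is not what I expected. I think I need a regular expression pattern that uses the brackets, but I don't know how to write it. Also, how do I denormalize the data so that each address becomes its own row?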

nkoocmlb  #1

Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, '{"city":"New York","state":"NY","street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
    ('id', 'addresses'))
df.show(truncate=False)

# Pass the string column to an RDD so that Spark can infer its JSON schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

# Parse the string column with from_json, using the inferred schema
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))

# Expand the parsed struct into one column per field
df3.select('id', 'test_col.*').show(truncate=False)

+---+--------+-------+----------+-----+------------------------+
|id |city    |country|postalCode|state|street                  |
+---+--------+-------+----------+-----+------------------------+
|1  |New York|USA    |11111     |NY   |123, ABC St, ABC  Square|
+---+--------+-------+----------+-----+------------------------+
