PySpark: extract the first field of a struct column into a dictionary

Posted by hc2pp10m on 2022-11-28 in Spark

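I need to create a dictionary from the schema (a pyspark.sql.types.StructType) of a Spark DataFrame.
The code needs to iterate over the whole StructType, pick out only the StructField elements whose type is StructType, and add each one to the dictionary with the parent StructField name as the key and the name of its first nested (child) StructField as the value.
Example schema (StructType):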

root
|-- field_1: int
|-- field_2: int
|-- field_3: struct
|    |-- date: date
|    |-- timestamp: timestamp
|-- field_4: int

Expected result:

{"field_3": "date"}
0sgqnhkj1#

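You can use a dictionary comprehension to walk through the schema. Iterating over df.schema yields its StructField objects, and indexing a StructType with [0] returns its first field.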

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
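The comprehension above raises an IndexError if a struct has no fields at all. As a minimal sketch of a more defensive variant, the function below works on the plain-dict form of a schema (the shape returned by `df.schema.jsonValue()`), so it can be tried without a Spark session. The name `first_nested_names` and the hand-written `example_schema` dict are hypothetical and only serve as illustration.

```python
def first_nested_names(schema_json):
    """Map each top-level struct field to the name of its first child field.

    `schema_json` is assumed to be the dict form of a schema, for example
    the result of `df.schema.jsonValue()`.
    """
    result = {}
    for field in schema_json["fields"]:
        data_type = field["type"]
        # Simple types are plain strings such as "integer"; structs are dicts.
        if isinstance(data_type, dict) and data_type.get("type") == "struct":
            children = data_type["fields"]
            if children:  # skip empty structs instead of raising IndexError
                result[field["name"]] = children[0]["name"]
    return result


# Hand-written dict mirroring the example schema from the question.
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "field_1", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "field_2", "type": "integer", "nullable": True, "metadata": {}},
        {
            "name": "field_3",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "date", "type": "date", "nullable": True, "metadata": {}},
                    {"name": "timestamp", "type": "timestamp", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
        {"name": "field_4", "type": "integer", "nullable": True, "metadata": {}},
    ],
}

print(first_nested_names(example_schema))
# {'field_3': 'date'}
```

With a real DataFrame the call would be `first_nested_names(df.schema.jsonValue())`.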

Test 1

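from pyspark.sql import SparkSession

# Reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([], 'field_1 int, field_2 int, field_3 struct<date:date,timestamp:timestamp>, field_4 int')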

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: integer (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)
#  |-- field_4: integer (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_3': 'date'}

Test 2

df = spark.createDataFrame([], 'field_1 int, field_2 struct<col_int:int,col_long:long>, field_3 struct<date:date,timestamp:timestamp>')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: struct (nullable = true)
#  |    |-- col_int: integer (nullable = true)
#  |    |-- col_long: long (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_2': 'col_int', 'field_3': 'date'}
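Both tests only contain structs at the top level. If structs nested inside other structs should be reported as well (an assumption, since the question does not ask for it), a recursive walk could be sketched as follows. It again works on the plain-dict form returned by `df.schema.jsonValue()`. The function name `first_nested_names_deep`, the dotted-path keys, and the sample dict are hypothetical choices for illustration, and structs inside arrays or maps are not visited.

```python
def first_nested_names_deep(schema_json, prefix=""):
    """Like the top-level lookup, but also descends into nested structs.

    Keys are dotted paths such as "outer.inner".
    """
    result = {}
    for field in schema_json["fields"]:
        data_type = field["type"]
        if isinstance(data_type, dict) and data_type.get("type") == "struct":
            path = prefix + field["name"]
            children = data_type["fields"]
            if children:  # an empty struct has no first field to report
                result[path] = children[0]["name"]
            # A struct's type dict has the same shape as a schema dict,
            # so the same function can process its children.
            result.update(first_nested_names_deep(data_type, path + "."))
    return result


# Hypothetical schema: a struct that itself contains a struct.
nested_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "outer",
            "type": {
                "type": "struct",
                "fields": [
                    {
                        "name": "inner",
                        "type": {
                            "type": "struct",
                            "fields": [
                                {"name": "a", "type": "integer", "nullable": True, "metadata": {}},
                            ],
                        },
                        "nullable": True,
                        "metadata": {},
                    },
                    {"name": "b", "type": "integer", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

print(first_nested_names_deep(nested_schema))
# {'outer': 'inner', 'outer.inner': 'a'}
```

Dotted paths keep the keys unique when two structs at different levels share a field name.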
