I have a PySpark DataFrame of students with the following schema:
|-- Id: string
|-- School: array
|    |-- element: struct
|    |    |-- Subject: string
|    |    |-- Classes: string
|    |    |-- Score: array
|    |    |    |-- element: struct
|    |    |    |    |-- ScoreID: string
|    |    |    |    |-- Value: string
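I want to extract some fields from the DataFrame and normalize them so they can be loaded into a database. The relational schema I expect consists of the fields `Id, School, Subject, ScoreId, Value`. How can I do this efficiently?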
1 Answer
qnakjoqk #1
`explode` each array level to flatten the data, then select all the required columns. Example:
```
df.show(10, False)
+---+--------------------------+
|Id |School                    |
+---+--------------------------+
|1  |[[b, [[A, 3], [B, 4]], a]]|
+---+--------------------------+

df.printSchema()
root
 |-- Id: string (nullable = true)
 |-- School: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Classes: string (nullable = true)
 |    |    |-- Score: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ScoreID: string (nullable = true)
 |    |    |    |    |-- Value: string (nullable = true)
 |    |    |-- Subject: string (nullable = true)

# explode() without an alias names its output column `col`.
# Step 1: one row per School element.
# Step 2: expand that struct with col.* and explode its nested Score array.
# Step 3: expand the Score struct into ScoreID and Value.
(df.selectExpr("Id", "explode(School)")
   .selectExpr("Id", "col.*", "explode(col.Score)")
   .selectExpr("Id", "Classes", "Subject", "col.*")
   .show())
+---+-------+-------+-------+-----+
| Id|Classes|Subject|ScoreID|Value|
+---+-------+-------+-------+-----+
|  1|      b|      a|      A|    3|
|  1|      b|      a|      B|    4|
+---+-------+-------+-------+-----+
```
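
If it helps to see why one input row turns into two output rows, here is a rough plain-Python model of what the two `explode` steps compute. It does not use Spark at all; the function name `flatten_students` and the `sample` record are made up for illustration, and the record is just the example row from this answer written as nested dicts and lists.

```python
# Plain-Python model of the two explode steps (no Spark required).
# `flatten_students` and `sample` are hypothetical names for illustration.
# Each record mirrors the schema above: an Id plus a list of School structs,
# each holding Classes, Subject and a list of Score structs.

def flatten_students(records):
    """Return one (Id, Classes, Subject, ScoreID, Value) tuple per score."""
    rows = []
    for record in records:
        # First explode: one row per element of the School array.
        for school in record.get("School") or []:
            # Second explode: one row per element of the nested Score array.
            for score in school.get("Score") or []:
                rows.append((
                    record["Id"],
                    school["Classes"],
                    school["Subject"],
                    score["ScoreID"],
                    score["Value"],
                ))
    return rows


sample = [{
    "Id": "1",
    "School": [{
        "Classes": "b",
        "Subject": "a",
        "Score": [
            {"ScoreID": "A", "Value": "3"},
            {"ScoreID": "B", "Value": "4"},
        ],
    }],
}]

for row in flatten_students(sample):
    print(row)
```

In this sketch a record whose `School` or `Score` list is empty or missing contributes no rows, which matches how Spark's `explode` drops rows with null or empty arrays. If those rows need to survive with nulls, `explode_outer` is the Spark variant to look at.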