Extracting fields from JSON in PySpark

gdrx4gfi  posted on 2023-02-07  in Spark

I'm trying to extract only itineraries.element and validatingAirlineCodes, and then build a JSON in PySpark that contains just these two fields.

 |-- id: string (nullable = true)
 |-- instantTicketingRequired: boolean (nullable = true)
 |-- itineraries: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- lastTicketingDate: string (nullable = true)
 |-- nonHomogeneous: boolean (nullable = true)
 |-- numberOfBookableSeats: long (nullable = true)
 |-- oneWay: boolean (nullable = true)
 |-- price: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- pricingOptions: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- source: string (nullable = true)
 |-- travelerPricings: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |-- type: string (nullable = true)
 |-- validatingAirlineCodes: array (nullable = true)
 |    |-- element: string (containsNull = true)

I tried using df.select(), but I couldn't select the fields I need. What should I do?

n6lpvg4x  #1

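Your question may be answered by this documentation page on Spark By Examples:
https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/
Note that element is not a field you can select by name: it is how the schema labels the items of the itineraries array, and here each item is a map. To work with the individual maps, you first need to explode the array and then extract what you need from each element.
The PySpark documentation for explode is here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.explode.html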
Hope this helps!
