I am trying to extract only itineraries.element and validatingAirlineCodes, and then build JSON in PySpark that contains just these two fields. The DataFrame schema is:
|-- id: string (nullable = true)
|-- instantTicketingRequired: boolean (nullable = true)
|-- itineraries: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- lastTicketingDate: string (nullable = true)
|-- nonHomogeneous: boolean (nullable = true)
|-- numberOfBookableSeats: long (nullable = true)
|-- oneWay: boolean (nullable = true)
|-- price: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pricingOptions: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
|-- source: string (nullable = true)
|-- travelerPricings: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
|-- type: string (nullable = true)
|-- validatingAirlineCodes: array (nullable = true)
| |-- element: string (containsNull = true)
I tried using df.select(), but I could not select the fields I need. What should I do?
1 Answer
Your question may be answered by this documentation page on Spark By Examples:
https://sparkbyexamples.com/pyspark/select-columns-from-pyspark-dataframe/
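Note that what you are trying to extract (element) sits inside an array of maps. There is no straightforward way to pull out the individual elements without first exploding the array and then extracting from the resulting rows.
The PySpark documentation for explode is here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.explode.html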
Hope this helps!