How do I access struct elements in a PySpark DataFrame?

Asked by nqwrtyyt on 2021-05-29, in Spark

My PySpark DataFrame has the following schema:

```
root
 |-- maindata: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- label: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- unit: string (nullable = true)
 |    |    |    |-- dateTime: string (nullable = true)
```

For one row, `df.select(F.col("maindata")).show(1, False)` gives me:

```
|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790], [t2, 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050], [TripCount, 136, , 2019-04-06T13:02:08.790]]]|
```

I want to access the TripCount values in this row, e.g. `[TripCount -> 135, 136, ...]`. What is the best way to access this data? TripCount occurs multiple times, and is there any way to access just one field, e.g. only `maindata.label`?

Answer 1 (by 6l7fqoea):

I would suggest calling `explode` twice to turn the nested array elements into separate rows, and then either splitting the struct into separate columns or addressing the nested fields with dot syntax. For example:

```
from pyspark.sql.functions import col, explode

df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('data')).select(explode(col('data')).alias('data'))

>>> df2.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
 |    |-- _3: string (nullable = true)

>>> df2.filter(col("data._1") == "k1").show()
+------------+
|        data|
+------------+
|[k1, v1, v2]|
+------------+
```

Or you can extract the members of the struct into separate columns:

```
from pyspark.sql.functions import col, explode

df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('d')).select("d.*")

>>> df2.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)

>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
```
