How do I access struct elements in a PySpark DataFrame?

Asked by nqwrtyyt on 2021-05-29, in Spark

My PySpark DataFrame has the following schema:

```
root
 |-- maindata: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- label: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- unit: string (nullable = true)
 |    |    |    |-- dateTime: string (nullable = true)
```

For one row, `df.select(F.col("maindata")).show(1, False)` gives me:

```
|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790], [t2, 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050], [TripCount, 136, , 2019-04-06T13:02:08.790]]]|
```

I want to access the TripCount values in this row, e.g. `[TripCount -> 135, 136, ...]`. What is the best way to access this data? TripCount occurs multiple times, and is there any way to access just one field, e.g. only `maindata.label`?

Answer 1 (by 6l7fqoea):

I would suggest calling `explode` twice to turn the nested array elements into separate rows, and then either splitting the struct into separate columns or addressing the nested fields with dot syntax. For example:

```
from pyspark.sql.functions import col, explode

df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('data')).select(explode(col('data')).alias('data'))

>>> df2.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
 |    |-- _3: string (nullable = true)

>>> df2.filter(col("data._1") == "k1").show()
+------------+
|        data|
+------------+
|[k1, v1, v2]|
+------------+
```

Or you can extract the members of the struct into separate columns:

```
from pyspark.sql.functions import col, explode

df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('d')).select("d.*")

>>> df2.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)

>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
```
