pyspark — In Spark SQL, how can I select a subset of columns from a nested struct and keep them as a nested struct in the result, using an SQL statement?

olhwl3o2 · posted 2023-02-18 in Spark

I can run the following statement in Spark SQL:

result_df = spark.sql("""select
    one_field,
    field_with_struct
  from purchases""")

The resulting DataFrame will carry the complete structure from field_with_struct:

| one_field | field_with_struct |
| --------- | ----------------- |
| 123 | {name1, value1, value2, f2, f4} |
| 555 | {name2, value3, value4, f6, f7} |
I want to select only a few fields from field_with_struct, but keep them wrapped in a struct in the resulting DataFrame. If it were possible (this is not real code):

result_df = spark.sql("""select
    one_field,
    struct(
      field_with_struct.name,
      field_with_struct.value2
    ) as my_subset
  from purchases""")

to get this:

| one_field | my_subset |
| --------- | --------- |
| 123 | {name1, value2} |
| 555 | {name2, value4} |
Is there a way to achieve this in SQL? (The fluent API is not an option.)


waxmsbnn1#

There is a much simpler solution using arrays_zip — no explode/collect_list needed (which can be error-prone and tricky with complex data, since it relies on having something like an id column):

>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs      |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+

EDIT: adding the corresponding Spark SQL code, since that is what the OP asked for:

>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+

fcy6dtqo2#

In fact, the pseudo-code I provided does work. It is not as straightforward for a nested array of objects, though. First the array has to be exploded (the EXPLODE() function), then the subset of fields selected. After that, you can apply COLLECT_LIST():

WITH
  unfold_by_items AS (SELECT id, EXPLODE(Items) AS item FROM spark_tbl_items)
, format_items as (SELECT
    id
    , STRUCT(
              item.item_id
            , item.name
        ) AS item
    FROM unfold_by_items)
, fold_by_items AS (SELECT id, COLLECT_LIST(item) AS Items FROM format_items GROUP BY id)

SELECT * FROM fold_by_items

This selects only two fields from the structs inside Items and, in the end, returns a dataset that again contains an array named Items.
