pyspark — In Spark SQL, how can I select a subset of columns from a nested struct and keep them as a nested struct in the result, using an SQL statement?

olhwl3o2 · posted 2023-02-18 in Spark

I can run the following statement in Spark SQL:

result_df = spark.sql("""select
    one_field,
    field_with_struct
  from purchases""")

The resulting DataFrame will carry the complete structure from field_with_struct:

| one_field | field_with_struct |
| --------- | ----------------- |
| 123 | {name1, value1, value2, f2, f4} |
| 555 | {name2, value3, value4, f6, f7} |
I want to select only a few fields from field_with_struct, but keep them wrapped in a struct in the resulting DataFrame. If it were possible (this is not real code):

result_df = spark.sql("""select
    one_field,
    struct(
      field_with_struct.name,
      field_with_struct.value2
    ) as my_subset
  from purchases""")

to get this:

| one_field | my_subset |
| --------- | --------- |
| 123 | {name1, value2} |
| 555 | {name2, value4} |
Is there a way to achieve this in SQL? (The fluent API is not an option.)


waxmsbnn1#

There is a much simpler solution using arrays_zip — no explode/collect_list needed (which can be error-prone and tricky with complex data, since it relies on having something like an id column):

>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs      |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+

EDIT: adding the corresponding Spark SQL code, since that is what the OP asked for:

>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
 |-- array_of_structs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+

fcy6dtqo2#

In fact, the pseudo-code I provided does work. It is not as straightforward for a nested array of objects, though. First the array has to be exploded (the EXPLODE() function), then the subset of fields selected. After that, you can apply COLLECT_LIST():

WITH
  unfold_by_items AS (SELECT id, EXPLODE(Items) AS item FROM spark_tbl_items)
, format_items as (SELECT
    id
    , STRUCT(
              item.item_id
            , item.name
        ) AS item
    FROM unfold_by_items)
, fold_by_items AS (SELECT id, COLLECT_LIST(item) AS Items FROM format_items GROUP BY id)

SELECT * FROM fold_by_items

This selects only two fields from the structs inside Items and, in the end, returns a dataset that again contains an array named Items.
