Select distinct records from a Hive map type column using Spark SQL

Posted by 4ioopgfo on 2022-10-07 in Spark

I have a Hive table with a column of type map, and I get errors when running the following Spark SQL queries:

df = spark.sql("""select distinct name, details from table_name""");

AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column details is map<string,string>;

df = spark.sql("""select name, details 
                  from table_name
                  group by name, details""");

AnalysisException: expression table_name.details cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.

Table:

Column_name         datatype
----------------------------------------
name                string
details             map<string,string>

#1 llycmphe

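You can first number the rows within each partition using the ROW_NUMBER() window function, and then select only the rows where the row number is 1. Partitioning by name together with the sorted map keys and sorted map values places identical records in the same partition.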

Example input:

df = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
df.createOrReplaceTempView("table_name")
df.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

Extracting only the distinct records:

df_row_num = spark.sql("""
    WITH cte_row_num AS (
        SELECT name
              ,details 
              ,ROW_NUMBER() OVER (
                  PARTITION BY name
                              ,sort_array(map_keys(details))
                              ,sort_array(map_values(details))
                  ORDER BY name) as row_num
        FROM table_name)
    SELECT name
          ,details 
    FROM cte_row_num
    WHERE row_num = 1
""")

df_row_num.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
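
A caveat about the query above, offered as a side note rather than as part of the original answer: partitioning on the sorted keys and the sorted values as two separate arrays can put different maps into the same partition, for example {a -> 1, b -> 2} and {a -> 2, b -> 1}. The plain-Python sketch below models both kinds of partition key. It does not use Spark, and the helper names are made up for illustration. If this matters for your data, partitioning by sort_array(map_entries(details)) instead should keep each key paired with its value, assuming Spark 3.0 or later, where map_entries is available.

```python
# Plain-Python model of two candidate partition keys (no Spark required).
# Both helper functions are hypothetical and only mimic the SQL expressions.

def key_from_separate_arrays(details):
    """Mimics (sort_array(map_keys(d)), sort_array(map_values(d)))."""
    return (tuple(sorted(details.keys())), tuple(sorted(details.values())))


def key_from_entries(details):
    """Mimics sort_array(map_entries(d)): each key stays paired with its value."""
    return tuple(sorted(details.items()))


m1 = {"a": "1", "b": "2"}
m2 = {"a": "2", "b": "1"}  # a different map with the same key set and value set

# Separate arrays: both maps produce the same key, so they would share a partition.
print(key_from_separate_arrays(m1) == key_from_separate_arrays(m2))  # True

# Sorted entries: the keys differ, so the two maps stay in separate partitions.
print(key_from_entries(m1) == key_from_entries(m2))  # False
```

With the single-entry sample data used in this answer, the two keys behave the same. The difference only shows up when a map has several entries.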

#2 guz6ccqo

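It seems you can convert the map column to an array of structs using map_entries and then take the DISTINCT. After that, convert the result back to a map column with map_from_entries.

Here is a working example: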

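# Assumed setup: the original answer does not show how data_sdf was created;
# this mirrors the sample data from the first answer.
data_sdf = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
data_sdf.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+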

data_sdf.createOrReplaceTempView('data_tbl')

spark.sql('''
    select name, map_from_entries(details_entries) as details
    from (
        select distinct name, sort_array(map_entries(details)) as details_entries
        from data_tbl)
    ''').show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
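
As a rough intuition for why this works: Spark cannot compare map values directly, but it can compare a sorted array of key/value structs. Python dicts behave similarly, since they cannot be put into a set until they are turned into something hashable. The sketch below is a plain-Python model of the three steps in the query above, not Spark code. The function name and the extra sample row are assumptions added for illustration.

```python
def distinct_rows(rows):
    """Remove duplicate (name, details) rows, keeping the first occurrence."""
    seen = set()
    result = []
    for name, details in rows:
        # Step 1, like sort_array(map_entries(details)):
        # turn the map into an ordered, comparable sequence of pairs.
        entries = tuple(sorted(details.items()))
        # Step 2, like SELECT DISTINCT: skip rows whose key was already seen.
        if (name, entries) not in seen:
            seen.add((name, entries))
            # Step 3, like map_from_entries: rebuild the map from the pairs.
            result.append((name, dict(entries)))
    return result


# Two identical rows plus one extra row (hypothetical) with a different value.
rows = [("n", {"m": "1"}), ("n", {"m": "1"}), ("n", {"m": "2"})]
print(distinct_rows(rows))  # [('n', {'m': '1'}), ('n', {'m': '2'})]
```

Sorting the entries before comparing is what makes two maps with the same content but a different insertion order count as equal.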
