hive - Spark SQL: select distinct records from a Hive map data type column

tpxzln5u  posted 2022-11-29 in Hive
Follow (0) | Answers (2) | Views (183)

I have a Hive table with a column of type map, and I get errors when running the following Spark SQL queries:

Table structure: a name column of type string and a details column of type map<string,string>.
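Presumably the first attempt was a plain SELECT DISTINCT over both columns; a minimal query of that shape reproduces the first error:

df = spark.sql("""select distinct name, details
                  from table_name""")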
AnalysisException: Cannot have map type columns in DataFrame which calls set operations (intersect, except, etc.), but the type of column details is map<string,string>;

df = spark.sql("""select name, details 
                  from table_name
                  group by name, details""")

AnalysisException: expression table_name.details cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type;

df = spark.sql("""
            WITH cte_row_num AS (
                SELECT name
                       ,details
                       ,ROW_NUMBER() OVER (
                              PARTITION BY name
                                          ,details 
                              ORDER BY name) as row_num 
                FROM table_name) 
            SELECT name
                  ,details 
            FROM cte_row_num 
            WHERE row_num = 1
           """)

java.lang.IllegalStateException: grouping/join/window partition keys cannot be map type.


bmp9r5qi 1#

You can first use the ROW_NUMBER() window function to number the rows within each partition and then keep only the rows whose row number is 1. Since a map column cannot be used as a partition key directly, partition by the sorted map keys and sorted map values instead.
Example input:

df = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
df.createOrReplaceTempView("table_name")
df.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

Extracting only the distinct records:

df_row_num = spark.sql("""
    WITH cte_row_num AS (
        SELECT name
              ,details 
              ,ROW_NUMBER() OVER (
                  PARTITION BY name
                              ,sort_array(map_keys(details))
                              ,sort_array(map_values(details))
                  ORDER BY name) as row_num
        FROM table_name)
    SELECT name
          ,details 
    FROM cte_row_num
    WHERE row_num = 1
""")

df_row_num.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
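For reference, the same approach can be written with the DataFrame API instead of SQL; this is a sketch against the df created above, with the window mirroring the PARTITION BY of the query:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# a map column cannot be a partition key, so partition by its sorted keys and values
w = Window.partitionBy(
        'name',
        F.sort_array(F.map_keys('details')),
        F.sort_array(F.map_values('details'))
    ).orderBy('name')

df_row_num = (df
    .withColumn('row_num', F.row_number().over(w))
    .filter('row_num = 1')
    .drop('row_num'))

df_row_num.show()
# expect the same single deduplicated row as above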

7hiiyaii 2#

It looks like you could use map_entries to convert the map column into an array of structs (which is comparable), take a distinct on that, and then convert it back into a map column with map_from_entries.
Here's a working example:

data_sdf.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

data_sdf.createOrReplaceTempView('data_tbl')

spark.sql('''
    select name, map_from_entries(details_entries) as details
    from (
        select distinct name, sort_array(map_entries(details)) as details_entries
        from data_tbl)
    '''). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
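The same idea can also be expressed with the DataFrame API; this is a sketch assuming Spark 3.0+ (where map_entries and map_from_entries have Python wrappers) and the same data_sdf:

from pyspark.sql import functions as F

deduped_sdf = (data_sdf
    # map -> sorted array of structs, which is comparable and can be deduplicated
    .withColumn('details_entries', F.sort_array(F.map_entries('details')))
    .dropDuplicates(['name', 'details_entries'])
    # array of structs -> back to a map
    .select('name', F.map_from_entries('details_entries').alias('details')))

deduped_sdf.show()
# expect the same single deduplicated row as above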
