hive查询优化

ccgok5k5 于 2021-06-27 发布在 Hive

关注(0)|答案(5)|浏览(367)

我的要求是获得学生的id和姓名有超过1个电子邮件id和类型=1。
我使用的查询

select distinct b.id, b.name, b.email, b.type,a.cnt
from (
  select id, count(email) as cnt
  from (
    select distinct id, email
    from table1
  ) c
  group by id
) a
join table1 b on a.id = b.id
where b.type=1
order by b.id

请让我知道这是罚款或任何简单的版本可用。

Sample data is like:
id name email type 
123 AAA abc@xyz.com 1
123 AAA acd@xyz.com 1
123 AAA ayx@xyz.com 3
345 BBB nch@xyz.com 1
345 BBB nch@xyz.com 1
678 CCC iuy@xyz.com 1

Expected Output:
123 AAA abc@xyz.com 1 2
123 AAA acd@xyz.com 1 2
345 BBB nch@xyz.com 1 1
678 CCC iuy@xyz.com 1 1

sql Hive performance query-optimization hiveql

来源：https://stackoverflow.com/questions/58213643/hive-query-optimization

5条答案

按热度按时间

tktrz96b1#

这与您的查询非常相似，但在这里，我在初始步骤（在内部查询中）过滤数据，以便连接不会在较少的数据上发生

select distinct b.id,b.name,b.email,b.type,intr_table.cnt from table1 orig_table join 
    (
        select a.id,a.type,count(a.email) as cnt from table1 as a where a.type=1 group by a
    ) intr_table on inter_table.id=orig_table.id,inter_table.type=orig_table.type

赞(0）回复(0）举报 2021-06-27

km0tfn4u2#

还有一种方法 collect_set 取返回数组的大小计算不同的邮件。
演示：

--your data example
with table1 as ( --use your table instead of this 
  select stack(6,
    123, 'AAA', 'abc@xyz.com', 1,
    123, 'AAA', 'acd@xyz.com', 1,
    123, 'AAA', 'ayx@xyz.com', 3,
    345, 'BBB', 'nch@xyz.com', 1,
    345, 'BBB', 'nch@xyz.com', 1,
    678, 'CCC', 'iuy@xyz.com', 1
  ) as (id, name, email, type )
) 

--query   
select distinct id, name, email, type, 
       size(collect_set(email) over(partition by id)) cnt
  from table1
 where type=1

结果：

id  name    email   type    cnt 
123 AAA  abc@xyz.com    1   2   
123 AAA  acd@xyz.com    1   2   
345 BBB  nch@xyz.com    1   1   
678 CCC  iuy@xyz.com    1   1

我们在这里仍然需要distinct，因为解析函数不会像case那样删除重复项 345 BBB nch@xyz.com .

赞(0）回复(0）举报 2021-06-27

omhiaaxx3#

你可以用 group by -> having count() 对于这个要求。

select distinct b.id
    , b.name,
    , b.email
    , b.type 
from table1 b
where id in 
    (select distinct id from table1 group by email, id having count(email) > 1) 
and b.type=1 
order by b.id

赞(0）回复(0）举报 2021-06-27

xmd2e60i4#

您可以尝试使用count（）函数的分析方式：

SELECT sub.ID, sub.NAME
  FROM (SELECT ID, NAME, COUNT (*) OVER (PARTITION BY ID, EMAIL) cnt
          FROM raw.crddacia_raw) sub
 WHERE sub.cnt > 1 AND sub.TYPE = 1

赞(0）回复(0）举报 2021-06-27

wmomyfyw5#

我强烈建议使用窗口函数。但是，hive不支持 count(distinct) 作为一个窗口函数。有不同的方法来解决这个问题。一个是 dense_rank() 学生：

select id, name, email, type, cnt
from (select t1.*,
             (dense_rank() over (partition by id order by email) +
              dense_rank() over (partition by id order by email desc)
             ) as cnt
      from table1 t1
     ) t
where type = 1;

我希望这比你的版本有更好的性能。但是，值得测试不同的版本，看看哪个版本的性能更好（并且可以随时回来让其他人知道哪个更好）。

赞(0）回复(0）举报 2021-06-27

我来回答

hive查询优化

5条答案

相关问题

热门标签

最新问答