如何优化配置单元查询,以便从多个表中查找记录总数

hfyxw5xn  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(418)

我必须生成一个报告,该报告将给出表a、b和c中使用配置单元存储的事件的计数总和,并且我的s3存储桶已按组织id分区
例如:表a—记录了john(和其他员工)每天上班的情况表b—记录了john(和其他员工)在工作中打电话或接电话的情况表c—记录了john(和其他员工)在工作中提交的每一笔费用
基本上,我要的是上个月a,b和c中john(雇员id)的总数。如果3个表a、b或c中有一个记录,则每个日期只能有一个记录(如果一个或多个表中有一个日期记录,则计数相加)。所以我的输出是:

Employee id
Employee Name 
Date
Count
123
John
02-Jan-2016
55
123
John
12-Jan-2016
88
123
John
19-Jan-2016
103

我提出的问题是:

select  adcts.employee_name, adcts.employee_id,Total_count as event_count, adcts.event_date  
from   
       (select   coalesce(Evts.employee_id,imps.employee_id,AEvts.employee_id) as   employee_id  
        ,   coalesce(Evts.employee_name,imps.employee_name,AEvts.employee_name) as   employee_name  
        , coalesce(Evts.Event_count,0) + coalesce(Imps.Impression_count,0)   + coalesce (AEvts.Event_Count,0)as Total_Count  
        , coalesce (Evts.event_date,imps.impression_date, AEvts.event_date)   as event_date  
    from  
        (select employee_id, employee_name, count(*) as   Event_count,event_date  
         from mm_events  
         where organization_id = 100048  
         and event_date between '2016-02-01' and '2016-02-04'  
        group by employee_id, employee_name,event_date) Evts  
       full outer join  
        (select employee_id, employee_name, count(*) as Impression_count,   impression_date   
         from mm_impressions  
         where organization_id = 100048  
         and impression_date between '2016-02-01' and '2016-02-04'  
        group by employee_id, employee_name,impression_date) Imps  
        on Evts.employee_id = Imps.employee_id  
       full outer join  
        (select employee_id, employee_name, count(*) as   Event_count,event_date  
         from mm_attributed_events  
         where organization_id = 100048  
         and event_date between '2016-02-01' and '2016-02-04'  
         and event_type = 'click'  
        group by employee_id, employee_name,event_date) AEvts  
     on AEvts.employee_id=Evts.employee_id  
       ) adcts     
join  
        (select distinct c.employee_id from default.t1_meta_dmp c   
         where c.employee_dmp_enabled='inherits'  
         and c.agency_dmp_enabled = 'inherits'  
         and c.agency_status='true'  
         and c.employee_status='true'  
         and c.organization_id = 100048) cc  
on adcts.employee_id=cc.employee_id  
order by adcts.employee_id asc

我有两个问题:
1我有正确的问题吗?2因为我使用的是“完全外部联接”,所以同一个日期有多个条目。有人能提出一个更好的方法来达到这个结果吗?可能有不同的问题

1u4esq0p

1u4esq0p1#

同一个项目有多个条目 date 因为你是按 date 在子查询中,但仅通过 employee_id . 这就是为什么你的记录在加入后会被复制。你应该加上 event_date 以及连接条件。
看来你不需要 FULL JOIN 完全。加入比加入更贵 union all . 使用union all从每个表中选择然后 group by employee_name, employee_id, event_date 和聚合计数():

select employee_id, employee_name, sum(Event_count) as Total_Count , event_date 
    from
    (
    select employee_id, employee_name, count(*) as Event_count, event_date  from mm_events 
    where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name, event_date

    union all  
    select employee_id, employee_name, count(*) as Event_count, impression_date as event_date   
    from mm_impressions
     where organization_id = 100048 and impression_date between '2016-02-01' and '2016-02-04' 
group by employee_id, employee_name,impression_date

    union all 
    select employee_id, employee_name, count(*) as Event_count,event_date  
    from mm_attributed_events 
    where organization_id = 100048  and event_date between '2016-02-01' and '2016-02-04'  and event_type = 'click'
group by employee_id, employee_name, event_date
    ) adcts
    group by employee_id, employee_name, event_date

将joinwithcc查询添加到上述查询中。
union all中的所有子查询都将并行运行

相关问题