使用范围联接减少记录数

r7s23pms  于 2021-06-26  发布在  Hive
关注(0)|答案(1)|浏览(354)

在我的问题之后,我有以下表格第一个(范围)包括值的范围和附加列:

row  | From   |  To     | Country ....
-----|--------|---------|---------
1    | 1200   |   1500  |
2    | 2200   |   2700  |
3    | 1700   |   1900  |
4    | 2100   |   2150  |
...

这个 From 以及 To 是bigint和排他性的。范围表包括180万条记录。其他表(值)包含270万条记录,如下所示:

row     | Value  | More columns....
 --------|--------|----------------
    1    | 1777   |    
    2    | 2122   |    
    3    | 1832   |    
    4    | 1340   |    
    ...

我想创建一个表,如下所示:

row      | Value  | From   | To    | More columns....
 --------|--------|--------|-------|---
    1    | 1777   | 1700   | 1900  |
    2    | 2122   | 2100   | 2150  |   
    3    | 1832   | 1700   | 1900  |   
    4    | 1340   | 1200   | 1500  |   
    ...

我在以下代码中使用了左外连接:

set n=1000;

select      v.id
           ,v.val
           ,r.from_val
           ,r.to_val

from      val v
        left outer join    

 (select  r.*
                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

            from    val_range r
                    lateral view    posexplode
                                    (
                                        split
                                        (
                                            space
                                            (
                                                cast
                                                (
                                                    floor(to_val/${hiveconf:n}) 
                                                  - floor(from_val/${hiveconf:n}) 

                                                    as int
                                                )
                                            )
                                           ,' '
                                        )
                                    ) pe as i,x
            ) r

            on      floor(v.val/${hiveconf:n})    =
                    r.match_val

where       v.val between r.from_val and r.to_val

order by    v.id       
;

然而,新表中的记录数大大减少了,在270万条记录中,有31k条记录。如果我使用 left outer join ? 我该怎么修?

hjzp0vay

hjzp0vay1#

假设我们有一个 v.id ```
set n=1000;

select v.id
,r.from_val
,r.to_val

from val v

        left join  (select      v.id
                               ,r.from_val
                               ,r.to_val

                    from                val     v 

                                join    (...)   r 

                                on      floor(v.val/${hiveconf:n})    =
                                        r.match_val

                    where       v.val between r.from_val and r.to_val
                    ) r

        on          r.id    =
                    v.id

order by v.id

对于op请求,下面是完整的查询:

set n=1000;

select v.id
,r.from_val
,r.to_val

from val v

        left join  (select      v.id
                               ,r.from_val
                               ,r.to_val

                    from                val     v 

                                join   (select  r.*
                                               ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

                                        from    val_range r
                                                lateral view    posexplode
                                                                (
                                                                    split
                                                                    (
                                                                        space
                                                                        (
                                                                            cast
                                                                            (
                                                                                floor(to_val/${hiveconf:n}) 
                                                                              - floor(from_val/${hiveconf:n}) 

                                                                                as int
                                                                            )
                                                                        )
                                                                       ,' '
                                                                    )
                                                                ) pe as i,x
                                        ) r

                                on      floor(v.val/${hiveconf:n})    =
                                        r.match_val

                    where       v.val between r.from_val and r.to_val
                    ) r

        on          r.id    =
                    v.id

order by v.id

相关问题