hadoop查询,用于比较行值和组值,条件为

sr4lhrrt  于 2021-05-29  发布在  Hadoop
关注(0)|答案(2)|浏览(280)

我希望将一些r代码移植到hadoop中,以便与impala或具有类似sql查询的hive一起使用。我的代码基于这个问题:
r数据表:比较行值和组值,条件
对于每一行,查找子组1中具有相同id且价格较低的行数的点。
假设我有以下数据:

CREATE TABLE project
(
    id int,
    price int, 
    subgroup int
);

INSERT INTO project(id,price,subgroup) 
VALUES
    (1, 10, 1), 
    (1, 10, 1), 
    (1, 12, 1),
    (1, 15, 1),
    (1,  8, 2),
    (1, 11, 2),
    (2,  9, 1),
    (2, 12, 1),
    (2, 14, 2),
    (2, 18, 2);

现在,对于子组1中的行,以下查询在impala中运行良好:

select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 1

但是我还需要处理第二组的行。
所以我希望得到的结果是:

id  price   subgroup   cheaper
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   12      1          2 ( rows 1 and 2 are cheaper)
1   15      1          3
1    8      2          0 (nobody is cheaper in id 1 and subgroup 1)
1   11      2          2
2    9      1          0
2   12      1          1
2   14      2          2
2   18      2          2
h43kikqp

h43kikqp1#

我们可以尝试以下方法query:-

select * from 
    (
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 1 union
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 2) result
wydwbb8l

wydwbb8l2#

不久前我遇到了完全相同的问题。就像你需要一个窗口功能,你可以把一个 where 中的子句。为了解决这个问题,我将price收集到一个数组(其中subgroup=1)中,并自联接表。然后我编写了一个udf来过滤给定 predicate 的数组。
自定义项:

package somepkg;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;

public class FilterArrayUDF extends UDF {
    public ArrayList<Integer> evaluate(ArrayList<Text> arr, int p) {
        ArrayList<Integer> newList = new ArrayList<Integer>();

        for (i = 0; i < arr.size(); i++) {
            int elem = Integer.parseInt((arr.get(i)).toString());
            if (elem < p)
                newList.add(elem);
        }
        return newList;
    }
}

当你有了过滤后的数组,你就可以得到它的大小。
查询:

add jar /path/to/jars/hive-udfs.jar;
create temporary function filter_arr as 'somepkg.FilterArrayUDF';

select B.id, price, subgroup, price_arr
  , filter_arr(price_arr, price) cheaper_arr
  , size(filter_arr(price_arr, price)) cheaper
from db.tbl B
join (
  select id, collect_list(price) price_arr
  from db.tbl
  where subgroup = 1
  group by id ) A
on B.id = A.id

输出:

1    10    1    [10,10,12,15]    []               0
1    10    1    [10,10,12,15]    []               0
1    12    1    [10,10,12,15]    [10,10]          2
1    15    1    [10,10,12,15]    [10,10,12]       3
1    8     2    [10,10,12,15]    []               0
1    11    2    [10,10,12,15]    [10,10]          2
2    9     1    [9,12]           []               0
2    12    1    [9,12]           [9]              1
2    14    2    [9,12]           [9,12]           2
2    18    2    [9,12]           [9,12]           2

相关问题