基于group by计算空值的百分比

4xrmg8kj  于 2021-06-25  发布在  Hive
关注(0)|答案(1)|浏览(630)

我正在尝试修改这个基于group by计算空值的解决方案(第一个答案),唯一的区别是我想要它们的百分比(例如,2016年第1列的30%为空),而不是计数(例如,2016年第1列的6521为空)。我的问题:

WITH t1nulltest AS
( select date_column
,SUM(IF(c1 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c1null
,SUM(IF(c2 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c2null
,SUM(IF(c3 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c3null
,SUM(IF(c4 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c4null
,SUM(IF(c5 IS NULL,1,0))/count(*) OVER (PARTITION BY date_column) as c5null
,row_number() OVER (PARTITION BY date_column) as rowno
from t1) 
select 
  date_column, c1null, c2null,c3null,c4null,c5null from t1nulltest
  where rowno =1;

与原始解决方案的唯一区别是 /count(*) ,但这行不通,我想知道为什么。原始查询工作正常。我的查询给出了错误:

Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: line 2:68 Expression not in GROUP BY key 'date_column'
dnph8jn4

dnph8jn41#

我怀疑整个逻辑可以简化为一个简单的聚合查询。

select
    date_column,
    1.0 * (count(*) - count(c1)) / count(*) c1_null_ratio,
    1.0 * (count(*) - count(c2)) / count(*) c2_null_ratio,
    1.0 * (count(*) - count(c3)) / count(*) c3_null_ratio,
    1.0 * (count(*) - count(c4)) / count(*) c4_null_ratio,
    1.0 * (count(*) - count(c5)) / count(*) c5_null_ratio
from t1
group by date_column

这也可以用 avg() :

select
    date_column,
    avg(case when c1 is null then 1 else 0 end) c1_null_ratio,
    avg(case when c2 is null then 1 else 0 end) c2_null_ratio,
    avg(case when c3 is null then 1 else 0 end) c3_null_ratio,
    avg(case when c4 is null then 1 else 0 end) c4_null_ratio,
    avg(case when c5 is null then 1 else 0 end) c5_null_ratio,
from t1
group by date_column

相关问题