填充配置单元中最后一个非空值的空值

d4so4syb  于 2021-05-30  发布在  Hadoop
关注(0)|答案(4)|浏览(340)

我有4列

date   number   Estimate   Client    
----   ------
1      3          10        A 
2      NULL       10        Null
3      5          10        A      
4      NULL       10        Null 
5      NULL       10        Null
6      2          10        A   
.......

我需要将空值替换为新值,该值采用日期列中上一个日期中最后一个已知值的值,例如:date=2 number=3,date 4 and 5 number=5 and 5。空值随机出现。
这需要在Hive里完成。

tkclm6bt

tkclm6bt1#

下面是一个使用标准hiveql连接的解决方案。这应该适用于所有版本的Hive。表c合并了不为空的客户机的最近日期。后面的表d合并了与该日期相关联的编号。使用coalesce仅当数字为null时才使用附加值。

select c.date
, coalesce(c.number,d.number) as number
, c.client
, estimate
from
    (select date
     , max(prior_date) as prior_date  -- nearest date not null number
     , value
     , estimate
     , a.client
     from
         (select date
           , value
           , estimate
           , client 
           from table_have
         ) a
        left outer join
        (select date as prior_date -- dates without nulls
          , client 
          from table_have 
          where number is not null
        ) b
        on a.client=b.client
        where date > prior_dates
        group by a.client, date, value
   ) c
 left outer join
  (select date
   , number
   , client
   from table_have
   where number is not null
  ) d
  on c.client = d.client and c.prior_date=d.date
  group by c.date, c.client, estimate
;

通过使用与备用解决方案类似的公共表表达式,可以对该查询进行更多优化。然而,这个解决方案不需要n次重复的行,应该是通用的。另一种解决方案中所需的数字n可能不是静态的,因为这种解决方案可能适用于更一般的情况。

ni65a41a

ni65a41a2#

这是关于推拉窗的;
这是我的表格内容;

hive> select * from my_table;
OK
1       3       10      A
2       NULL    10      NULL
3       5       10      A
4       NULL    10      NULL
5       NULL    10      NULL
6       2       10      A
Time taken: 0.06 seconds, Fetched: 6 row(s)

您所需要做的就是在precedings和当前行之间的窗口上滑动,找到最近的notnull值。 LAST_VALUE windowable函数有一个参数可以忽略布尔值形式的空值。 LAST_VALUE(<field>,<ignore_nulls> as boolean) ;

SELECT
    COALESCE(`date`, LAST_VALUE(`date`, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
    COALESCE(number, LAST_VALUE(number, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
    COALESCE(estimate, LAST_VALUE(estimate, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
    COALESCE(client, LAST_VALUE(client, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM my_table;

结果将是;

OK
1       3       10      A
2       3       10      A
3       5       10      A
4       5       10      A
5       5       10      A
6       2       10      A
Time taken: 19.177 seconds, Fetched: 6 row(s)
ht4b089n

ht4b089n3#

如果您使用的是sql,那么下面的查询就没有帮助了。否则可以使用ffill和bfill函数。

select primary_key_val,country,
COALESCE(country, LAST_VALUE(country, TRUE) OVER(partition by primary_key_val order**by eff_start_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as upd_country,**
eff_start_dt from dim_acct_keys order by primary_key_val,eff_start_dt

一些数据例如:

+------------------+----------+--------------+---------------+
| primary_key_val  | country  | upd_country  | eff_start_dt  |
+------------------+----------+--------------+---------------+
| act1010          | USA      | USA          | 20190101      |
| act1010          | NULL     | USA          | 20190102      |
| act1010          | NULL     | USA          | 20190103      |
| act1012          | USA      | USA          | 20190101      |
| act1012          | NULL     | USA          | 20190102      |
| act1012          | MEX      | MEX          | 20190103      |
qlckcl4x

qlckcl4x4#

这实际上是一个相当棘手的问题,因为hive不支持递归CTE或相关子查询,这是解决此类问题的常用方法。
我能想到的唯一纯Hive方式就是做一堆自我连接。您必须在数据中执行尽可能多的最大长度的连续空值。

--add in row numbers
with T as
(select select *, row_number() over (order by date) rn
 from mytable)
--main query
select T.date, 
       case when T.number is not null then T.number
       else when T1.number is not null then T1.number
       else when T2.number is not null then T2.number end as number
       --repeat this N times 
       --where N is the length of the longest sequece of consectutive nulls 

       -- add in your other columns here
from T 
join T T1 on T1.date = t.date - 1
join T T2 on T2.date = t.date - 2
--repeat this N times

相关问题