How to fill in rows based on available data

sqxo8psd  posted on 2021-08-09  in Java

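I am using Snowflake SQL.
My table has two columns: hour and customer_id. Each customer has two rows, one for the time they entered the store and one for the time they left. From this data I want to build a table that contains every hour the customer was in the store. For example, customer x enters the store at 1 PM and leaves at 5 PM, so there should be 5 rows (one per hour).
My attempt: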

select
    hour
    ,first_value(customer_id) over (partition by customer_id order by hour rows between unbounded preceding and current row) as customer_id
FROM table

xe55xuns1#

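In Snowflake, you would typically solve this with a table of numbers. You can generate such a derived table with the table (generator ...) syntax, then join it to an aggregate query that computes each customer's hour boundaries, using an inequality condition: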

select t.customer_id, dateadd(hour, n.rn, t.min_hour) final_hour
from (
    select t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour 
    from mytable t
    group by t.customer_id
) t
inner join (
    select row_number() over(order by null) - 1 rn 
    from table (generator(rowcount => 24))
) n on dateadd(hour, n.rn, t.min_hour) <= t.max_hour
order by customer_id, final_hour

This handles visits of up to 24 hours per customer. If you need more, increase the rowcount argument of the generator.
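To make the join logic easier to follow outside Snowflake, here is a minimal Python sketch of the same idea. It is not Snowflake code: the function name `expand_hours`, its `rowcount` parameter, and the sample dates are assumptions made for illustration, with `range(rowcount)` standing in for the generator-based numbers table.

```python
from datetime import datetime, timedelta

def expand_hours(rows, rowcount=24):
    """rows: iterable of (hour, customer_id). Returns sorted (customer_id, hour) pairs."""
    # Aggregate step: min and max hour per customer (the GROUP BY subquery).
    bounds = {}
    for hour, customer_id in rows:
        lo, hi = bounds.get(customer_id, (hour, hour))
        bounds[customer_id] = (min(lo, hour), max(hi, hour))

    # Join step: pair each customer with offsets 0..rowcount-1 and keep the
    # offsets where min_hour + offset <= max_hour (the inequality condition).
    result = []
    for customer_id, (min_hour, max_hour) in bounds.items():
        for n in range(rowcount):
            final_hour = min_hour + timedelta(hours=n)
            if final_hour <= max_hour:
                result.append((customer_id, final_hour))
    return sorted(result)

# Hypothetical sample: customer x enters at 1 PM and leaves at 5 PM.
rows = [
    (datetime(2019, 4, 1, 13), "x"),
    (datetime(2019, 4, 1, 17), "x"),
]
hours = expand_hours(rows)
for customer_id, hour in hours:
    print(customer_id, hour)
```

Note that a `rowcount` smaller than the visit length silently truncates the result, which is the same limitation the 24-row generator has in the query.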

5lhxktic2#

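For the example case shown in the test data, with only one day of data, GMB's solution above works well.
Once you have many days of data (which may or may not contain multiple store visits per day; let's assume you cannot stay in the store overnight),
this can be fixed by also grouping by day: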

select t.hour::date, t.customer_id, min(t.hour) min_hour, max(t.hour) max_hour 
from mytable t
group by 1,2

But multiple visits on the same day need either labeled data, like:

with mytable as (
  select * from values 
    ('2019-04-01 09:00:00','x','in')
    ,('2019-04-01 15:00:00','x','out')
    ,('2019-04-02 12:00:00','x','in')
    ,('2019-04-02 14:00:00','x','out')
   v(hour, customer_id, state)
)

or the direction can be inferred from the row order:

with mytable as (
  select * from values ('2019-04-01 09:00:00','x','in'),('2019-04-01 15:00:00','x','out')
     ,('2019-04-02 12:00:00','x','in'),('2019-04-02 14:00:00','x','out')
   v(hour, customer_id, state)
)
select hour::date as day
    ,hour
    ,customer_id
    ,state
    ,BITAND(row_number() over(partition by day, customer_id order by hour), 1) = 1 AS in_dir
from mytable
order by 3,1,2;

which gives:

DAY           HOUR                   CUSTOMER_ID    STATE    IN_DIR
2019-04-01    2019-04-01 09:00:00    x              in       TRUE
2019-04-01    2019-04-01 15:00:00    x              out      FALSE
2019-04-02    2019-04-02 12:00:00    x              in       TRUE
2019-04-02    2019-04-02 14:00:00    x              out      FALSE
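The inference relies on parity: within each day and customer, rows are numbered in time order, and odd-numbered rows are treated as entries. A minimal Python sketch of that rule follows, assuming events strictly alternate in/out within a day; the function name `infer_direction` here is a hypothetical helper for illustration, not part of Snowflake.

```python
from collections import defaultdict
from datetime import datetime

def infer_direction(rows):
    """rows: iterable of (hour, customer_id).
    Returns (day, hour, customer_id, in_dir) tuples ordered by customer, day, hour."""
    # Partition by (day, customer_id), like the window's PARTITION BY clause.
    groups = defaultdict(list)
    for hour, customer_id in rows:
        groups[(hour.date(), customer_id)].append(hour)

    out = []
    for (day, customer_id), hours in groups.items():
        # row_number() starts at 1; an odd row number means "in".
        for rn, hour in enumerate(sorted(hours), start=1):
            out.append((day, hour, customer_id, (rn & 1) == 1))
    return sorted(out, key=lambda r: (r[2], r[0], r[1]))

rows = [
    (datetime(2019, 4, 1, 9), "x"),
    (datetime(2019, 4, 1, 15), "x"),
    (datetime(2019, 4, 2, 12), "x"),
    (datetime(2019, 4, 2, 14), "x"),
]
tagged = infer_direction(rows)
for row in tagged:
    print(row)
```

If a day has a missing or duplicated event, the parity shifts and every later row that day is mislabeled, so this only holds for clean data.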

Now this can be combined with lead and qualify to get true ranges that handle multiple visits:

select customer_id
    ,day
    ,hour
    ,lead(hour) over (partition by customer_id, day order by hour) as exit_time
from infer_direction
qualify in_dir = true

It works by fetching the next time for every row per day/customer, and then (via qualify) keeping only the "in" rows.
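The same two steps, look ahead one row and then filter, can be sketched in Python as follows. This is an illustration only: the function name `visit_ranges` and the tuple layout are assumptions chosen to mirror the query's column names.

```python
from collections import defaultdict
from datetime import date, datetime

def visit_ranges(tagged):
    """tagged: iterable of (day, hour, customer_id, in_dir).
    Returns (customer_id, day, hour, exit_time) for the "in" rows only."""
    # Partition by (customer_id, day), like the window's PARTITION BY clause.
    groups = defaultdict(list)
    for day, hour, customer_id, in_dir in tagged:
        groups[(customer_id, day)].append((hour, in_dir))

    out = []
    for (customer_id, day), items in groups.items():
        items.sort()  # order by hour
        for i, (hour, in_dir) in enumerate(items):
            # lead(hour): the next row's hour, or None on the last row.
            exit_time = items[i + 1][0] if i + 1 < len(items) else None
            # qualify in_dir = true: keep the row only if it is an entry.
            if in_dir:
                out.append((customer_id, day, hour, exit_time))
    return sorted(out, key=lambda r: r[:3])

# Hypothetical tagged rows: two visits by customer x on the same day.
day = date(2019, 4, 2)
tagged = [
    (day, datetime(2019, 4, 2, 9), "x", True),
    (day, datetime(2019, 4, 2, 10), "x", False),
    (day, datetime(2019, 4, 2, 13), "x", True),
    (day, datetime(2019, 4, 2, 14), "x", False),
]
ranges = visit_ranges(tagged)
for r in ranges:
    print(r)
```

The look-ahead is computed for every row before the filter runs, which is why the filter has to be a qualify rather than a where clause in the query.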
Then we can join to the hours of the day:

select dateadd('hour', row_number() over(order by null) - 1, '00:00:00'::time) as hour
from table (generator(rowcount => 24))

So, weaving this all together:

with mytable as (
  select hour::timestamp as hour, customer_id, state 
  from values 
     ('2019-04-01 09:00:00','x','in')
     ,('2019-04-01 12:00:00','x','out')
     ,('2019-04-02 13:00:00','x','in')
     ,('2019-04-02 14:00:00','x','out')
     ,('2019-04-02 9:00:00','x','in')
     ,('2019-04-02 10:00:00','x','out')
   v(hour, customer_id, state)
), infer_direction AS (
  select hour::date as day
      ,hour::time as hour
      ,customer_id
      ,state
      ,BITAND(row_number() over(partition by day, customer_id order by hour), 1) = 1 AS in_dir
  from mytable
), visit_ranges as (
  select customer_id
      ,day
      ,hour
      ,lead(hour) over (partition by customer_id, day order by hour) as exit_time
  from infer_direction
  qualify in_dir = true
), time_of_day AS (
    select dateadd('hour', row_number() over(order by null) - 1, '00:00:00'::time) as hour
    from table (generator(rowcount => 24))
)
select t.customer_id
    ,t.day
    ,h.hour
from visit_ranges as t
join time_of_day h on h.hour between t.hour and t.exit_time
order by 1,2,3;

we get:

CUSTOMER_ID    DAY           HOUR
x              2019-04-01    09:00:00
x              2019-04-01    10:00:00
x              2019-04-01    11:00:00
x              2019-04-01    12:00:00
x              2019-04-02    09:00:00
x              2019-04-02    10:00:00
x              2019-04-02    13:00:00
x              2019-04-02    14:00:00
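As a rough cross-check of the final join, the Python sketch below expands each visit range against a 24-entry list of hours using an inclusive comparison, which is what `between` does. The `ranges` list is written out by hand to match the sample data and is an assumption, not output captured from Snowflake.

```python
from datetime import date, time

# 24 entries, 00:00 through 23:00, mirroring the time_of_day CTE.
time_of_day = [time(hour=h) for h in range(24)]

# Hand-written visit ranges for the sample data: (customer_id, day, hour, exit_time).
ranges = [
    ("x", date(2019, 4, 1), time(9), time(12)),
    ("x", date(2019, 4, 2), time(9), time(10)),
    ("x", date(2019, 4, 2), time(13), time(14)),
]

# Join condition: h.hour between t.hour and t.exit_time (both ends inclusive).
rows = sorted(
    (customer_id, day, h)
    for customer_id, day, start, end in ranges
    for h in time_of_day
    if start <= h <= end
)
for row in rows:
    print(row)
```

Because both ends are inclusive, the exit hour itself is counted as an hour in the store, which matches the 5-rows-for-1-PM-to-5-PM expectation in the question.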
