用标志设置为1的最后一个值替换值

pb3s4cty  于 2021-08-13  发布在  Java
关注(0)|答案(2)|浏览(402)

我有一个表,其中每一行包含在某个事件中更改的所有字段,以及一个与每个字段相关联的标志,如果该字段被更新,则标记该字段。为了简单起见,我在这里只显示“status”字段,但它们也是其他几个字段。在给定字段未被事件修改的情况下,该字段设置为null,标志也设置为null。

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | WIP    |                   1 |
|  1 | 2020-01-08 15:53:53 |        |                     | //this line was generated because another field changed
|  1 | 2020-01-08 15:56:53 |        |                     | //this line was generated because another field changed
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

我的目标是将未更改字段的行的字段值替换为标志为1时的最后一个值,例如get:

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | WIP    |                   1 |
|  1 | 2020-01-08 15:53:53 | WIP    |                     |
|  1 | 2020-01-08 15:56:53 | WIP    |                     |
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

我知道我想用 last_value bigquery中的分析函数,我尝试了:

SELECT ID_DEMANDE, date, status, 
last_value(status) OVER (ORDER BY flag_changed_status, DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as current_status, flag_changed_status 
FROM table ORDER BY  id, DATE

其思想是,通过在order by函数中使用标志,标志设置为null的行将首先放入,然后最后一个\u值(status)将是标志\u changed\u status设置为1的最后一个值
但这只能在无界前向行和无界后向行之间使用,因为 ORDER BY 子句将在window frame子句(rows between…)之前处理,因此对于flag\u changed\u status为null的行,在order by处理之后,当前行号为0,因此无界前一行和当前行之间的最后一个值始终为null。
有没有办法先运行无界前一行和无界后一行之间的行,然后运行order by,这样last\ U value(status)将返回标志设置为1的当前行之前的最后一个值?还是有更简单的方法,仍然使用分析函数来完成一个查询中的所有不同字段?
编辑:我真的想复制上次设置标志时设置的状态,即使此状态为空,这就是为什么我尝试在order by中使用该标志。也就是说,如果初始表是:

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | null   |                   1 |
|  1 | 2020-01-08 15:53:53 | null   |                null |
|  1 | 2020-01-08 15:56:53 | null   |                null |
|  1 | 2020-01-08 15:57:53 | WIP    |                   1 |
|  1 | 2020-01-08 15:58:53 | null   |                null |
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

我需要:

+----+---------------------+--------+---------------------+
| id |        date         | status | flag_changed_status |
+----+---------------------+--------+---------------------+
|  1 | 2020-01-03 19:32:17 | TODO   |                   1 |
|  1 | 2020-01-08 15:46:07 | null   |                   1 |
|  1 | 2020-01-08 15:53:53 | null   |                null | // we copy the last status where the flag was 1, and it is null
|  1 | 2020-01-08 15:56:53 | null   |                null |
|  1 | 2020-01-08 15:57:53 | WIP    |                   1 |
|  1 | 2020-01-08 15:58:53 | WIP    |                null | //only this line changes
|  1 | 2020-01-08 16:02:31 | Done   |                   1 |
+----+---------------------+--------+---------------------+

但这似乎太复杂了,所以我将用一个自定义状态替换标志设置为1的所有空值,然后像@gordon linoff所建议的那样,使用一个简单的last_值(status ignore nulls)将提供几乎所需的结果

t8e9dugd

t8e9dugd1#

下面是bigquery标准sql


# standardSQL

SELECT * EXCEPT(grp),
  LAST_VALUE(status IGNORE NULLS) OVER (PARTITION BY grp ORDER BY date) AS updated_status
FROM (
  SELECT *,
    COUNTIF(flag_changed_status = 1) OVER(ORDER BY `date`) grp
  FROM `project.dataset.table`
)

如果要应用到问题的样本数据-结果是

Row id  date                status  flag_changed_status updated_status   
1   1   2020-01-03 19:32:17 TODO    1                   TODO     
2   1   2020-01-08 15:46:07 null    1                   null     
3   1   2020-01-08 15:53:53 null    null                null     
4   1   2020-01-08 15:56:53 null    null                null     
5   1   2020-01-08 15:57:53 WIP     1                   WIP  
6   1   2020-01-08 15:58:53 null    null                WIP  
7   1   2020-01-08 16:02:31 Done    1                   Done
4jb9z9bj

4jb9z9bj2#

我更喜欢 lag(ignore nulls) . 但bigquery不支持这一点。相反,使用 first_value() / last_value() :

with t as (
      select 1 as id, '2020-01-03 19:32:17' as date, 'TODO' as status, 1 as file_changed_status union all
      select 1 as id, '2020-01-08 15:46:07' as date, 'WIP ' as status, 1 as file_changed_status union all
      select 1 as id, '2020-01-08 15:53:53' as date, null as status, null  as file_changed_status union all
      select 1 as id, '2020-01-08 15:56:53' as date, null as status,  null as file_changed_status union all
      select 1 as id, '2020-01-08 16:02:31' as date, 'Done' as status, 1 as file_changed_status
     )
select t.*,
       last_value(status ignore nulls) over (order by date) as imputed_status
from t;

相关问题