如何在bigquery中使用lag和over来计算周移动平均值的百分比变化?

8ftvxx2r  于 2021-07-24  发布在  Java
关注(0)|答案(2)|浏览(719)

我正在用bigquery usa facts covid-19开放数据集做一个项目。数据如下所示:

我正在尝试创建一个查询,该查询将按县对7天covid案例数据进行百分比更改(上下)。最终的结果将是县,日期和7天移动平均数的变化百分比。最终,这将使我能够展示哪些病例相对稳定,哪些病例在增加,也就是热点。
我对滞后和过度工作还不熟悉。所以我很确定我的cte中缺少了一些基本的order by或group by。
这很奇怪,因为当我只选择了一个县(其中县\u name=“x”),我就能得到7天的移动平均线——它只是给了我一个很好的百分比,告诉我它是增加还是减少。问题是,当我不是只选择一个县时,我就是无法弄清楚我需要做什么,或者我需要改变什么,以便仍然获得相同的价值。我最终得到的价值观毫无意义。我很确定这是因为我只是用错了窗口函数。
这是我的密码:

WITH 
a AS (SELECT long.*,
deaths-lag(deaths) over (order by date) as deaths_increase,
confirmed_cases - lag(confirmed_cases) over (order by date) as cases_increase,
FROM `bigquery-public-data.covid19_usafacts.summary` as long
where date >= cast('2020-05-03' as date)

)
,b as (
 SELECT
 a.*,
 AVG(a.deaths_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
 AVG(a.cases_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
FROM a
order by a.county_name

)

select 
b.county_name, 
b.county_fips_code,
b.confirmed_cases,
b.cases_increase,
b.deaths,
b.state,
b.seven_day_avg_cases,
b.date,

(b.seven_day_avg_cases - lag(b.seven_day_avg_cases) OVER( ORDER BY b.date)) / b.seven_day_avg_cases * 100 as seven_day_percent_change

from b

where seven_day_avg_cases > 0

order by date desc
3lxsmp7m

3lxsmp7m1#

下面是bigquery标准sql
你应该加上 PARTITION BY county_name 查询中的所有(…)语句
之后,您的查询可以如下所示


# standardSQL

WITH a AS (
  SELECT long.*,
    deaths-lag(deaths) OVER(PARTITION BY county_name ORDER BY DATE) AS deaths_increase,
    confirmed_cases - LAG(confirmed_cases) OVER (PARTITION BY county_name ORDER BY DATE) AS cases_increase,
  FROM `bigquery-public-data.covid19_usafacts.summary` AS long
  WHERE DATE >= CAST('2020-05-03' AS DATE)
), b AS (
  SELECT a.*,
    AVG(a.deaths_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
    AVG(a.cases_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
  FROM a
)
SELECT 
  b.county_name, 
  b.county_fips_code,
  b.confirmed_cases,
  b.cases_increase,
  b.deaths,
  b.state,
  b.seven_day_avg_cases,
  b.date,
  (b.seven_day_avg_cases - LAG(b.seven_day_avg_cases) OVER(PARTITION BY county_name ORDER BY b.date)) / b.seven_day_avg_cases * 100 AS seven_day_percent_change
FROM b
WHERE seven_day_avg_cases > 0
ORDER BY DATE DESC, county_name

注意:显然假设您的原始查询确实适用于一个国家
你的问题的另一个弱点是 ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -这将窗口设置为7个连续的行(而不是天),这意味着只有在统计数据中包含所有天时,此窗口才起作用-这很可能是此数据的情况。但更恰当的用法是 ORDER BY UNIX_DATE(a.date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW 取而代之的是,这可以保证你使用7天的窗口,即使有些天因为任何原因而丢失或过滤掉了,等等。

tcbh2hod

tcbh2hod2#

您需要添加partitionby:lag(whatever)over(partitionbycountyname orderbydate)。指定了partitionby之后,您将在该countyname上按日期排序。

相关问题