regex BigQuery使用结构体从宽到长对表进行整形

zlhcx6iw  于 2022-11-26  发布在  其他
关注(0)|答案(1)|浏览(94)

我删除了这篇文章的前一个版本,以代替这篇经过清理的文章,并提供了一个可复制的例子。我有一个如下格式的表格:

WITH wide_stats AS (
  (
    SELECT 
      'joe' name, 'bills' team,
      struct(struct(7 as fga) as o, struct(8 as fga) as d) as t1,
      struct(struct(3 as fga) as o, struct(9 as fga) as d) as t2,
      struct(3 as pts, 9 as ast, 5 as reb) as t3,    
      7 tov, 3 blk
  ) UNION ALL (
    SELECT 'nick' name, 'jets' team,
      struct(struct(12 as fga) as o, struct(13 as fga) as d) as t1,
      struct(struct(15 as fga) as o, struct(22 as fga) as d) as t2,
      struct(13 as pts, 5 as ast, 15 as reb) as t3,    
      75 tov, 23 blk
  )
)

SELECT 
  name, team, metric, SAFE_CAST(value AS FLOAT64) value
FROM (
  SELECT 
    name, team, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
  FROM wide_stats,
  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(wide_stats), r'{|}', ''))) pair
)

WHERE NOT LOWER(metric) IN ('name', 'team')

并且我正在努力将该表重新调整为以下输出:

name   team       metric   value    
joe   bills     t1_o_fga       7 
joe   bills     t1_d_fga       8
joe   bills     t2_o_fga       3
joe   bills     t2_d_fga       9
joe   bills       t3_pts       3
joe   bills       t3_ast       9
joe   bills       t3_reb       5
joe   bills          tov       7
joe   bills          blk       3
nick   jets     t1_o_fga      12
nick   jets     t1_d_fga      13
nick   jets     t2_o_fga      15
nick   jets     t2_d_fga      22
nick   jets       t3_pts      13
nick   jets       t3_ast       5
nick   jets       t3_reb      15
nick   jets          tov      75
nick   jets          blk      23

这个任务很容易解释--从宽到长,但是在表中有struct和嵌套的struct。我在另一篇Stack Overflow帖子中的正则表达式工作是以错误的方式拆分列名,当前的输出与它所需要的不匹配。
行的顺序并不重要,对于变量名,不管是t1_o_fga还是t1-o-fga还是t1/o/fga,只要有一些分隔符,并且清楚地知道变量是什么就行了。

rjee0c15

rjee0c151#

下面是BigQuery标准SQL的示例

#standardSQL
WITH wide_stats AS (
    SELECT 'joe' name, 'bills' team,
      STRUCT(STRUCT(7 AS fga) AS o, STRUCT(8 AS fga) AS d) AS t1,
      STRUCT(STRUCT(3 AS fga) AS o, STRUCT(9 AS fga) AS d) AS t2,
      STRUCT(3 AS pts, 9 AS ast, 5 AS reb) AS t3, 7 tov, 3 blk UNION ALL 
    SELECT 'nick' name, 'jets' team,
      STRUCT(STRUCT(12 AS fga) AS o, STRUCT(13 AS fga) AS d) AS t1,
      STRUCT(STRUCT(15 AS fga) AS o, STRUCT(22 AS fga) AS d) AS t2,
      STRUCT(13 AS pts, 5 AS ast, 15 AS reb) AS t3, 75 tov, 23 blk
), flat_stats AS (
  SELECT name, team,
    t1.o.fga AS t1_o_fga,
    t1.d.fga AS t1_d_fga,
    t2.o.fga AS t2_o_fga,
    t2.d.fga AS t2_d_fga,
    t3.pts AS t3_pts,
    t3.ast AS t3_ast,
    t3.reb AS t3_reb,
    tov, blk
  FROM wide_stats
)
SELECT name, team, metric, SAFE_CAST(value AS FLOAT64) value 
FROM (
  SELECT name, team, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
  FROM flat_stats, 
  UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(flat_stats), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('name', 'team')

带输出

Row name    team    metric      value    
1   joe     bills   t1_o_fga    7.0  
2   joe     bills   t1_d_fga    8.0  
3   joe     bills   t2_o_fga    3.0  
4   joe     bills   t2_d_fga    9.0  
5   joe     bills   t3_pts      3.0  
6   joe     bills   t3_ast      9.0  
7   joe     bills   t3_reb      5.0  
8   joe     bills   tov         7.0  
9   joe     bills   blk         3.0  
10  nick    jets    t1_o_fga    12.0     
11  nick    jets    t1_d_fga    13.0     
12  nick    jets    t2_o_fga    15.0     
13  nick    jets    t2_d_fga    22.0     
14  nick    jets    t3_pts      13.0     
15  nick    jets    t3_ast      5.0  
16  nick    jets    t3_reb      15.0     
17  nick    jets    tov         75.0     
18  nick    jets    blk         23.0

如果出于某种原因,您在手动组装flat_stats临时表时遇到问题-您可以使用以下小技巧

步骤1-仅在传统模式下使用目标表[project:dataset.flat_stats]运行以下查询

#legacySQL
SELECT *
FROM [project:dataset.wide_stats]

令人惊讶的是,这将创建具有以下结构的表[project:dataset.flat_stats]

Row name    team    t1_o_fga    t1_d_fga    t2_o_fga    t2_d_fga    t3_pts  t3_ast  t3_reb  tov blk  
1   joe     bills   7           8           3           9           3       9       5       7   3    
2   nick    jets    12          13          15          22          13      5       15      75  23

第2步-之后,您只需运行下面的命令(现在使用标准SQL)

#standardSQL
SELECT name, team, metric, SAFE_CAST(value AS FLOAT64) value 
FROM (
  SELECT name, team, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 
    REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
  FROM `project.dataset.flat_stats` flat_stats, 
  UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(flat_stats), r'{|}', ''))) pair
)
WHERE NOT LOWER(metric) IN ('name', 'team')

相关问题