Handling a semantic version data type in BigQuery

oogrdqng  posted on 2021-07-24 in Java

I know BigQuery has no semantic version data type. How would you recommend handling semantic versions in BigQuery?
I have the following schema:

software:string,
software_version:string

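The software_version column is a STRING, but the data I store in it is in semver format: major.minor.patch-prerelease. In particular, I want to use the comparison operators <, > and = on it.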

select '4.0.0' < '4.0.0-beta'

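This returns true, but by the semver definition it should be false: the - character introduces a prerelease, and a prerelease has lower precedence than the corresponding release.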

ttisahbt1#

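The example below is for BigQuery Standard SQL.
You can use the compareSemanticVersions UDF to compare two semantic versions,
and/or use the normaizedSemanticVersion UDF in an ORDER BY clause to sort the output.
The example below covers both use cases (comparing and sorting).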


#standardSQL
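CREATE TEMP FUNCTION normaizedSemanticVersion(semanticVersion STRING) 
AS ((
  -- Left-pad every run of digits to 8 characters so numbers compare correctly as text,
  -- then append a suffix so that a version without a prerelease tag sorts after
  -- the same version with one.
  -- Within a group the separator (isDigit = FALSE) is emitted before its digit run.
  SELECT STRING_AGG(
      IF(isDigit, REPEAT('0', 8 - LENGTH(chars)) || chars, chars), '' ORDER BY grp, isDigit 
    ) || 'zzzzzzzzzzzzzz' 
  FROM (
    -- One row per non-digit character and one row per run of digits.
    SELECT grp, isDigit, STRING_AGG(char, '' ORDER BY OFFSET) chars,
    FROM (
      -- grp counts the non-digit characters seen so far, so a digit run shares
      -- its grp with the non-digit character that precedes it.
      SELECT OFFSET, char, isDigit,
        COUNTIF(NOT isDigit) OVER(ORDER BY OFFSET) AS grp
      FROM UNNEST(SPLIT(semanticVersion, '')) AS char WITH OFFSET, 
      UNNEST([char IN ('1','2','3','4','5','6','7','8','9','0')]) isDigit
    )
    GROUP BY grp, isDigit
)));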
CREATE TEMP FUNCTION compareSemanticVersions(
  normSemanticVersion1 STRING, 
  normSemanticVersion2 STRING) 
AS ((
  SELECT CASE 
      WHEN v1 < v2 THEN 'v2 newer than v1'
      WHEN v1 > v2 THEN 'v1 newer than v2'
      ELSE 'same versions'
    END
  FROM UNNEST([STRUCT(
    normaizedSemanticVersion(normSemanticVersion1) AS v1, 
    normaizedSemanticVersion(normSemanticVersion2) AS v2
  )])
));
WITH test AS (
  SELECT '1.10.0-alpha' AS v1 , '1.0.0-alpha.1' AS v2 UNION ALL
  SELECT '4.0.0', '4.0.0-beta' UNION ALL
  SELECT '1.0.0-alpha.1'     , '1.0.0-alpha.beta' UNION ALL
  SELECT '1.0.0-alpha.beta'  , '1.0.0-beta' UNION ALL
  SELECT '1.0.0-beta'        , '1.0.0-beta.2' UNION ALL
  SELECT '1.0.0-beta.2'      , '1.0.0-beta.11' UNION ALL
  SELECT '1.0.0-beta.11'     , '1.0.0-rc.1' UNION ALL
  SELECT '1.0.0-rc.1'        , '1.0.0' UNION ALL
  SELECT '1.0.0-alpha-1.1+build1234-a', '1.0.0-alpha-1.1+build1234-a'
)
SELECT v1, v2, compareSemanticVersions(v1, v2) result
FROM test 
ORDER BY normaizedSemanticVersion(v1)

with output:

Row v1                              v2                              result   
1   1.0.0-alpha-1.1+build1234-a     1.0.0-alpha-1.1+build1234-a     same versions    
2   1.0.0-alpha.1                   1.0.0-alpha.beta                v2 newer than v1     
3   1.0.0-alpha.beta                1.0.0-beta                      v2 newer than v1     
4   1.0.0-beta.2                    1.0.0-beta.11                   v2 newer than v1     
5   1.0.0-beta.11                   1.0.0-rc.1                      v2 newer than v1     
6   1.0.0-beta                      1.0.0-beta.2                    v1 newer than v2     
7   1.0.0-rc.1                      1.0.0                           v2 newer than v1     
8   1.10.0-alpha                    1.0.0-alpha.1                   v1 newer than v2     
9   4.0.0                           4.0.0-beta                      v1 newer than v2

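Note: I wrote the UDFs above based on my understanding of semantic versioning after reading the reference you provided. There are probably edge cases that still need to be handled, but it should definitely work for simple cases. I hope you can adopt these UDFs and adjust the output to your needs, or even optimize what I did here.
Another FYI: in the normaizedSemanticVersion UDF I append zzzzzzzzzz only to handle some edge cases. Another option I tried is ..zzzzzzzzzz (note the two extra dots). I think it gives better results for more complex cases, but I really did not have time to finish testing it. Please give it a try.
For example, the Semantic Versioning page gives this example: 1.0.0-alpha < 1.0.0-alpha.1 < 1.0.0-alpha.beta < 1.0.0-beta < 1.0.0-beta.2 < 1.0.0-beta.11 < 1.0.0-rc.1 < 1.0.0.
To reproduce the order in that example, ..zzzzzzzzzz should be used; see below.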


#standardSQL
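CREATE TEMP FUNCTION normaizedSemanticVersion(semanticVersion STRING) 
AS ((
  -- Same normalization as before (digit runs zero-padded to 8 characters),
  -- but with the '..' prefix added to the suffix so that, for example,
  -- 1.0.0-beta sorts before 1.0.0-beta.2.
  -- Within a group the separator (isDigit = FALSE) is emitted before its digit run.
  SELECT STRING_AGG(
      IF(isDigit, REPEAT('0', 8 - LENGTH(chars)) || chars, chars), '' ORDER BY grp, isDigit 
    ) || '..zzzzzzzzzzzzzz' 
  FROM (
    -- One row per non-digit character and one row per run of digits.
    SELECT grp, isDigit, STRING_AGG(char, '' ORDER BY OFFSET) chars,
    FROM (
      -- grp counts the non-digit characters seen so far.
      SELECT OFFSET, char, isDigit,
        COUNTIF(NOT isDigit) OVER(ORDER BY OFFSET) AS grp
      FROM UNNEST(SPLIT(semanticVersion, '')) AS char WITH OFFSET, 
      UNNEST([char IN ('1','2','3','4','5','6','7','8','9','0')]) isDigit
    )
    GROUP BY grp, isDigit
)));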
CREATE TEMP FUNCTION compareSemanticVersions(
  normSemanticVersion1 STRING, 
  normSemanticVersion2 STRING) 
AS ((
  SELECT
    CASE 
      WHEN v1 < v2 THEN 'v2 newer than v1'
      WHEN v1 > v2 THEN 'v1 newer than v2'
      ELSE 'same versions'
    END
  FROM UNNEST([STRUCT(
    normaizedSemanticVersion(normSemanticVersion1) AS v1, 
    normaizedSemanticVersion(normSemanticVersion2) AS v2
  )])
));
WITH test AS (
  SELECT 1 `order`, '1.0.0-alpha' version UNION ALL
  SELECT 2, '1.0.0-alpha.1' UNION ALL
  SELECT 3, '1.0.0-alpha.beta' UNION ALL
  SELECT 4, '1.0.0-beta' UNION ALL
  SELECT 5, '1.0.0-beta.2' UNION ALL
  SELECT 6, '1.0.0-beta.11' UNION ALL
  SELECT 7, '1.0.0-rc.1' UNION ALL
  SELECT 8, '1.0.0.' 
)
SELECT *
FROM test
ORDER BY normaizedSemanticVersion(version)

with output that matches the Semantic Versioning specification:

Row order   version  
1   1   1.0.0-alpha  
2   2   1.0.0-alpha.1    
3   3   1.0.0-alpha.beta     
4   4   1.0.0-beta   
5   5   1.0.0-beta.2     
6   6   1.0.0-beta.11    
7   7   1.0.0-rc.1   
8   8   1.0.0.

qgzx9mmu2#

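This is not a BigQuery issue. The expression '4.0.0' < '4.0.0-beta' returns true in any language that compares strings character by character, because that comparison is alphabetical (lexicographic), and alphabetical order is not the same as semantic version order.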
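As a small illustration of that point (a sketch that is not part of the original answer; the column aliases are made up for the example), the query below shows the two ways lexicographic STRING comparison diverges from version order: a string that is a prefix of another sorts first, and '10' sorts before '2'.

-- STRING comparison is lexicographic, not version-aware.
SELECT
  -- true: '4.0.0' is a prefix of '4.0.0-beta', so it sorts first
  '4.0.0' < '4.0.0-beta' AS prefix_sorts_first,
  -- true: the strings first differ at '1' vs '2', and '1' < '2'
  '4.10.0' < '4.2.0' AS ten_sorts_before_two

Both columns come back true, although semver orders both pairs the other way around.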
I suggest writing a custom UDF to solve your problem, or trying SQL like the following:

with data as (
select "4.0.0" as version
union all select "4.0.0-beta" as version
)
select 
split(d.version,'-')[offset(0)] as version,
case array_length(SPLIT(d.version,'-')) 
  when 1 then NULL
  when 2 then split(d.version,'-')[offset(1)]
end as prerelease
from data as d
order by version asc, prerelease desc

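Of course, you also need to be careful when comparing the version part itself, because in a case like the following the comparison will not work as you expect ('4.10.0' sorts before '4.2.0'):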

with data as (
select "4.0.0" as version
union all select "4.1.0" as version
union all select "4.2.0" as version
union all select "4.10.0" as version
)
select 
split(d.version,'-')[offset(0)] as version,
case array_length(split(d.version,'-')) 
  when 1 then NULL
  when 2 then split(d.version,'-')[offset(1)]
end as prerelease
from data as d
order by version asc, prerelease desc

So you then have to split the version into major, minor and patch, and compare each element separately.
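A minimal sketch of that approach follows. It is not from the original answer and makes several assumptions: every version has exactly three numeric parts before any '-', build metadata after '+' is absent, and prerelease identifiers are not compared with each other (a prerelease is only placed before its release). The names parts, core and is_prerelease are illustrative.

-- Order versions by numeric major, minor and patch instead of as text.
WITH data AS (
  SELECT '4.0.0' AS version UNION ALL
  SELECT '4.1.0' UNION ALL
  SELECT '4.2.0' UNION ALL
  SELECT '4.10.0' UNION ALL
  SELECT '4.0.0-beta'
),
parts AS (
  SELECT
    version,
    -- Text before the first '-' split on '.', giving [major, minor, patch].
    SPLIT(SPLIT(version, '-')[OFFSET(0)], '.') AS core,
    -- TRUE when the version carries a prerelease suffix.
    STRPOS(version, '-') > 0 AS is_prerelease
  FROM data
)
SELECT version
FROM parts
ORDER BY
  SAFE_CAST(core[SAFE_OFFSET(0)] AS INT64),  -- major
  SAFE_CAST(core[SAFE_OFFSET(1)] AS INT64),  -- minor
  SAFE_CAST(core[SAFE_OFFSET(2)] AS INT64),  -- patch
  is_prerelease DESC                         -- TRUE first: prerelease before its release

With this sample data the rows should come back as 4.0.0-beta, 4.0.0, 4.1.0, 4.2.0, 4.10.0. SAFE_CAST and SAFE_OFFSET return NULL instead of raising an error on malformed input, so such rows sort first and are easy to spot. Ordering the prerelease identifiers themselves would need additional logic, such as the normalization UDF from the first answer.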
