用户定义函数中不允许使用array\u agg(标准sql)

hrysbysz  于 2021-07-24  发布在  Java
关注(0)|答案(1)|浏览(426)

在bigquery上使用一个用户定义的函数从杂乱的数据集中提取电子邮件时,我面临着一个问题:temp user defined function(udf)的主体中不允许使用数组\u agg()。

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
    ARRAY_AGG(
        DISTINCT 
        (SELECT * FROM 
            UNNEST(
                SPLIT(
                    REPLACE(
                        LOWER(
                            ARRAY_TO_STRING(emails, ",")
                        )," ", ""
                    )
                )
            ) AS e where e like '%@%'
        ) IGNORE NULLS
    )[SAFE_OFFSET(index)]
);

SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"],1) as email_1

我尝试绕过数组\u agg,方法是选择from unnest with offset,然后选择offset作为索引的位置。
但是,现在有一个列限制(在标量子查询select子句中不超过一列)建议改用select as struct。
我尝试了select as结构:

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (

    (SELECT AS STRUCT DISTINCT list.e, list.o FROM 
        UNNEST(
            SPLIT(
                REPLACE(
                    LOWER(
                        ARRAY_TO_STRING(emails, ", ")
                    )," ", ""
                )
            )
        ) AS list
        WITH OFFSET as list.o
        WHERE list.e like '%@%' AND list.o = index)
);

SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bob@yahoo.com"],1) as email_1

但它不喜欢我的distinct,即使删除它,它也会抱怨解析e和o。
所以我没主意了,我可能结了个结。有人能建议如何在自定义项框架内完成这项工作吗?谢谢。

3xiyfsfu

3xiyfsfu1#

以下版本有效

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
    SELECT ARRAY(
        SELECT * 
          FROM UNNEST(
                SPLIT(
                    REPLACE(
                        LOWER(
                            ARRAY_TO_STRING(emails, ",")
                        )," ", ""
                    )
                )
            ) AS e WHERE e LIKE '%@%'
    )[SAFE_OFFSET(index)]
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1

有结果的

Row email_1  
1   test@gmail.com

或以下版本(只是对原始查询的轻微更正)

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
  SELECT ARRAY_AGG(e)[SAFE_OFFSET(index)] 
  FROM UNNEST(
        SPLIT(
            REPLACE(
                LOWER(
                    ARRAY_TO_STRING(emails, ",")
                )," ", ""
            )
        )
    ) AS e WHERE e LIKE '%@%'
));
SELECT GET_EMAIL(["bob@hotmail.com,test@gmail.com", "12345", "bon@yahoo.com"], 1) AS email_1

显然是同样的结果

相关问题