Apache Spark IN/EXISTS predicate 子查询只能用于筛选/连接

qyzbxkaa  于 2023-03-19  发布在  Apache
关注(0)|答案(2)|浏览(727)

我有3个视图dataset1currentall

SELECT
     t1.id,
     t1.year,
     t1.day,
     t1.month,
     (CASE WHEN t2.id is null then false else true end) as id_present,
     (CASE WHEN ((NOT (t1.id IN (SELECT DISTINCT id
                                FROM all a
                                WHERE a.dates = current_date - Interval '2' day))) 
                 AND 
                 ((t1.click_date = current_date) OR ((t1.count > 0) AND (t1.click_date < (current_date - INTERVAL  '12' MONTH)))))
            THEN 'Action1' 
            WHEN ((NOT (t1.id IN (SELECT DISTINCT id 
                                   FROM all a
                                   WHERE a.dates = current_date - Interval '2' day)))
                    AND (t1.count > 0) 
                    AND (t1.click_date < (current_date - INTERVAL  '12' MONTH))) 
            THEN 'Action2' 
            ELSE 'Action3' END) actions
FROM dataset1 t1 LEFT JOIN current t2 ON t1.id = t2.id

现在我面临的错误是Exception in thread "main" org.apache.spark.sql.AnalysisException: IN/EXISTS predicate sub-queries can only be used in Filter/Join and a few commands:
这个查询在basic SQL中可以正常工作,但在spark SQL中不行,我知道这是因为我在CASE WHEN中使用了NOT IN。为了避免这个错误,这个查询的替代方法是什么?

vq8itlhq

vq8itlhq1#

在Spark中,
1.如果all中的不同id数量较少,可以使用collect_set()函数创建不同id数组cross join dataset1,然后使用array_contains()函数检查id是否在不同id数组中。
下面是查询:

SELECT
     t1.id,
     year,
     day,
     month,
     (CASE WHEN t2.id is null then false else true end) as id_present,
     (CASE WHEN ((NOT array_contains(dia.distinct_id_array,t1.id))) 
                 AND 
                 ((click_date = current_date) OR ((count > 0) AND (click_date < (current_date - INTERVAL  '12' MONTH)))))
            THEN 'Action1' 
            WHEN ((NOT array_contains(dia.distinct_id_array,t1.id)))
                    AND (count > 0) 
                    AND (click_date < (current_date - INTERVAL  '12' MONTH))) 
            THEN 'Action2' 
            ELSE 'Action3' END
    ) as actions
FROM dataset1 t1 
LEFT JOIN current t2 
ON t1.id = t2.id
CROSS JOIN (
(SELECT collect_set(id) as distinct_id_array
 FROM all
 WHERE dates = current_date - Interval '2' day
) dia

1.如果all中的不同id的数量很大,则需要将left joindataset1和group by一起使用,以便为t1中的每个id生成一个is_in_all标志。
下面是查询:

WITH id_flag (
SELECT
    dataset1.id,
    COUNT(all.id) > 0 as is_in_all
FROM
    dataset1
LEFT JOIN
    all
ON
    dataset1.id = all.id
GROUP BY dataset1.id
)

SELECT
     t1.id,
     year,
     day,
     month,
     (CASE WHEN t2.id is null then false else true end) as id_present,
     (CASE WHEN ((NOT is_in_all)) 
                 AND 
                 ((click_date = current_date) OR ((count > 0) AND (click_date < (current_date - INTERVAL  '12' MONTH)))))
            THEN 'Action1' 
            WHEN ((NOT is_in_all))
                    AND (count > 0) 
                    AND (click_date < (current_date - INTERVAL  '12' MONTH))) 
            THEN 'Action2' 
            ELSE 'Action3' END
    ) as actions
FROM dataset1 t1 
LEFT JOIN current t2 
ON t1.id = t2.id
JOIN id_flag
ON
   t1.id = id_flag.id
arknldoa

arknldoa2#

您可以尝试使用此查询。它是您的查询的修改版本,我已将您的CASE WHEN中的子查询移动到JOIN子句,并将NOT IN替换为NOT EXISTS表达式。

SELECT
     t1.id,
     t1.year,
     t1.day,
     t1.month,
     (CASE WHEN t2.id IS NULL THEN false ELSE true END) AS id_present,
     (CASE WHEN (t3.id IS NOT NULL  
                 AND 
                 ((t1.click_date = current_date) OR ((t1.count > 0) AND (t1.click_date < (current_date - INTERVAL  '12' MONTH)))))
            THEN 'Action1' 
            WHEN (t3.id IS NOT NULL
                  AND (t1.count > 0) 
                  AND (t1.click_date < (current_date - INTERVAL  '12' MONTH))) 
            THEN 'Action2' 
            ELSE 'Action3' END) actions
FROM dataset1 t1 
     LEFT JOIN current t2 ON t1.id = t2.id
     LEFT JOIN 
       (SELECT id
        FROM   dataset1 t
        WHERE  NOT EXISTS 
              (SELECT 1
               FROM all a
               WHERE t.id = a.id AND a.dates = current_date - Interval '2' day)
       ) t3 ON t1.id = t3.id

相关问题