Creating mutually exclusive groupings in SQL (table with pairs)

pjngdqdw  posted on 2021-08-09  in Java

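I'm looking for some help structuring a query. I have a table whose rows contain a link timestamp, a user ID, a linked ID, and the type of link. The link types are, for example, "email" and "phone number", so in the example below you can see that user 1 is not directly connected to user 3, but is connected through user 2. Another complication is that each "linked account" also appears in the user id column, so every link is stored in both directions and there are several "duplicate" rows (in the example: rows 1+2, rows 3+4).
Example: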

link time          user id   linked_id   link type
---------------------------------------------------
link_occurred_at   user 1    user 2      link a
link_occurred_at   user 2    user 1      link a
link_occurred_at   user 2    user 3      link b
link_occurred_at   user 3    user 2      link b
link_occurred_at   user 4    user 5      link a
link_occurred_at   user 5    user 4      link a

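What function can I use to get the first user ID, a count of all (direct + indirect) linked accounts, and possibly an array of the linked account IDs?
For example, the output I would want here is: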

initial user - Count linked accounts  array of linked accounts 
--------------------------------------------------------------
user 1         2 linked               [user 2, user 3]
user 4         1 linked account       [user 5]

This would give me mutually exclusive groupings of all the linked-account networks.

bbuxkriu1#

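I hadn't heard of recursive CTEs until Erwin Brandstetter mentioned them in the comments above. The concept is what it sounds like: a CTE that references itself and has a base case so that the recursion terminates. For your problem, a recursive CTE solution might look something like this: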

WITH RECURSIVE accumulate_users (user_id, linked_accounts) AS (
  -- Base case: the direct links from a user_id.
  SELECT
    user_id,
    ARRAY_AGG(DISTINCT linked_id) AS linked_accounts
  FROM your_table
  GROUP BY user_id

  UNION ALL

  -- Recursive case: transitively linked accounts.
  -- Both branches must return the same columns, so user_id is carried along.
  -- Untested: some engines (PostgreSQL, for example) do not allow
  -- aggregates in the recursive term.
  SELECT
    accumulate_users.user_id,
    ARRAY_UNION(
      accumulate_users.linked_accounts,
      ARRAY_AGG(DISTINCT your_table.linked_id)
    ) AS linked_accounts
  FROM accumulate_users
  JOIN your_table ON CONTAINS(accumulate_users.linked_accounts, your_table.user_id)
  GROUP BY accumulate_users.user_id, accumulate_users.linked_accounts

  -- But there is no enforced termination condition, hopefully it just
  -- ends at some point? This is part of why implementing recursive CTEs
  -- is challenging, I think.
)
SELECT
  user_id,
  CARDINALITY(linked_accounts) AS count_linked_accounts,
  linked_accounts
FROM accumulate_users

However, I haven't been able to test this query because, as detailed in another Stack Overflow Q&A, Presto doesn't support recursive CTEs.
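For engines that do support recursive CTEs, the idea can be tried out on the question's sample data. The following is only a minimal sketch using SQLite through Python's `sqlite3` module, not Presto. The table name `links`, the CTE name `reach`, and the integer user IDs are assumptions made for illustration. It also assumes every link is stored in both directions, as in the question, and uses `UNION` rather than `UNION ALL` so that cycles cannot recurse forever.

```python
import sqlite3

# Hypothetical table and column names, modelled on the question's example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (user_id INTEGER, linked_id INTEGER)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)],
)

QUERY = """
WITH RECURSIVE reach(root, member) AS (
    -- Base case: every user reaches itself.
    SELECT user_id, user_id FROM links
    UNION
    -- Recursive case: follow one more link from anything already reached.
    -- UNION discards rows that were already produced, so cycles terminate.
    SELECT reach.root, links.linked_id
    FROM reach
    JOIN links ON links.user_id = reach.member
)
-- With symmetric links, the smallest root that reaches a member
-- identifies that member's group.
SELECT MIN(root) AS initial_user, member
FROM reach
GROUP BY member
"""


def linked_groups(connection):
    """Map each group's smallest user id to the sorted list of its other members."""
    groups = {}
    for initial_user, member in connection.execute(QUERY):
        groups.setdefault(initial_user, [])
        if member != initial_user:
            groups[initial_user].append(member)
    return {root: sorted(members) for root, members in groups.items()}


groups = linked_groups(conn)
for root, members in sorted(groups.items()):
    print(root, len(members), members)
```

On the sample rows this yields one group led by user 1 with users 2 and 3, and one led by user 4 with user 5, which matches the output the question asks for.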
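It is possible to traverse an arbitrary but finite number of links by repeatedly joining back to the table you have. Something like this, where the intermediate per-degree columns are included only for clarity: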

SELECT
  yt1.user_id,
  -- yt2.user_id is yt1's linked account, yt3.user_id is one hop further,
  -- and yt3.linked_id is one hop beyond that.
  ARRAY_AGG(DISTINCT yt2.user_id) AS first_degree_links,
  ARRAY_AGG(DISTINCT yt3.user_id) AS second_degree_links,
  ARRAY_AGG(DISTINCT yt3.linked_id) AS third_degree_links,
  ARRAY_UNION(
    ARRAY_AGG(DISTINCT yt2.user_id),
    ARRAY_UNION(ARRAY_AGG(DISTINCT yt3.user_id), ARRAY_AGG(DISTINCT yt3.linked_id))
  ) AS up_to_third_degree_links
FROM your_table AS yt1
JOIN your_table AS yt2 ON yt1.linked_id = yt2.user_id
JOIN your_table AS yt3 ON yt2.linked_id = yt3.user_id
GROUP BY yt1.user_id
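The "arbitrary but finite" limit of the repeated self-join is easy to see on a chain of users. The sketch below is an illustration only: it runs in SQLite through Python's `sqlite3` rather than Presto, and the table name `links` and the chain data are made up. With three copies of the table, a traversal starting at user 1 stops three hops away and never reaches user 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (user_id INTEGER, linked_id INTEGER)")

# A chain 1 - 2 - 3 - 4 - 5, stored in both directions like the question's data.
chain = [(1, 2), (2, 3), (3, 4), (4, 5)]
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    chain + [(b, a) for a, b in chain],
)

# Three copies of the table: each row describes one walk of three hops.
THREE_HOPS = """
SELECT l2.user_id, l3.user_id, l3.linked_id
FROM links AS l1
JOIN links AS l2 ON l1.linked_id = l2.user_id
JOIN links AS l3 ON l2.linked_id = l3.user_id
WHERE l1.user_id = ?
"""


def reachable_within_three_hops(connection, user_id):
    """Collect every user seen on any three-hop walk from user_id."""
    found = set()
    for row in connection.execute(THREE_HOPS, (user_id,)):
        found.update(row)
    found.discard(user_id)  # walks may step back to the starting user
    return sorted(found)


reached = reachable_within_three_hops(conn, 1)
print(reached)
```

Reaching user 5 would take a fourth copy of the table, and a longer chain would need more again, which is why the join depth has to be chosen in advance.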

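I've been working with a similar set of data, although I have the raw identifiers as part of the original dataset. In other words, the "email" and "phone number" from your example. I found it helpful to create a table that groups user IDs by those connecting identifiers: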

CREATE TABLE email_connections AS
SELECT
  email,
  ARRAY_AGG(DISTINCT user_id) AS users
FROM source_table
GROUP BY email

Then, by looking for intersections between the user arrays, you can compute the same arbitrary-but-finite-depth set of links:

SELECT
    3764350 AS user_id,  -- example user ID to start from
    -- ARRAY_DISTINCT removes the duplicates left after flattening,
    -- so the count reflects distinct users.
    ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users)))))) AS all_users,
    CARDINALITY(ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users))))))) AS count_all_users
FROM email_connections AS emails1
JOIN email_connections AS emails2 ON CARDINALITY(ARRAY_INTERSECT(emails1.users, emails2.users)) > 0
JOIN email_connections AS emails3 ON CARDINALITY(ARRAY_INTERSECT(emails2.users, emails3.users)) > 0
JOIN email_connections AS emails4 ON CARDINALITY(ARRAY_INTERSECT(emails3.users, emails4.users)) > 0
WHERE CONTAINS(emails1.users, 3764350)
GROUP BY 1

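Computing links to arbitrary depth is a great use case for graph database technologies like Neo4j or JanusGraph. That is what I'm now looking at to solve this "user linking" problem.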
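If the link rows can be pulled out of the database, the grouping to arbitrary depth can also be computed in application code. The sketch below uses a union-find (disjoint set) structure, which is a different technique from anything in the queries above. The function name `group_linked_users` and the integer user IDs are assumptions for illustration, with the smallest ID in each group standing in for the "initial user".

```python
from collections import defaultdict


def group_linked_users(links):
    """Group users into mutually exclusive sets of directly or indirectly linked ids."""
    parent = {}

    def find(user):
        # Return the representative of user's group, registering new users.
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for user_id, linked_id in links:
        root_a, root_b = find(user_id), find(linked_id)
        if root_a != root_b:
            # Keep the smaller id as the root so it serves as the "initial user".
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for user in sorted(parent):
        root = find(user)
        if user != root:
            groups[root].append(user)
    return dict(groups)


# The question's sample rows, with integer ids standing in for "user 1" etc.
links = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
groups = group_linked_users(links)
for initial_user, linked in groups.items():
    print(initial_user, len(linked), linked)
```

Each link is processed once and the mirrored "duplicate" rows are harmless, since merging two users that already share a group does nothing.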
