Can pandas perform repeated joins across multiple tables more efficiently?

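I have a problem that involves performing an inner join across multiple tables that may each contain tens of millions of rows. I know this generally can't be optimized much beyond a standard join in something like pandas or polars, but I think this case has some caveats that could allow a more efficient solution, namely that the tables are related to each other.
Specifically, the tables are grid pairings, where the columns contain indices indicating where one grid matches another. They look like this: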

| A |  B   |
| - | ---- |
| 0 | 1, 2 |
| 1 | 4, 5 |
| 3 | 7    |

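This means A[0] matches B[1] and B[2]; A[1] matches B[4] and B[5]; and A[3] matches B[7].
In practice I need to handle an arbitrary number of grids, but as an example, suppose I have 3 grids (A, B, C), which gives 3 tables/grid pairings (3 choose 2 = [AB, AC, BC]):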

| A |  B   |
| - | ---- |
| 0 | 1, 2 |
| 1 | 4, 5 |
| 3 | 7    |

| A |    C    |
| - | ------- |
| 0 | 1, 2, 3 |
| 1 | 2       |

| B |    C    |
| - | ------- |
| 1 | 1, 2    |
| 2 | 1, 2    |
| 5 | 1, 2, 3 |

The final result is the inner join of all these tables, which looks like this:

| A |  B   |  C   |
| - | ---- | ---- |
| 0 | 1, 2 | 1, 2 |
| 1 | 5    | 2    |

Or, equivalently:

| A | B | C |
| - | - | - |
| 0 | 1 | 1 |
| 0 | 1 | 2 |
| 0 | 2 | 1 |
| 0 | 2 | 2 |
| 1 | 5 | 2 |

This gives the indices that match simultaneously across all grids: A[0] matches B[1] matches C[1], and so on.
The example above in Python:

from functools import reduce
import pandas as pd 

def create_table(indices: dict, names: list) -> pd.MultiIndex:
    table = pd.DataFrame.from_dict(indices, orient='index').stack().astype(int)
    index = table.reset_index().set_index(['level_0', 0]).index
    return index.set_names(names)
    
AB = create_table({
    0: [1, 2],
    1: [4, 5],
    3: [7],
}, ['A', 'B'])

AC = create_table({
    0: [1, 2, 3],
    1: [2],
}, ['A', 'C'])

BC = create_table({
    1: [1, 2],
    2: [1, 2],
    5: [1, 2, 3],
}, ['B', 'C'])

join   = lambda df1, df2: df1.join(df2, how='inner')
result = reduce(join, [AB, AC, BC]).reorder_levels(['A', 'B', 'C'])
print(result)

MultiIndex([(0, 1, 1),
            (0, 1, 2),
            (0, 2, 1),
            (0, 2, 2),
            (1, 5, 2)],
           names=['A', 'B', 'C'])

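This works fine until I have several tables with millions of rows. If I don't enforce the global matching constraint, and instead use a single "anchor" grid to match against (A matches B; A matches C; don't check whether B matches C), I have a solution that runs in a few seconds with minimal memory requirements. However, adding the B<=>C check in the general case requires more than an order of magnitude more resources, because I now need to expand the inner lists.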
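To make the cost difference concrete, here is a minimal sketch of the anchor join followed by the B<=>C check, using the small example from above. The flat `AB`, `AC`, `BC` frames, the `pair_key` helper and the `stride` encoding are illustrative assumptions, not the original pipeline, and the encoding assumes non-negative integer indices.

```python
import numpy as np
import pandas as pd

# Flat (already expanded) versions of the three example pairings.
AB = pd.DataFrame({"A": [0, 0, 1, 1, 3], "B": [1, 2, 4, 5, 7]})
AC = pd.DataFrame({"A": [0, 0, 0, 1], "C": [1, 2, 3, 2]})
BC = pd.DataFrame({"B": [1, 1, 2, 2, 5, 5, 5], "C": [1, 2, 1, 2, 1, 2, 3]})

# Step 1: anchor join on A only. This is the step that expands the
# inner lists into every (B, C) combination that shares an A index.
anchored = AB.merge(AC, on="A", how="inner")

# Step 2: keep only rows whose (B, C) pair also appears in BC.
# Assumption: indices are non-negative, so each pair can be packed
# into a single integer key.
stride = int(max(anchored["C"].max(), BC["C"].max())) + 1

def pair_key(df: pd.DataFrame) -> np.ndarray:
    """Encode each (B, C) pair as one integer (hypothetical helper)."""
    return df["B"].to_numpy() * stride + df["C"].to_numpy()

result = anchored[np.isin(pair_key(anchored), pair_key(BC))].reset_index(drop=True)
print(len(anchored), len(result))
print(result)
```

In this toy case the anchored intermediate has more rows than the final result, and that intermediate is the part that grows quickly as the inner lists get longer.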
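The only way around this that I can see is to keep the tables in a compressed format (allowing lists as column elements), but then I would need some efficient way to invert (transpose) a table while it stays in that compressed format, e.g.: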

| A |  B   |      | B |  A   |
| - | ---- |  =>  | - | ---- | 
| 0 | 1, 2 |      | 1 | 0, 1 |
| 1 | 1    |      | 2 | 0    |
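One possible way to do this inversion, sketched with NumPy only: flatten the lists once, sort by the target index, and split back into groups. The function name `invert_pairing` and the dict-of-lists return type are assumptions made for illustration. Note that it still materializes one flat array per pairing, so it only avoids the blow-up from the join, not the flattening of a single table.

```python
import numpy as np

def invert_pairing(lists):
    """Invert a list-valued mapping A -> [B, ...] into B -> [A, ...].

    `lists[i]` holds the B indices matched by A[i] (hypothetical layout).
    """
    lengths = np.fromiter((len(x) for x in lists), dtype=np.int64, count=len(lists))
    # Source index repeated once per match, and the flattened targets.
    src = np.repeat(np.arange(len(lists), dtype=np.int64), lengths)
    dst = np.concatenate([np.asarray(x, dtype=np.int64) for x in lists])
    # A stable sort by target keeps the sources in ascending order per group.
    order = np.argsort(dst, kind="stable")
    dst_sorted, src_sorted = dst[order], src[order]
    # Start offset of each distinct target marks the group boundaries.
    keys, starts = np.unique(dst_sorted, return_index=True)
    groups = np.split(src_sorted, starts[1:])
    return {int(k): g.tolist() for k, g in zip(keys, groups)}

inverted = invert_pairing([[1, 2], [1]])
print(inverted)  # {1: [0, 1], 2: [0]}
```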

Here is a polars solution that runs a more realistic stress test:

from sklearn.neighbors import BallTree
from itertools import combinations
from functools import reduce
import numpy as np 
import polars as pl 
import time

def create_table_pl(indices, names: list):
    c1 = np.repeat(np.arange(len(indices), dtype='int64'), list(map(len, indices)) )
    c2 = np.concatenate(indices).astype(dtype='int64')
    print(f'{names}: {len(c2):,} rows')
    return pl.DataFrame(dict(zip(names, [c1,c2]))).set_sorted(names[0]).lazy()

def match(grids): 
    """ Use a BallTree to find close elements between grids """
    return BallTree(grids[1][:, None], p=np.inf).query_radius(grids[0][:, None], 5)

# Previous example
# AB = create_table_pl([[1,2], [4,5], [], [7]], ['A', 'B'])
# AC = create_table_pl([[1,2,3], [2]], ['A', 'C'])
# BC = create_table_pl([[], [1,2], [1,2], [], [], [1,2,3]], ['B', 'C'])
# pairs = [AB, AC, BC]
# names = ['A','B','C']

n = 500000
grids = np.arange(n), np.arange(0,n,2), np.arange(0,n,3), np.arange(0,n,5)
names = list('abcd')
pairs = list(map(create_table_pl, map(match, combinations(grids, 2)), combinations(names, 2)))

start = time.time()
join  = lambda df1, df2: df1.join(df2, on=set(df2.columns).intersection(set(df1.columns)))
table = reduce(join, pairs).select(names).sort(names)

print(table.collect())
print(f'{time.time()-start:.2f} seconds')

('a', 'b'): 2,749,985 rows
('a', 'c'): 1,833,325 rows
('a', 'd'): 1,099,994 rows
('b', 'c'): 916,662 rows
('b', 'd'): 549,997 rows
('c', 'd'): 366,665 rows
shape: (11_183_244, 4)
┌────────┬────────┬────────┬───────┐
│ a      ┆ b      ┆ c      ┆ d     │
│ ---    ┆ ---    ┆ ---    ┆ ---   │
│ i64    ┆ i64    ┆ i64    ┆ i64   │
╞════════╪════════╪════════╪═══════╡
│ 0      ┆ 0      ┆ 0      ┆ 0     │
│ 0      ┆ 0      ┆ 0      ┆ 1     │
│ 0      ┆ 0      ┆ 1      ┆ 0     │
│ 0      ┆ 0      ┆ 1      ┆ 1     │
│ …      ┆ …      ┆ …      ┆ …     │
│ 499999 ┆ 249998 ┆ 166665 ┆ 99999 │
│ 499999 ┆ 249998 ┆ 166666 ┆ 99999 │
│ 499999 ┆ 249999 ┆ 166665 ┆ 99999 │
│ 499999 ┆ 249999 ┆ 166666 ┆ 99999 │
└────────┴────────┴────────┴───────┘
1.66 seconds

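Alternatively, the whole problem can equivalently be posed as a graph, where the entries in the final table correspond to all cycles in the graph... but I'm not sure that would lead to any more efficient solution.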
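As a rough illustration of the graph view, the sketch below keeps each pairing as adjacency sets and builds rows grid by grid, intersecting the neighbour sets of the indices chosen so far. Every returned row is a set of indices that are pairwise matched (for three grids, a triangle). The function name `enumerate_matches` and the `adjacency` layout are assumptions for illustration only, and nothing here has been benchmarked against the join-based approach.

```python
def enumerate_matches(names, adjacency):
    """Yield tuples with one index per grid such that every pair is matched.

    adjacency[(x, y)] maps an index of grid x to the set of matching
    indices of grid y, for every x that precedes y in `names`
    (hypothetical layout; requires at least two grids).
    """
    def extend(partial):
        depth = len(partial)
        if depth == len(names):
            yield tuple(partial)
            return
        # Candidates for the next grid must match every index chosen so far.
        neighbour_sets = [
            adjacency[(names[i], names[depth])].get(partial[i], set())
            for i in range(depth)
        ]
        for candidate in sorted(set.intersection(*neighbour_sets)):
            yield from extend(partial + [candidate])

    for start in sorted(adjacency[(names[0], names[1])]):
        yield from extend([start])

adjacency = {
    ("A", "B"): {0: {1, 2}, 1: {4, 5}, 3: {7}},
    ("A", "C"): {0: {1, 2, 3}, 1: {2}},
    ("B", "C"): {1: {1, 2}, 2: {1, 2}, 5: {1, 2, 3}},
}
rows = list(enumerate_matches(["A", "B", "C"], adjacency))
print(rows)
```

The inner lists are never expanded into a joined table here; only the final rows are materialized, at the price of a Python-level loop.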

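polars has no MultiIndex, so that isn't a viable way to store the data in polars. The following seems to work, although I'm not sure how well it scales; polars is generally faster and more memory-efficient than pandas, though. I think you should give more hints about what the scale looks like. Is it A, B, and C with millions of rows each, or something else?
Anyway, here it is: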

import polars as pl

plAB=pl.DataFrame({'A':[0,1,3],'B':[[1,2],[4,5],[7]]})
plAC=pl.DataFrame({'A':[0,1],'C':[[1,2,3],[2]]})
plBC=pl.DataFrame({'B':[1,2,5],'C':[[1,2],[1,2],[1,2,3]]})

def alljoin(*dfs):
    """go through all dfs and join them sequentially"""
    for i, df in enumerate(dfs):
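        # explode every list-typed column (comparing against the pl.List class
        # matches any inner type; pl.List() without an inner type fails on recent polars)
        df=df.explode([x for x,y in df.schema.items() if y==pl.List])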
        if i==0:
            progress=df
        else:
            progress = progress.join(df, 
                on=[x for x in df.columns if x in progress.columns])

    progress=progress.select(sorted(progress.columns))
    return progress
            
            
alljoin(plAB, plAC, plBC)

shape: (5, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 1   ┆ 1   │
│ 0   ┆ 1   ┆ 2   │
│ 0   ┆ 2   ┆ 1   │
│ 0   ┆ 2   ┆ 2   │
│ 1   ┆ 5   ┆ 2   │
└─────┴─────┴─────┘
