Nested loop over all pairs of rows in a pandas DataFrame

6psbrbz9  posted on 2023-03-06 in Other

I have a DataFrame in the following format, with ~80K rows.

df = pd.DataFrame({'Year': [1900, 1902, 1903], 'Name': ['Tom', 'Dick', 'Harry']})

   Year   Name
0  1900    Tom
1  1902   Dick
2  1903  Harry

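I need to call a function with every pair of values from the Name column as arguments. I'm currently using the following code (with the function call replaced by print):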

for i, n1 in enumerate(df.itertuples()):
    for n2 in df[i:].itertuples():
        print(n1.Name, n2.Name)

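Is there a way to speed this up that I'm missing?
PS: I need to keep track of the index of each name pair. So if I run itertools.combinations on the index, I still have to make expensive df.loc calls.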


djp7away  #1

Another solution that keeps track of the index/year is a cross join:

import pandas as pd 

df = pd.DataFrame({'Year': [1900, 1902, 1903], 'Name': ['Tom', 'Dick', 'Harry']})
df = df.reset_index()
print(df.join(df, how='cross', lsuffix='_1', rsuffix='_2'))

Output:

   index_1  Year_1 Name_1  index_2  Year_2 Name_2
0        0    1900    Tom        0    1900    Tom
1        0    1900    Tom        1    1902   Dick
2        0    1900    Tom        2    1903  Harry
3        1    1902   Dick        0    1900    Tom
4        1    1902   Dick        1    1902   Dick
5        1    1902   Dick        2    1903  Harry
6        2    1903  Harry        0    1900    Tom
7        2    1903  Harry        1    1902   Dick
8        2    1903  Harry        2    1903  Harry
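
A cross join materializes every ordered pair, so the result has n² rows. As a rough back-of-the-envelope check for the ~80K rows mentioned in the question (the variable names below are only illustrative):

```python
n = 80_000  # approximate row count from the question

# Rows produced by a full cross join (every ordered pair, including self-pairs).
all_pairs = n * n

# Rows actually needed when each unordered pair (plus self-pairs) is visited once,
# which is what the nested loop in the question does.
needed_pairs = n * (n + 1) // 2

print(all_pairs, needed_pairs)
```

At that size the full cross join may well not fit in memory, so it is probably worth trying this approach on a subset first.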

bvjxkvbb  #2

You can use:

out = (df.reset_index().merge(df.reset_index(), how='cross')
         .query('index_x <= index_y'))
print(out)

# Output
   index_x  Year_x Name_x  index_y  Year_y Name_y
0        0    1900    Tom        0    1900    Tom
1        0    1900    Tom        1    1902   Dick
2        0    1900    Tom        2    1903  Harry
4        1    1902   Dick        1    1902   Dick
5        1    1902   Dick        2    1903  Harry
8        2    1903  Harry        2    1903  Harry

q8l4jmvw  #3

You can use itertools.combinations:

import pandas as pd 
from itertools import combinations

df = pd.DataFrame({'Year': [1900, 1902, 1903], 'Name': ['Tom', 'Dick', 'Harry']})

for c in combinations(df['Name'], 2):
    print(c)

Output:

('Tom', 'Dick')
('Tom', 'Harry')
('Dick', 'Harry')

Or, if you need combinations with replacement (edit: and want to keep track of the index):

from itertools import combinations_with_replacement
for (i1, n1), (i2, n2) in combinations_with_replacement(df.reset_index()[['index', 'Name']].values, 2):
    print(f"{i1}: {n1}, {i2}: {n2}")

Output:

0: Tom, 0: Tom
0: Tom, 1: Dick
0: Tom, 2: Harry
1: Dick, 1: Dick
1: Dick, 2: Harry
2: Harry, 2: Harry
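
To connect this back to the question, here is a minimal sketch of calling a function on each pair while keeping the index labels, without any df.loc lookups inside the loop. The function compare is a hypothetical stand-in for the real per-pair function, and results is just one possible way to store the outcome:

```python
import pandas as pd
from itertools import combinations_with_replacement

df = pd.DataFrame({'Year': [1900, 1902, 1903], 'Name': ['Tom', 'Dick', 'Harry']})

def compare(n1, n2):
    # Hypothetical placeholder for the real per-pair function.
    return f"{n1}-{n2}"

# Plain Python tuples of (index label, name), built once up front.
rows = list(zip(df.index.tolist(), df['Name'].tolist()))

# Map each (index1, index2) pair to the function result.
results = {}
for (i1, n1), (i2, n2) in combinations_with_replacement(rows, 2):
    results[(i1, i2)] = compare(n1, n2)

print(results)
```

Pairing the index with the name before iterating means each pair already carries its labels. For 80K rows the loop still runs about 3.2 billion times, so the per-pair function is likely to dominate the runtime.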

qc6wkl3g  #4

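If you need a pandas solution whose output matches combinations_with_replacement, you can use a cross join and then filter the required rows with a mask:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [1900, 1902, 1903], 'Name': ['Tom', 'Dick', 'Harry']})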

mask = np.ravel(np.triu(np.ones((len(df),len(df)), dtype=bool)))
df1 = df.reset_index()
out = df1.merge(df1, how='cross')[mask]
print(out)
   index_x  Year_x Name_x  index_y  Year_y Name_y
0        0    1900    Tom        0    1900    Tom
1        0    1900    Tom        1    1902   Dick
2        0    1900    Tom        2    1903  Harry
4        1    1902   Dick        1    1902   Dick
5        1    1902   Dick        2    1903  Harry
8        2    1903  Harry        2    1903  Harry

The solution works with any index:

df = pd.DataFrame({'Year': [1900, 1902, 1903], 
                   'Name': ['Tom', 'Dick', 'Harry']}, 
                   index=list('abc'))
print(df)
   Year   Name
a  1900    Tom
b  1902   Dick
c  1903  Harry

mask = np.ravel(np.triu(np.ones((len(df), len(df)), dtype=bool)))
df1 = df.reset_index()
df = df1.merge(df1, how='cross')[mask]
print(df)
  index_x  Year_x Name_x index_y  Year_y Name_y
0       a    1900    Tom       a    1900    Tom
1       a    1900    Tom       b    1902   Dick
2       a    1900    Tom       c    1903  Harry
4       b    1902   Dick       b    1902   Dick
5       b    1902   Dick       c    1903  Harry
8       c    1903  Harry       c    1903  Harry
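
A possible variation on the same idea, not part of the answer above: np.triu_indices returns the positions of the upper triangle directly, so the pairs can be built without materializing the full cross join first. This is only a sketch that keeps the index and Name columns; the column names mirror the merge output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [1900, 1902, 1903],
                   'Name': ['Tom', 'Dick', 'Harry']},
                  index=list('abc'))

# Row positions (i, j) with i <= j: the upper triangle including the diagonal.
i, j = np.triu_indices(len(df))

labels = df.index.to_numpy()
names = df['Name'].to_numpy()

# Fancy indexing with the position arrays builds both sides of every pair.
out = pd.DataFrame({
    'index_x': labels[i],
    'Name_x': names[i],
    'index_y': labels[j],
    'Name_y': names[j],
})
print(out)
```

Passing k=1 to np.triu_indices would skip the diagonal, which corresponds to plain combinations without self-pairs.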

bgtovc5b  #5

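Another approach with a lower memory footprint, inspired by @Tranbi and combinations_with_replacement (it avoids generating all combinations and then discarding some of them):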

from itertools import combinations_with_replacement

# Enhanced by @mozway
a, b = map(list, zip(*combinations_with_replacement(df.index, 2)))
out = pd.concat([df.loc[a].reset_index(), 
                 df.loc[b].reset_index().add_suffix('2')],
                axis=1)
print(out)

# Output
   index  Year   Name  index2  Year2  Name2
0      0  1900    Tom       0   1900    Tom
1      0  1900    Tom       1   1902   Dick
2      0  1900    Tom       2   1903  Harry
3      1  1902   Dick       1   1902   Dick
4      1  1902   Dick       2   1903  Harry
5      2  1903  Harry       2   1903  Harry

57hvy0tb  #6

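Another possible solution, based on numpy.broadcast_arrays:

import numpy as np
import pandas as pd

# Assumes df is the DataFrame from the question.
b = a = df['Name'].values.astype(str)
i1 = i2 = df.index.values

# Reshape into a column and a row so that broadcasting yields every pair.
a, b, i1, i2 = a[:, None], b[None, :], i1[:, None], i2[None, :]

# Flatten the broadcast grids into two-column arrays of index pairs and name pairs.
indexes = np.stack(np.broadcast_arrays(i1, i2), axis=-1).reshape(-1,2)
names = np.stack(np.broadcast_arrays(a, b), axis=-1).reshape(-1,2)

# Drop duplicate index pairs, then keep only the pairs with i1 <= i2.
_, idx = np.unique(indexes, return_index=True, axis=0)
indexesu = indexes[idx, :]
namesu = names[idx, :]
m = indexesu[:,0] <= indexesu[:,1]

# hstack converts everything to strings, so cast the index columns back to int.
result = np.hstack([indexesu[m, :], namesu[m, :]])
out = pd.DataFrame(result, columns=['i1', 'i2', 'Name1', 'Name2'])
out = out.assign(**out[['i1', 'i2']].astype(int))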

Output:

   i1  i2  Name1  Name2
0   0   0    Tom    Tom
1   0   1    Tom   Dick
2   0   2    Tom  Harry
3   1   1   Dick   Dick
4   1   2   Dick  Harry
5   2   2  Harry  Harry
