Create a frequency matrix using pandas

Posted by oo7oh9g9 on 2023-05-05 in Other


Suppose I have the following data:

import pandas as pd

df = pd.DataFrame([
    ['01', 'A'],
    ['01', 'B'],
    ['01', 'C'],
    ['02', 'A'],
    ['02', 'B'],
    ['03', 'B'],
    ['03', 'C']
], columns=['id', 'category'])

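How can I create a frequency matrix like this, where each cell counts the ids that contain both the row category and the column category?

   A  B  C
A  2  2  1
B  2  3  2
C  1  2  2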

One approach is a self join:

result = df.merge(df, on='id')
pd.pivot_table(
    result,
    index='category_x',
    columns='category_y',
    values='id',
    aggfunc='count'
)

But this makes the intermediate data very large. Is there an efficient way to do this without a self join?
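To put a number on that growth: the self join produces one row for every ordered pair of categories within an id, so an id with n categories contributes n * n rows. The following is a minimal sketch on the sample data; the names `group_sizes` and `expected_rows` are illustrative and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame(
    [["01", "A"], ["01", "B"], ["01", "C"],
     ["02", "A"], ["02", "B"],
     ["03", "B"], ["03", "C"]],
    columns=["id", "category"],
)

# Self join on id: every category of an id is paired with every
# category of the same id (including itself).
result = df.merge(df, on="id")

# An id with n categories contributes n * n rows to the join.
group_sizes = df.groupby("id").size()
expected_rows = int((group_sizes ** 2).sum())

print(len(df), len(result), expected_rows)  # prints: 7 17 17
```

On this tiny input the join only grows from 7 to 17 rows, but the quadratic term dominates as soon as some ids carry many categories.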

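Edit: My original post was closed as a duplicate of a pivot_table question. But pivot_table only accepts a columns argument that differs from index. In my case, I have only one category column, so

# Does not work
pd.pivot_table(df, columns='category', index='category', ...)

does not work.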

pandas

Source: https://stackoverflow.com/questions/76090820/create-frequency-matrix-using-pandas

1 answer


pqwbnv8z1#

Here is one approach using combinations_with_replacement and Counter, both from the Python standard library:

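from collections import Counter
from itertools import combinations_with_replacement

# For each id, list its categories, generate every unordered pair
# (including each category paired with itself), and count the pairs.
pair_counts = Counter(
    df.groupby("id")
    .agg(list)
    .apply(lambda x: list(combinations_with_replacement(x["category"], 2)), axis=1)
    .sum()  # concatenates the per-id lists of pairs into one list
)

# Fill a symmetric matrix from the pair counts.
new_df = pd.DataFrame()
for pair, count in pair_counts.items():
    new_df.at[pair[0], pair[1]] = count
    new_df.at[pair[1], pair[0]] = count

# Pairs that never co-occur stay NaN; treat them as 0 before casting.
new_df = new_df.fillna(0).astype(int)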

Then:

print(new_df)
# Output

   A  B  C
A  2  2  1
B  2  3  2
C  1  2  2
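
Another option that avoids both the self join and the Python-level loop is to build an id-by-category indicator table with pd.crosstab and multiply its transpose by itself. This is a sketch of an alternative technique, not the method from the answer above. It assumes that a category should count at most once per id (hence the clip), and the names `indicator` and `freq` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["01", "A"], ["01", "B"], ["01", "C"],
     ["02", "A"], ["02", "B"],
     ["03", "B"], ["03", "C"]],
    columns=["id", "category"],
)

# Rows are ids, columns are categories. clip(upper=1) turns counts into
# 0/1 flags, which is the assumption stated above.
indicator = pd.crosstab(df["id"], df["category"]).clip(upper=1)

# (categories x ids) @ (ids x categories): cell [a, b] is the number of
# ids that contain both category a and category b.
freq = indicator.T @ indicator

# Drop the axis names that crosstab leaves behind, for cleaner printing.
freq.index.name = None
freq.columns.name = None

print(freq)
```

The intermediate table here has one row per id and one column per category, so its size depends on the number of ids and categories rather than on the square of each group's size. Whether that is smaller than the self join in practice depends on the data.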

2023-05-05
