Counting co-occurrences in Pandas without any overlap

3zwjbxry  posted on 2022-12-10 in Other

I have the following DataFrame:

import pandas as pd
df = pd.DataFrame({'TFD' : ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                    'Snack' : [1, 0, 1, 1, 0, 0],
                    'Trans' : [1, 1, 1, 0, 0, 1],
                    'Dop' : [1, 0, 1, 0, 1, 1]}).set_index('TFD')
df

    Snack   Trans   Dop
TFD         
AA  1   1   1
SL  0   1   0
BB  1   1   1
D0  1   0   0
Dk  0   0   1
FF  0   1   1

Using this approach, I can compute the following co-occurrence matrix:

df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)
coocc

    Snack   Trans   Dop
Snack   3   2   2
Trans   2   4   3
Dop     2   3   4

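However, I do not want any overlap.
What I mean is:

  • In the original df, only 1 TFD has **only** Snack, so the [Snack, Snack] value in the coocc table should be 1.
  • Likewise, [Dop, Trans] should equal 1, not 3 (the calculation above gives 3 because it also counts the [Dop, Snack, Trans] combinations, which is what I want to avoid).
  • Order does not matter: [Dop, Trans] is the same as [Trans, Dop].
  • There should be an ['all', 'all'] [row, column] entry that gives the number of events containing all elements.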
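To make the difference between the overlapping and the exclusive count concrete, here is a minimal sketch. The helper name `exclusive_count` is hypothetical (it is not part of the original post); it counts the rows whose 0/1 pattern matches the requested columns exactly, with every other column equal to 0.

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')


def exclusive_count(frame, cols):
    """Hypothetical helper: rows where exactly `cols` are 1 and all other columns are 0."""
    wanted = frame.columns.isin(cols)              # boolean pattern, one entry per column
    actual = frame.to_numpy().astype(bool)         # 0/1 values as booleans
    # a row matches only if it agrees with the wanted pattern in every column
    return int((actual == wanted).all(axis=1).sum())


overlapping = int(df.T.dot(df).loc['Dop', 'Trans'])    # also counts rows that contain Snack
exclusive = exclusive_count(df, ['Dop', 'Trans'])      # rows with Dop and Trans only
```

Because the column list is turned into a membership pattern, the order of the names passed in does not matter, which matches the third requirement above.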

My solution consists of the following steps:
First, for each row of df, get the list of columns whose value equals 1:

llist = []
for k,v in df.iterrows():
    llist.append((list(v[v==1].index)))
llist

[['Snack', 'Trans', 'Dop'],
 ['Trans'],
 ['Snack', 'Trans', 'Dop'],
 ['Snack'],
 ['Dop'],
 ['Trans', 'Dop']]
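As a side note, the same list can probably be built without iterrows. This is only a sketch, assuming every column of df holds 0/1 values: each row is converted to a boolean mask and used to index df.columns.

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# each row becomes a boolean mask that selects the names of the active columns
llist = [list(df.columns[row]) for row in df.to_numpy().astype(bool)]
```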

Then I duplicate the sublists that contain only 1 element:

llist2 = llist.copy()
for i,l in enumerate(llist2):
    if len(l) == 1:
        llist2[i] = l + l
    if len(l) == 3:
        llist2[i] = ['all', 'all'] # this is to see how many triple elements I have in the list
llist2.append(['Dop', 'Trans']) # This is to test that the order of the elements of the sublists doesn't matter
llist2

[['all', 'all'],
 ['Trans', 'Trans'],
 ['all', 'all'],
 ['Snack', 'Snack'],
 ['Dop', 'Dop'],
 ['Trans', 'Dop'],
 ['Dop', 'Trans']]

Next, I create an empty DataFrame with the required index and columns:

elements = ['Trans', 'Dop', 'Snack', 'all']
foo = pd.DataFrame(columns=elements, index=elements)
foo.fillna(0,inplace=True)
foo

Trans   Dop Snack   all
Trans   0   0   0   0
Dop     0   0   0   0
Snack   0   0   0   0
all     0   0   0   0

Then I check and count which combinations are contained in llist2:

from itertools import combinations_with_replacement
import collections

comb = combinations_with_replacement(elements, 2)
for l in comb:
    val = foo.loc[l[0],l[1]]
    foo.loc[l[0],l[1]] = val + llist2.count(list(l))
    if (set(l).__len__() != 1) and (list(reversed(list(l))) in llist2): # check if the reversed element exists as well, but do not double count the diagonal elements
        val = foo.loc[l[0],l[1]]
        foo.loc[l[0],l[1]] = val + llist2.count(list(reversed(list(l))))
foo

Trans   Dop Snack   all
Trans   1   2   0   0
Dop     0   1   0   0
Snack   0   0   1   0
all     0   0   0   2

The last step is to make foo symmetric:

import numpy as np

foo = np.maximum( foo, foo.transpose() )
foo

Trans   Dop Snack   all
Trans   1   2   0   0
Dop     2   1   0   0
Snack   0   0   1   0
all     0   0   0   2

I am looking for a more efficient/faster solution (one that avoids all these loops).

kupeojn6


I managed to reduce it to a single "for" loop. I used "any" and "all" in combination with a mask.

import pandas as pd
import itertools

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop':   [1, 0, 1, 0, 1, 1]}).set_index('TFD')

df["all"] = 0  # adding an artificial column so the result contains "all"
list_of_columns = list(df.columns)
my_result_list = []  # empty list where we put the results
comb = itertools.combinations_with_replacement(list_of_columns, 2)
for item in comb:
    temp_list = list_of_columns[:]  # temp_list holds columns of interest
    if item[0] == item[1]:
        temp_list.remove(item[0])
        my_col_list = [item[0]]  # my_col_list holds which occurrence we count
    else:
        temp_list.remove(item[0])
        temp_list.remove(item[1])
        my_col_list = [item[0], item[1]]

    mask = df.loc[:, temp_list].any(axis=1)  # creating mask so we know which rows to look at
    distance = df.loc[~mask, my_col_list].all(axis=1).sum()  # calculating the occurrence count
    my_result_list.append([item[0], item[1], distance])  # occurrence info recorded in the list
    my_result_list.append([item[1], item[0], distance])  # occurrence put in reverse so we get square form in the end

result = pd.DataFrame(my_result_list).drop_duplicates().pivot(index=1, columns=0, values=2)  # construct DataFrame in square form
list_of_columns.remove("all")
result.loc["all", "all"] = df.loc[:, list_of_columns].all(axis=1).sum()  # fill in all/all occurrences
print(result)
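If you want to drop the remaining loop as well, one possible loop-free variant is sketched below. It is not the code above, just an alternative under two assumptions: all values are 0/1, and there are more than two columns (so that "a pair" and "all" are different cases). Rows are split by their number of active columns. Rows with exactly one active column feed the diagonal, rows with exactly two feed the off-diagonal entries through the same dot-product trick used in the question, and rows with every column active feed all/all. The names values, n_active, pairs, singles and mat are made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

values = df.to_numpy()
n_active = values.sum(axis=1)          # number of 1s in each row
k = values.shape[1]                    # number of original columns

singles = values[n_active == 1]        # rows with exactly one active column
pairs = values[n_active == 2]          # rows with exactly two active columns

mat = np.zeros((k + 1, k + 1), dtype=int)
mat[:k, :k] = pairs.T @ pairs          # exclusive pair counts (diagonal is overwritten next)
idx = np.arange(k)
mat[idx, idx] = singles.sum(axis=0)    # diagonal: rows that contain only that column
mat[k, k] = (n_active == k).sum()      # all/all: rows where every column is 1

labels = list(df.columns) + ['all']
result = pd.DataFrame(mat, index=labels, columns=labels)
print(result)
```

The matrix is symmetric by construction, so no extra symmetrization step is needed, and the cost no longer grows with the number of column pairs.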
