Counting co-occurrences in Pandas without any overlap

I have the following DataFrame:

import pandas as pd
df = pd.DataFrame({'TFD' : ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                    'Snack' : [1, 0, 1, 1, 0, 0],
                    'Trans' : [1, 1, 1, 0, 0, 1],
                    'Dop' : [1, 0, 1, 0, 1, 1]}).set_index('TFD')
df

     Snack  Trans  Dop
TFD
AA       1      1    1
SL       0      1    0
BB       1      1    1
D0       1      0    0
Dk       0      0    1
FF       0      1    1

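Using this (a dot product of the transposed frame with the frame itself), I can compute the following co-occurrence matrix: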

df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)
coocc

       Snack  Trans  Dop
Snack      3      2    2
Trans      2      4    3
Dop        2      3    4
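To see where the overlap in this matrix comes from, here is a small sketch. The names `both` and `only_pair` are illustrative and not part of the original post. The dot product counts every row in which both columns are 1, whatever the remaining column holds:

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# Rows that contribute to coocc.loc['Dop', 'Trans']: both columns are 1,
# regardless of the value in 'Snack'.
both = df[(df['Dop'] == 1) & (df['Trans'] == 1)]
print(both.index.tolist())

# Of those, only the rows where 'Snack' is 0 contain exactly Dop and Trans.
only_pair = both[both['Snack'] == 0]
print(only_pair.index.tolist())
```

The first selection has three rows, which is the 3 in the matrix; the second keeps only the rows that contain nothing else.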

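However, I do not want any overlap.
What I mean is:

In the original df there is only 1 TFD that has **ONLY** Snack, so the value at [Snack, Snack] in the coocc table should be 1.
Also, [Dop, Trans] should be equal to 1 rather than 3 (the calculation above gives 3 because it also counts the [Dop, Snack, Trans] combinations, which is what I want to avoid).
Also, the order should not matter: [Dop, Trans] is the same as [Trans, Dop].
Have an ['all', 'all'] [row, column] entry that gives the number of events containing all elements.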
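One way to make these requirements concrete is to count each row only under its exact 0/1 pattern. The sketch below does that with `groupby(...).size()`; the name `pattern_counts` is an assumption for illustration and does not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# Group rows by their full (Snack, Trans, Dop) pattern and count each group.
# A row with pattern (1, 1, 1) is counted only there, never under (0, 1, 1).
pattern_counts = df.groupby(list(df.columns)).size()
print(pattern_counts)

print(pattern_counts.loc[(1, 0, 0)])  # rows with ONLY Snack
print(pattern_counts.loc[(0, 1, 1)])  # rows with exactly Trans and Dop
print(pattern_counts.loc[(1, 1, 1)])  # rows containing all elements
```

Patterns that never occur in the data, such as (1, 1, 0), are simply absent from the result and correspond to a count of 0.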

My solution consists of the following steps.
First, for each row of df, get the list of columns whose value equals 1:

llist = []
for k,v in df.iterrows():
    llist.append((list(v[v==1].index)))
llist

[['Snack', 'Trans', 'Dop'],
 ['Trans'],
 ['Snack', 'Trans', 'Dop'],
 ['Snack'],
 ['Dop'],
 ['Trans', 'Dop']]
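As a hedged aside, the same list can probably be built without `iterrows` by indexing the column labels with a boolean mask for each row. This is a sketch of an alternative, not the code used in the original post; `active` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# Boolean matrix: True where the original value is 1.
active = df.to_numpy(dtype=bool)

# For each row, keep the column labels where the mask is True.
llist = [df.columns[row].tolist() for row in active]
print(llist)
```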

Then I duplicate the element of every sublist that contains only 1 element:

llist2 = llist.copy()
for i,l in enumerate(llist2):
    if len(l) == 1:
        llist2[i] = l + l
    if len(l) == 3:
        llist2[i] = ['all', 'all'] # this is to see how many triple elements I have in the list
llist2.append(['Dop', 'Trans']) # This is to test that the order of the elements of the sublists doesn't matter
llist2

[['all', 'all'],
 ['Trans', 'Trans'],
 ['all', 'all'],
 ['Snack', 'Snack'],
 ['Dop', 'Dop'],
 ['Trans', 'Dop'],
 ['Dop', 'Trans']]

Next, I create an empty DataFrame with the required index and columns:

elements = ['Trans', 'Dop', 'Snack', 'all']
foo = pd.DataFrame(0, index=elements, columns=elements)  # zero-filled integer frame
foo

       Trans  Dop  Snack  all
Trans      0    0      0    0
Dop        0    0      0    0
Snack      0    0      0    0
all        0    0      0    0

Then I check and count which combinations are contained in llist2:

from itertools import combinations_with_replacement

comb = combinations_with_replacement(elements, 2)
for l in comb:
    pair = list(l)
    foo.loc[l[0], l[1]] += llist2.count(pair)
    # also count the reversed pair, but do not double count the diagonal elements
    if len(set(l)) != 1:
        foo.loc[l[0], l[1]] += llist2.count(pair[::-1])
foo

       Trans  Dop  Snack  all
Trans      1    2      0    0
Dop        0    1      0    0
Snack      0    0      1    0
all        0    0      0    2

The last step is to make foo symmetric:

import numpy as np

foo = np.maximum( foo, foo.transpose() )
foo

       Trans  Dop  Snack  all
Trans      1    2      0    0
Dop        2    1      0    0
Snack      0    0      1    0
all        0    0      0    2
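The reversed-pair check and the symmetrisation step could likely be folded into one by normalising each pair before counting. The sketch below sorts every pair and counts with `collections.Counter`; it starts from the `llist2` shown earlier, and `pair_counts` is an illustrative name, not something from the original post.

```python
from collections import Counter

import pandas as pd

llist2 = [['all', 'all'],
          ['Trans', 'Trans'],
          ['all', 'all'],
          ['Snack', 'Snack'],
          ['Dop', 'Dop'],
          ['Trans', 'Dop'],
          ['Dop', 'Trans']]
elements = ['Trans', 'Dop', 'Snack', 'all']

# Sorting each pair makes ['Trans', 'Dop'] and ['Dop', 'Trans'] the same key.
pair_counts = Counter(tuple(sorted(pair)) for pair in llist2)

foo = pd.DataFrame(0, index=elements, columns=elements)
for (a, b), n in pair_counts.items():
    foo.loc[a, b] = n
    foo.loc[b, a] = n  # write both cells so the table is symmetric
print(foo)
```

This still loops, but only over the distinct pairs that actually occur rather than over every possible combination.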

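I am looking for a more efficient/faster solution (one that avoids all these loops).

I managed to reduce it to a single "for" loop, using "any" and "all" in combination with a "mask".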

import pandas as pd
import itertools

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop':   [1, 0, 1, 0, 1, 1]}).set_index('TFD')

df["all"] = 0  # add an artificial column so the result contains "all"
list_of_columns = list(df.columns)
my_result_list = []  # empty list where we put the results
comb = itertools.combinations_with_replacement(list_of_columns, 2)
for item in comb:
    temp_list = list_of_columns[:]  # temp_list holds columns of interest
    if item[0] == item[1]:
        temp_list.remove(item[0])
        my_col_list = [item[0]]  # my_col_list holds the columns whose occurrence we count
    else:
        temp_list.remove(item[0])
        temp_list.remove(item[1])
        my_col_list = [item[0], item[1]]

    mask = df.loc[:, temp_list].any(axis=1)  # True for rows where any other column is set; those rows are excluded
    distance = df.loc[~mask, my_col_list].all(axis=1).sum()  # count the remaining rows where all columns of interest are set
    my_result_list.append([item[0], item[1], distance])  # record the occurrence count
    my_result_list.append([item[1], item[0], distance])  # record the reversed pair too, so the result is a square form

result = pd.DataFrame(my_result_list).drop_duplicates().pivot(index=1, columns=0, values=2)  # construct the DataFrame in square form
list_of_columns.remove("all")
result.loc["all", "all"] = df.loc[:, list_of_columns].all(axis=1).sum()  # fill in the all/all occurrences
print(result)
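If the remaining loop is still a concern, the same table can probably be built without any explicit loop by splitting the rows by how many columns are set. This is only a sketch under that assumption, not the method above; `n_active`, `singles`, `pairs_df` and `labels` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

n_active = df.sum(axis=1)              # number of 1s in each row

# Rows with exactly one 1 give the diagonal counts.
singles = df[n_active == 1].sum()

# Rows with exactly two 1s give the pair counts via a dot product;
# no row with extra columns set can leak into these counts.
pairs_df = df[n_active == 2]
mat = pairs_df.T.dot(pairs_df).to_numpy().copy()
np.fill_diagonal(mat, singles.to_numpy())   # overwrite diagonal with single counts

labels = list(df.columns) + ['all']
result = (pd.DataFrame(mat, index=df.columns, columns=df.columns)
            .reindex(index=labels, columns=labels, fill_value=0))

# Rows where every column is set go into the all/all cell.
result.loc['all', 'all'] = int((n_active == df.shape[1]).sum())
print(result)
```

Like the loop version, this ignores rows that have more than two but not all columns set, which can only happen with more than three columns.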
