Comparing 2 columns group by group in Pandas or Python

vs91vp4v  posted on 2023-02-02  in  Python

I have a dataset here, and I'm not sure how to check whether the groups have similar values.

type   value
a       1
a       2
a       3
a       4

b       2
b       3
b       4
b       5

c       1
c       3
c       4


d       2
d       3
d       4

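I want to know which rows are similar, meaning that all the values of one type also appear in another type. For example, type d has the values 2, 3, 4 and type a has the values 1, 2, 3, 4, so these are "similar", or can be considered the same, and I would like the output to tell me that d is similar to a.
The expected output should look like this: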

type   value            similarity
a       1         A is similar to B and D
a       2
a       3
a       4

b       2         b is similar to a and d
b       3
b       4
b       5

c       1         c is similar to a 
c       3
c       4


d       2         d is similar to a and b
d       3
d       4

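I don't know whether this can be done in Python or pandas, but I would really appreciate some guidance, because I am lost and don't know where to start.
The output doesn't have to match the example above; it could simply be another CSV that tells me which types are similar.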

czq61nw1  #1

I would use set operations.
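
The snippets in this thread assume a DataFrame named `df` with the columns `type` and `value`. To make them reproducible, here is a minimal sketch that rebuilds the question's data; the `data` dictionary and the name `sets_by_type` are my own assumptions for illustration, not part of the original post.

```python
import pandas as pd

# Rebuild the sample data from the question (types a-d with their values).
data = {
    "a": [1, 2, 3, 4],
    "b": [2, 3, 4, 5],
    "c": [1, 3, 4],
    "d": [2, 3, 4],
}
df = pd.DataFrame(
    [(t, v) for t, values in data.items() for v in values],
    columns=["type", "value"],
)

# One set of values per type; this is what the set comparisons operate on.
sets_by_type = df.groupby("type")["value"].agg(set)
```

With the data in this shape, comparing two types comes down to ordinary set operations such as `&` (intersection) or `issubset`.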

Assuming similarity means having at least N items in common:
from itertools import combinations

import pandas as pd

# define minimum number of common items
N = 3

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# generate all combinations of sets
# and check if the intersection has at least N items
out = (pd.Series([len(a&b)>=N for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

# concat and add the reversed combinations (a/b -> b/a)
# we could have used a product in the first part but this
# would have required performing the computations twice
similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN

Assuming similarity means one set is a subset of the other:
from itertools import combinations

import pandas as pd

s = df.groupby('type')['value'].agg(set)

out = (pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
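
The subset check above is symmetric (a is reported as similar to d and d to a). If only the direction from the question matters, that is, "all values of one type appear in another type", a plain-Python sketch could look like the following. The names `groups` and `contained_in` are hypothetical, and the sets are hard-coded from the question's data.

```python
# Hypothetical plain-Python version of a one-directional subset check.
groups = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 4, 5},
    "c": {1, 3, 4},
    "d": {2, 3, 4},
}

# contained_in[x] lists every other type whose values include all of x's values.
# `<=` on sets is the subset test (same as set.issubset).
contained_in = {
    x: [y for y in groups if y != x and groups[x] <= groups[y]]
    for x in groups
}

for name, others in contained_in.items():
    if others:
        print(f"{name} is contained in {', '.join(others)}")
```

Here only c and d are reported, because a and b each hold a value that no other type covers completely.
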

5fjcxozz  #2

You can use:

# Group all rows and transform as set
df1 = df.groupby('type', as_index=False)['value'].agg(set)

# Get all combinations (how='cross' requires pandas >= 1.2)
df1 = df1.merge(df1, how='cross').query('type_x != type_y')

# Compute the intersection between sets
df1['similarity'] = [row.value_x.intersection(row.value_y) 
                         for row in df1[['value_x', 'value_y']].itertuples()]

# Keep rows with at least 3 similarities then export report
sim = (df1.loc[df1['similarity'].str.len() >= 3].groupby('type_x')['type_y']
          .agg(', '.join).rename('similarity').rename_axis(index='type')
          .reset_index())

Output:

>>> sim
  type similarity
0    a    b, c, d
1    b       a, d
2    c          a
3    d       a, b
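
Since the question mentions that a separate CSV would also be acceptable, a report like `sim` could be written out with `DataFrame.to_csv`. This is only a sketch: `sim` is rebuilt by hand here so the snippet runs on its own, and the filename mentioned in the comment is a made-up example.

```python
import io

import pandas as pd

# Same shape as the `sim` report above, rebuilt here so the snippet runs alone.
sim = pd.DataFrame({
    "type": ["a", "b", "c", "d"],
    "similarity": ["b, c, d", "a, d", "a", "a, b"],
})

# Write to an in-memory buffer; pass a path such as "similar_types.csv"
# (a hypothetical filename) instead of the buffer to write a real file.
buffer = io.StringIO()
sim.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain commas, such as `b, c, d`, are quoted automatically, so the file reads back into two columns.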
