Pandas: How do I see the overlap between two lists in a DataFrame?

Posted by imzjd6km on 2023-01-11 in Other


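I have a DataFrame with two columns, each of which contains lists. I want to find the overlap between the lists in these two columns, row by row.
For example: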

import pandas as pd

df = pd.DataFrame({'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]})

        one         two
    0   [a, b, c]   [b, c, d]
    1   [d, e, f]   [f, g, h]
    2   [h, i, j]   [l, m, n]

Ultimately, I would like it to look like this:

        one         two             overlap
    0   [a, b, c]   [b, c, d]       [b, c]
    1   [d, e, f]   [f, g, h]       [f]
    2   [h, i, j]   [l, m, n]       []

pandas

Source: https://stackoverflow.com/questions/75074793/pandas-how-do-i-see-the-overlap-between-two-lists-in-a-dataframe

3 Answers


oxiaedzo1#

There is no efficient vectorized way to do this; the fastest approach is a list comprehension that uses set intersection:

df['overlap'] = [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]

Output:

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []
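One caveat with the set-based comprehension is that a Python set has no guaranteed iteration order, so the items inside each overlap list may come out in a different order than in the original lists. A minimal sketch of an order-preserving variant follows; the helper name `ordered_overlap` is made up for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

def ordered_overlap(a, b):
    """Hypothetical helper: items of `a` that also occur in `b`, in `a`'s order."""
    lookup = set(b)  # set gives fast membership tests
    return [x for x in a if x in lookup]

df['overlap'] = [ordered_overlap(a, b) for a, b in zip(df['one'], df['two'])]
print(df)
```

Unlike the set intersection, this version keeps duplicates that appear in column `one`, which may or may not be what you want.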

Posted 2023-01-11

7nbnzgx92#

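Here is a way that uses applymap to convert the lists to sets and set.intersection to find the overlap (note that in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):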

df.join(df.applymap(set).apply(lambda x: set.intersection(*x),axis=1).map(list).rename('overlap'))
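The one-liner above chains several steps, so it may help to see them separately. The sketch below is one way to unpack it, not the answerer's own code: the intermediate names (`elementwise`, `as_sets`, `common`, `overlap`) are chosen here for illustration, and `sorted` is used instead of `list` only to make the item order deterministic.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

# DataFrame.map exists in pandas 2.1+; fall back to applymap on older versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Step 1: turn every cell (a list) into a set.
as_sets = elementwise(set)

# Step 2: for each row, intersect the sets from all columns.
common = as_sets.apply(lambda row: set.intersection(*row), axis=1)

# Step 3: convert each set back to a list and name the resulting Series.
overlap = common.map(sorted).rename('overlap')

# Step 4: attach the new column; df itself is left unchanged.
result = df.join(overlap)
print(result)
```

Because step 1 converts every column, this approach assumes the frame holds only the list columns you want to compare; select them first (for example `df[['one', 'two']]`) if there are others.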

Posted 2023-01-11

v2g6jxz63#

Using `pandas`

A pandas-based way to do this could look like the following:

f = lambda row: list(set(row['one']).intersection(row['two']))
df['overlap'] = df.apply(f, axis=1)
print(df)

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []

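The apply call goes row by row (axis=1), computes the set.intersection() of the lists in columns one and two, and returns the result as a list.
apply is not the fastest method, but in my opinion it is fairly easy to understand. Since your question does not mention speed as a criterion, that should not be a problem.
Also, you can use either of the following two expressions as the lambda function, since they do the same thing: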

#Option 1:
f = lambda x: list(set(x['one']) & set(x['two']))

#Option 2:
f = lambda x: list(set(x['one']).intersection(x['two']))

Using `NumPy`

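You can also use the NumPy function np.intersect1d together with map over the two Series (this is the same expression that is timed in the benchmarks):

df['overlap'] = pd.Series(map(np.intersect1d, df['one'], df['two']))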
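Note that `np.intersect1d` returns a sorted NumPy array of the unique common values rather than a Python list. If plain lists are wanted in the new column, a small variation (a sketch, not the answerer's code) is to call `.tolist()` on each result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

# np.intersect1d gives a sorted array of unique shared values;
# .tolist() turns each array into an ordinary Python list.
df['overlap'] = [np.intersect1d(a, b).tolist()
                 for a, b in zip(df['one'], df['two'])]
print(df)
```

Building the column from a list comprehension also avoids relying on index alignment between a freshly created Series and `df`.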

Benchmarks

Adding some benchmarks for reference:

%timeit [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]        #list comprehension
%timeit df.apply(lambda x: list(set(x['one']).intersection(x['two'])),1)  #apply 1
%timeit df.apply(lambda x: list(set(x['one']) & set(x['two'])),1)         #apply 2
%timeit pd.Series(map(np.intersect1d, df['one'], df['two']))              #numpy intersect1d

6.99 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
167 µs ± 830 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
166 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
84.1 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
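These timings appear to be for the three-row example frame, where fixed per-call overhead dominates, so the ratios may look different on larger data. The sketch below shows one way to repeat the comparison outside IPython with the standard `timeit` module; the scale factor `n_repeat` and the function names are arbitrary choices made here, and no timings are quoted because they depend on the machine and library versions.

```python
import timeit
import pandas as pd

base = {
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
}
n_repeat = 1000  # hypothetical scale factor: 3 rows -> 3000 rows
big = pd.DataFrame({k: v * n_repeat for k, v in base.items()})

def list_comp(frame):
    # List comprehension with set intersection
    return [list(set(a) & set(b)) for a, b in zip(frame['one'], frame['two'])]

def apply_rows(frame):
    # Row-wise apply with the same set intersection
    return frame.apply(lambda x: list(set(x['one']) & set(x['two'])), axis=1)

for name, func in [('list comprehension', list_comp), ('apply', apply_rows)]:
    # Average over 5 calls; increase `number` for steadier figures.
    seconds = timeit.timeit(lambda: func(big), number=5) / 5
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```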

Posted 2023-01-11
