Pandas drop_duplicates with a tolerance for duplicate values

cqoc49vn · asked 2023-06-04

I have two pandas DataFrames of coordinates in XYZ format. One of them contains points that should be masked out of the other, but the values are slightly offset from each other, so a direct match with drop_duplicates is impossible. My idea was to round the values to the nearest significant figure, but that does not always work either: if two values round to different numbers, they will not match and will not be dropped. For example, if one point is at x = 149 and another at x = 151, rounding to the nearest hundred gives different values. My code looks like this:

import pandas as pd
import numpy as np

df_test_1 = pd.DataFrame(np.array([[123, 449, 756.102], [406, 523, 543.089], [140, 856, 657.24], [151, 242, 124.42]]), columns=['x', 'y', 'z'])

df_test_2 = pd.DataFrame(np.array([[123, 451, 756.099], [404, 521, 543.090], [139, 859, 657.23], [633, 176, 875.76]]), columns=['x', 'y', 'z'])

df_test_3 = pd.concat([df_test_1, df_test_2])

# Round x and y to the nearest hundred and z to one decimal place,
# then drop every row whose rounded coordinates appear more than once
df_test_3['xr'] = df_test_3.x.round(-2)
df_test_3['yr'] = df_test_3.y.round(-2)
df_test_3['zr'] = df_test_3.z.round(1)

df_test_3 = df_test_3.drop_duplicates(subset=['xr', 'yr', 'zr'], keep=False)

I want rows treated as duplicates when the 'xr' and 'yr' columns agree within ±100 and 'zr' agrees within ±0.1. For example, if two coordinates round to (100, 300, 756.2) and (200, 400, 756.1), they should be considered duplicates and removed. Any ideas are appreciated, thanks!
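
To see the boundary failure concretely, here is a minimal illustration of the rounding problem described above (two values only 2 apart land in different hundreds buckets):

import pandas as pd

s = pd.Series([149.0, 151.0])
print(s.round(-2))
# 0    100.0
# 1    200.0
# dtype: float64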

idv4meu8 · answer 1

You can use numpy broadcasting:

# Convert to numpy
vals1 = df_test_1.values
vals2 = df_test_2.values

# Remove from df_test_1: keep rows with no counterpart in df_test_2
# within (±100, ±100, ±0.1) on (x, y, z)
arr1 = np.abs(vals1[:, None] - vals2)
msk1 = ~np.any(np.all(arr1 < [100, 100, 0.1], axis=2), axis=1)

# Remove from df_test_2: the same test in the other direction
arr2 = np.abs(vals2[:, None] - vals1)
msk2 = ~np.any(np.all(arr2 < [100, 100, 0.1], axis=2), axis=1)

out = pd.concat([df_test_1[msk1], df_test_2[msk2]], ignore_index=True)

Output:

>>> out
       x      y       z
0  151.0  242.0  124.42
1  633.0  176.0  875.76
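
For reuse, the same broadcasting trick can be wrapped in a small helper. A minimal sketch (the function name drop_close_rows and the tol parameter are illustrative, not part of the original answer):

import numpy as np
import pandas as pd

def drop_close_rows(df_a, df_b, tol=(100, 100, 0.1)):
    # Pairwise |difference| cube: entry [i, j] compares df_a row i with df_b row j
    diff = np.abs(df_a.values[:, None] - df_b.values)
    # Keep rows of df_a that have no df_b row within tol on every column
    keep = ~np.any(np.all(diff < tol, axis=2), axis=1)
    return df_a[keep]

out = pd.concat([drop_close_rows(df_test_1, df_test_2),
                 drop_close_rows(df_test_2, df_test_1)], ignore_index=True)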

Comment from @James:
This removes left-vs-right and right-vs-left duplicates, but not duplicates within left-vs-left or right-vs-right.
In that case:

df_test_3 = pd.concat([df_test_1, df_test_2])

arr = df_test_3.values
# Pairwise absolute differences between every row and every other row (including itself)
msk = np.abs(arr - arr[:, None]) < [100, 100, 0.1]
# Keep rows whose only within-tolerance match is themselves
out = df_test_3[np.sum(np.all(msk, axis=2), axis=1) == 1]
print(out)

# Output
       x      y       z
3  151.0  242.0  124.42
3  633.0  176.0  875.76
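
The == 1 test works because the broadcast also compares every row with itself, so each row always has at least one within-tolerance match; a count of exactly one means no other row is close. For the example data above, the match counts come out as:

counts = np.sum(np.all(msk, axis=2), axis=1)
print(counts)
# [2 2 2 1 2 2 2 1]  -> the rows at positions 3 and 7 match only themselves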

k97glaaz · answer 2

To de-duplicate a DataFrame within a threshold, you need to compute the difference between every pair of values in each column and check whether those differences fall within the threshold. This is a generic solution for any DataFrame.

from itertools import combinations

import pandas as pd

df = df_test_3.reset_index(drop=True)  # df_test_3 = pd.concat([df_test_1, df_test_2])

# using combinations ensures a lower-triangular matrix of comparison indices
mi = pd.MultiIndex.from_tuples(combinations(df.index, 2))
ix_left = mi.get_level_values(0)
ix_right = mi.get_level_values(1)
df_cross_diff = df.loc[ix_left].set_index(mi) - df.loc[ix_right].set_index(mi)

# this mask marks the comparisons that are considered duplicates
drop_mask = pd.concat([
    df_cross_diff.x.abs().le(100),
    df_cross_diff.y.abs().le(100),
    df_cross_diff.z.abs().le(0.1),
], axis=1).all(axis=1)

# this extracts the pairs of indices that are considered duplicates
dupes_mi = drop_mask.loc[drop_mask].index
dupes_left = dupes_mi.get_level_values(0)
dupes_right = dupes_mi.get_level_values(1)

# and finally we remove the duplicates (dropping the right-hand member of each pair)
df_deduped = df.drop(dupes_right)
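
Note that, unlike the keep=False behavior in the first answer, this drops only the right-hand member of each duplicate pair, so the first occurrence survives. With the example frames above (assuming df_test_3 is the plain concat of df_test_1 and df_test_2), the expected result is:

print(df_deduped)
#        x      y       z
# 0  123.0  449.0  756.102
# 1  406.0  523.0  543.089
# 2  140.0  856.0  657.240
# 3  151.0  242.0  124.420
# 7  633.0  176.0  875.760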
