pandas: is there a more efficient approach than iterrows() for my code?

4c8rllxm  posted on 2022-11-27  in  Other

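This code takes a very long time to run because I have 1 million rows and 43 columns. The idea is to find rows that have the same values in a given set of columns but opposite values in the "CA" column, and to drop each such pair, because the two rows are considered a reversal of each other.
For example, I have a DataFrame df: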
| Column A | Column B | Column C | Column D |
| --- | --- | --- | --- |
| 'Brown' | 'Bottle' | 1234555 | 100 |
| 'Yellow' | 'Cup' | 1234555 | 80 |
| 'Red' | 'Bottle' | 1234555 | -100 |
| 'Red' | 'Bottle' | 1234555 | -100 |
| 'Brown' | 'Bottle' | 1234533 | 100 |
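If I decide to match on columns B and C, the program should remove the first and third rows, because they have the same values in columns B and C and opposite values in column D (one positive, one negative). Those two rows are treated as a reversal pair, so only that pair is removed; the fourth row has no remaining partner and stays.
Desired output: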
| Column A | Column B | Column C | Column D |
| --- | --- | --- | --- |
| 'Yellow' | 'Cup' | 1234555 | 80 |
| 'Red' | 'Bottle' | 1234555 | -100 |
| 'Brown' | 'Bottle' | 1234533 | 100 |
This is the code I currently have, but it is very inefficient:

df_dupes = data[data.duplicated(subset = criteria_, keep=False)]
df_dupes_list = np.array(df_dupes.to_numpy().tolist())

df_1 = df_dupes_list[:,[0,1,7,9,8,23,35]]

df_2 = df_1.tolist()

for i, row in df_dupes.iterrows():
    if row.CA < 0 and [row.BA, row.OA, row.BN, row.DN, row.DT, row.D, -row.CA] in df_2:
        try:
            c = np.where(
                (data['BA'] == row.BA) & (data['OA'] == row.OA) & (data['BN'] == row.BN)
                & (data['DT'] == row.DT) & (data['DN'] == row.DN) & (data['D'] == row.D)
                & (data['CA'] == -row.CA)
            )[0][0]

            data.drop(labels=[i,data.index.values[c]], axis=0, inplace=True)
        except:
            pass
ahy6op9u  #1

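My solution is this: add a lookup structure so that opposite pairs can be found quickly, and build a boolean mask for filtering instead of calling drop() inside the loop.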

import pandas as pd

data = pd.DataFrame(
    [
        ["Brown", "Bottle", 1234555, 100],
        ["yellow", "Cup", 1234555, 80],
        ["Red", "Bottle", 1234555, -100],
        ["Red", "Bottle", 1234555, -100],
        ["Brown", "Bottle", 1234533, 100],
    ],
    columns=["A", "B", "C", "D"],
)

# "lookup table"
seen = {} # {(key1, key2): (index, value)}
# which rows to keep?
mask = pd.Series(True, index=data.index)

# itertuples is faster than iterrows
for row in data.itertuples():
    # create a lookup key
    key = (row.B, row.C)
    if key not in seen:
        # store Index and Value in the "lookup table"
        # if we haven't seen this key before
        seen[key] = (row.Index, row.D)
    else:
        prev_index, prev_value = seen[key]
        # if the stored value is the opposite of the current one
        if prev_value == -row.D:
            # we don't want to keep both rows
            mask.loc[prev_index] = False
            mask.loc[row.Index] = False
            # and remove the key from the lookup table
            del seen[key]
        # else:
            # undefined case:
            # the key exists, but the value is not
            # the opposite of the previous one

# remove "collapsed" rows from the data
result = data[mask]
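
If the Python-level loop is still too slow on a million rows, the same pairing idea can probably be expressed with groupby alone. The following is only a sketch under a few assumptions: `drop_reversal_pairs` and the helper column names `_abs`, `_sign` and `_n` are made up for illustration, the key and value columns contain no missing values, and zero values are never treated as reversals. It pairs the n-th positive row with the n-th negative row of the same key and absolute value, so the case left undefined above (same key, but a value that is not the opposite) simply stays unmatched.

```python
import numpy as np
import pandas as pd


def drop_reversal_pairs(df, key_cols, value_col):
    """Drop rows that cancel out pairwise: same key, opposite value.

    Hypothetical helper; assumes no NaN in key_cols or value_col.
    """
    work = df[key_cols].copy()
    work["_abs"] = df[value_col].abs().to_numpy()
    work["_sign"] = np.sign(df[value_col].to_numpy())
    # Number repeated (key, value) rows 0, 1, 2, ... in order of appearance.
    work["_n"] = work.groupby(key_cols + ["_abs", "_sign"]).cumcount()
    # A slot (key, |value|, n) holds at most one positive and one negative
    # row. If it holds both signs, those two rows form a reversal pair.
    signs_in_slot = work.groupby(key_cols + ["_abs", "_n"])["_sign"].transform("nunique")
    paired = (signs_in_slot == 2).to_numpy()
    return df[~paired]


data = pd.DataFrame(
    [
        ["Brown", "Bottle", 1234555, 100],
        ["yellow", "Cup", 1234555, 80],
        ["Red", "Bottle", 1234555, -100],
        ["Red", "Bottle", 1234555, -100],
        ["Brown", "Bottle", 1234533, 100],
    ],
    columns=["A", "B", "C", "D"],
)

# Rows 0 and 2 cancel each other; rows 1, 3 and 4 remain.
result = drop_reversal_pairs(data, key_cols=["B", "C"], value_col="D")
```

For the columns in the question, the call would presumably look like `drop_reversal_pairs(data, ["BA", "OA", "BN", "DN", "DT", "D"], "CA")`. Whether this beats the dictionary loop on the real data is worth measuring rather than assuming.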
