How do I delete columns in a Pandas DataFrame based on a condition?

Posted by rbpvctlc on 2023-03-16 in Other


I have a Pandas DataFrame with a lot of NaN values.
How can I drop the columns where number_of_na_values > 2000?
I tried it like this:

toRemove = set()
naNumbersPerColumn = df.isnull().sum()
for i in naNumbersPerColumn.index:
    if naNumbersPerColumn[i] > 2000:
        toRemove.add(i)
for i in toRemove:
    df.drop(i, axis=1, inplace=True)

Is there a more elegant way to do this?

pandas

Source: https://stackoverflow.com/questions/31614804/how-to-delete-a-column-in-pandas-dataframe-based-on-a-condition

3 Answers


fnvucqvd1#

Here is another approach, which keeps only the columns whose NaN count is less than or equal to a specified value:

max_number_of_nas = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]
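
To make the mask approach concrete, here is a minimal self-contained sketch. The toy frame, its column names, and the threshold of 2 are made up for illustration; only `isnull`, `sum`, and `loc` come from the answer itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "a" has 0 NaN, "b" has 2, "c" has 3.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

max_number_of_nas = 2  # illustrative threshold, not from the question

# Per-column NaN counts, returned as a Series indexed by column name.
na_counts = df.isnull().sum(axis=0)

# Boolean mask: True for the columns we want to keep.
keep = na_counts <= max_number_of_nas

# Select all rows and only the columns where the mask is True.
filtered = df.loc[:, keep]
print(list(filtered.columns))
```

Because the mask is an ordinary boolean Series, it can be inspected or combined with other conditions (for example with `&` or `|`) before it is passed to `loc`.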

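In the cases I tested (shown below), this appears to be slightly faster than the drop-columns method suggested by Jianxun Li. I should note, however, that if you simply avoid the apply method (e.g. df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)), the performance becomes much more similar.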

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010

%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1.1 ms ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 1.3 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 2.11 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Performance usually varies with the size of the data, so remember to check the case that most closely matches your own data.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5

%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 755 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 777 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1.71 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
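
The `%timeit` lines above only work in IPython or Jupyter. If you want to repeat the comparison from a plain script, something like the following sketch should do, using the standard-library `timeit` module. The helper names `with_loc` and `with_drop`, and the repeat counts, are my own choices; the numbers you get will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000, 5), columns=list("ABCDE"))
df[df < 0] = np.nan
max_number_of_nans = 5010


def with_loc():
    # Keep columns via a boolean mask.
    return df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nans]


def with_drop():
    # Drop the complementary set of columns by label.
    return df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)


results = {}
for name, func in {"loc": with_loc, "drop": with_drop}.items():
    # Best of 3 repeats, 50 calls each, reported per call.
    best = min(timeit.repeat(func, number=50, repeat=3)) / 50
    results[name] = best
    print(f"{name}: {best * 1e6:.0f} us per call")
```

Taking the minimum over the repeats is the usual way to reduce noise from other processes; it is not the same statistic as the mean ± std. dev. that `%timeit` prints.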


cuxqih212#

The same logic, just with everything put on one line.

import pandas as pd
import numpy as np

# artificial data
# ====================================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDE'))
df[df < 0] = np.nan

        A       B       C       D       E
0  1.7641  0.4002  0.9787  2.2409  1.8676
1     NaN  0.9501     NaN     NaN  0.4106
2  0.1440  1.4543  0.7610  0.1217  0.4439
3  0.3337  1.4941     NaN  0.3131     NaN
4     NaN  0.6536  0.8644     NaN  2.2698
5     NaN  0.0458     NaN  1.5328  1.4694
6  0.1549  0.3782     NaN     NaN     NaN
7  0.1563  1.2303  1.2024     NaN     NaN
8     NaN     NaN     NaN  1.9508     NaN
9     NaN     NaN  0.7775     NaN     NaN

# processing: drop columns with no. of NaN > 3
# ====================================
df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > 3)], axis=1)

Out[183]:
        B
0  0.4002
1  0.9501
2  1.4543
3  1.4941
4  0.6536
5  0.0458
6  0.3782
7  1.2303
8     NaN
9     NaN
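
As a possible alternative that is not part of the answers here, `DataFrame.dropna` accepts a `thresh` argument giving the minimum number of non-NA values a column needs in order to be kept. A sketch, assuming the same artificial data as above; `max_nans` and `min_non_na` are names I made up:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list("ABCDE"))
df[df < 0] = np.nan

max_nans = 3                      # drop columns with more than this many NaN
min_non_na = len(df) - max_nans   # equivalent minimum count of non-NA values

# Keep only the columns that have at least min_non_na non-NA entries.
result = df.dropna(axis=1, thresh=min_non_na)

# Same filter written with an explicit mask, for comparison.
expected = df.loc[:, df.isnull().sum() <= max_nans]
print(result.equals(expected))
```

Note that `thresh` counts non-NA values, so the "more than N NaN" condition has to be converted using the number of rows.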


sbtkgmzw3#

For me, it seems I don't need set_index:

df = (df.T
     .loc[lambda x: ((x['label'] > .05) | (x['label'] < -.05))]
     .T.reset_index().set_index('index'))
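
The snippet above appears to assume a frame that has a row labelled 'label' and keeps the columns whose value in that row lies outside [-0.05, 0.05]. Under that assumption, a sketch of the same filter without transposing could look like this; the toy data and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical frame with a row named "label" that drives the selection.
df = pd.DataFrame(
    {"x": [0.10, 1.0], "y": [0.01, 2.0], "z": [-0.20, 3.0]},
    index=["label", "other"],
)

# Values of the "label" row, indexed by column name.
label = df.loc["label"]

# Keep the columns whose label value is outside [-0.05, 0.05].
kept = df.loc[:, (label > 0.05) | (label < -0.05)]

# Transpose-based version, without the reset_index/set_index round trip.
transposed = df.T.loc[lambda t: (t["label"] > 0.05) | (t["label"] < -0.05)].T

print(list(kept.columns))
```

Avoiding the double transpose can matter for frames with mixed dtypes, since transposing such a frame generally converts the columns to object dtype.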

