pandas: remove outliers from a df based on one column

xfyts7mz  posted on 2023-02-28 in Other

My df has a price column that looks like this:

0         2125.000000
1        14469.483703
2        14101.832820
3        20287.619019
4        14469.483703
             ...     
12561     2490.000000
12562     2931.283333
12563     1779.661017
12566     2200.000000
12567     2966.666667

I want to drop every row of df where the price_m2 column holds an outlier. I tried two approaches:
First attempt:

df_w_o = df[np.abs(df.price_m2-df.price_m2.mean())<=(1*df.price_m2.std())]

Second attempt:

df['z_score'] = (df['price_m2'] - df['price_m2'].mean()) / df['price_m2'].std()

df_w_o = df[(df['z_score'] < 1) & (df['z_score'] > -1)]

When I check the minimum and maximum afterwards, I get:

print(df_w_o.price_m2.min())
print(df_w_o.price_m2.max())
0.0
25438.022812290565

Before the removal, I get:

print(df.price_m2.min())
print(df.price_m2.max())
0.0
589933.4267822268

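This doesn't feel right. How can I still get such a wide price range for data that is supposed to describe real estate? In this example 0 is an extremely low value, yet it is still there after removing the outliers.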

vnjpjtjt1#

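Assuming the OP's original data is normally distributed when no outliers are present, the high value in the original dataset (about 589933) is very likely an outlier. Let's create a quantile-quantile plot for a randomly generated dataset: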

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

n = 100

np.random.seed(0)
df = pd.DataFrame({"price": np.random.normal(25000, 3000, n)})
qqplt = sm.qqplot(df["price"], line = 's',fit = True)
plt.show()

However, a single outlier is enough to distort this completely.

outlier = 600000
df.loc[n] = outlier
qqplt = sm.qqplot(df["price"], line = 's',fit = True)
plt.show()

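Whenever we talk about outlier removal and something "doesn't feel right", we really need to step back and look at the data. As @kndahl suggested, using a package that bundles heuristics and methods for data removal is a good idea. Otherwise, a gut feeling should be backed up by your own statistical analysis.
Finally, as to why 0 is still in the final dataset, let's take another look. We'll add 0 to the dataset and run your outlier removal. First we run your outlier removal as-is, then we drop the extreme high value of $600,000 and run your method again.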

## simulated data with 0 also added
df.loc[n+1] = 0
df_w_o = df[np.abs(df.price-df.price.mean())<=(1*df.price.std())] 
print(f"With the high outlier of 600,000 still in the original dataset, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")

## With the high outlier of 600,000 still in the original dataset, the new range is 
## Min:0.0
## Max:31809.263871962823

## now let's remove the high outlier first before doing our outlier removal
df = df.drop(n)

df_w_o = df[np.abs(df.price-df.price.mean())<=(1*df.price.std())] 
print(f"\n\nWith the outlier of 600,000 removed prior to analyzing the data, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")

## With the outlier of 600,000 removed prior to analyzing the data, the new range is
## Min:21241.61391985022
## Max:28690.87204218316

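In this simulated case, the high outlier skews the mean and standard deviation so much that 0 falls within one standard deviation of the mean. Once the data is cleaned before processing, the 0 is removed. Relatedly, with a more complete dataset this question might be a better fit for Cross Validated.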
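
To make the effect concrete, the sketch below recomputes the one-standard-deviation band on simulated data similar to the simulation above, once with and once without the single high value. The helper name `one_sigma_band` is hypothetical and the numbers are simulated, not the OP's data.

```python
import numpy as np
import pandas as pd

def one_sigma_band(s):
    """Return (mean - std, mean + std) for a Series (hypothetical helper)."""
    return s.mean() - s.std(), s.mean() + s.std()

rng = np.random.default_rng(0)
clean = pd.Series(rng.normal(25000, 3000, 100))
# Same data plus one extreme high value and a zero (assumed values, as in the simulation)
dirty = pd.concat([clean, pd.Series([600000.0, 0.0])], ignore_index=True)

lo_dirty, hi_dirty = one_sigma_band(dirty)
lo_clean, hi_clean = one_sigma_band(clean)

print(f"with 600000:    [{lo_dirty:.0f}, {hi_dirty:.0f}]")
print(f"without 600000: [{lo_clean:.0f}, {hi_clean:.0f}]")
# With the extreme value included, the band reaches below 0, so a price of 0 is kept.
print("0 inside band with outlier:", lo_dirty <= 0 <= hi_dirty)
```

Because both the mean and the standard deviation are computed from the contaminated data, a single extreme value widens the band enough to hide other suspicious values.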

nhn9ugyo2#

@SlimPun, this is what I mean:

import pandas as pd
import numpy as np

df=pd.DataFrame(np.random.normal(loc=10,scale=5,size=1000))  ## 1000 items in price column
df.columns=["Price"]

Replace the outliers with NaN:

df[(df.Price>(np.mean(df.Price)+2*np.std(df.Price))) | (df.Price<(np.mean(df.Price)-2*np.std(df.Price)))]=np.nan

Drop the outliers:

df=df.dropna(how='all')
df.shape ## (951,1) - without outliers ** this can change according to your distribution given by numpy
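
The two steps above can also be collapsed into a single boolean filter, which avoids overwriting values with NaN first. This is only a sketch under the same assumptions (simulated normal data, a 2-standard-deviation cutoff). Note that pandas' `Series.std` defaults to `ddof=1` while `np.std` defaults to `ddof=0`, so the cutoffs can differ very slightly from the version above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Price": rng.normal(loc=10, scale=5, size=1000)})

mean, std = df["Price"].mean(), df["Price"].std()
# Keep rows within 2 standard deviations of the mean (the cutoff used above)
within = (df["Price"] - mean).abs() <= 2 * std
df_clean = df[within]

print(df.shape, "->", df_clean.shape)
```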

vuktfyat3#

This cleans outliers by filtering on each numeric column: data points beyond the upper and lower caps are treated as outliers and removed.

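import numpy as np

column_list = ['col1', 'col2']

def outlier_clean(df, column_list):
    """Drop rows that fall outside median +/- 1.5 * IQR in any listed column."""
    for i in column_list:
        q1 = np.quantile(df[i], 0.25)
        q3 = np.quantile(df[i], 0.75)
        median = np.median(df[i])
        IQR = q3 - q1
        # Caps are centered on the median here, not on the quartiles
        upper_cap = median + (1.5 * IQR)
        lower_cap = median - (1.5 * IQR)
        mask1 = df[i] < upper_cap
        mask2 = df[i] > lower_cap

        # Both conditions must hold; combining them with | would keep every row
        df = df[mask1 & mask2]
    return df

df = outlier_clean(df, column_list)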
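
For comparison, the conventional IQR rule (Tukey's fences) measures the 1.5 × IQR distance from the quartiles rather than from the median. The sketch below is a variant of the function above, not the answerer's method; the function name `iqr_filter`, the `k` parameter, and the simulated prices are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def iqr_filter(df, columns, k=1.5):
    """Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for every listed column (hypothetical helper)."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Fences are computed on the unfiltered column, then combined across columns
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Simulated prices plus a zero and one extreme value (assumed data)
rng = np.random.default_rng(0)
prices = np.append(rng.normal(25000, 3000, 100), [0.0, 600000.0])
df = pd.DataFrame({"price": prices})

filtered = iqr_filter(df, ["price"])
print(filtered["price"].min(), filtered["price"].max())
```

Since quartiles barely move when one extreme value is added, the fences stay close to the bulk of the data, so in this simulated case both the 0 and the 600000 fall outside them.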
