pandas: remove outliers from a df based on one column

Posted by xfyts7mz on 2023-02-28 in Other


My df has a price column that looks like this:

0         2125.000000
1        14469.483703
2        14101.832820
3        20287.619019
4        14469.483703
             ...     
12561     2490.000000
12562     2931.283333
12563     1779.661017
12566     2200.000000
12567     2966.666667

I want to remove every row of df whose price_m2 value is an outlier. I tried two approaches.
First:

df_w_o = df[np.abs(df.price_m2-df.price_m2.mean())<=(1*df.price_m2.std())]

Second:

df['z_score'] = (df['price_m2'] - df['price_m2'].mean()) / df['price_m2'].std()

df_w_o = df[(df['z_score'] < 1) & (df['z_score'] > -1)]

Checking the minimum and maximum afterwards, I get:

print(df_w_o.price_m2.min())
print(df_w_o.price_m2.max())
0.0
25438.022812290565

Before the removal, I get:

print(df.price_m2.min())
print(df.price_m2.max())
0.0
589933.4267822268

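This doesn't feel right. How can I end up with such a wide price range for data that is supposed to describe real estate? In this example, 0 is the extreme low value, and it is still there after removing outliers.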


Source: https://stackoverflow.com/questions/72239562/remove-outliers-from-df-based-on-one-column

3 answers


vnjpjtjt1#

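Let's assume the OP's underlying data is normally distributed and otherwise free of outliers. The high value in the original dataset (about 589933) is then most likely an outlier. Let's create a quantile-quantile plot for a randomly generated dataset: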

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

n = 100

np.random.seed(0)
df = pd.DataFrame({"price": np.random.normal(25000, 3000, n)})
qqplt = sm.qqplot(df["price"], line = 's',fit = True)
plt.show()

However, a single outlier is enough to distort this completely.

outlier = 600000
df.loc[n] = outlier
qqplt = sm.qqplot(df["price"], line = 's',fit = True)
plt.show()

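Whenever outlier removal "doesn't feel right", we really need to step back and look at the data. As @kndahl suggested, using a package that bundles heuristics and methods for removing data is a good option. Otherwise, intuition should be backed up by your own statistical analysis.
Finally, as to why 0 is still in the final dataset, let's take another look. We will add 0 to the dataset and run your outlier removal. First we run your outlier removal as-is; then we remove the extremely high $600,000 value first and run your method again.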

## simulated data with 0 also added
df.loc[n+1] = 0
df_w_o = df[np.abs(df.price-df.price.mean())<=(1*df.price.std())] 
print(f"With the high outlier of 600,000 still in the original dataset, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")

## With the high outlier of 600,000 still in the original dataset, the new range is 
## Min:0.0
## Max:31809.263871962823

## now let's remove the high outlier first before doing our outlier removal
df = df.drop(n)

df_w_o = df[np.abs(df.price-df.price.mean())<=(1*df.price.std())] 
print(f"\n\nWith the outlier of 600,000 removed prior to analyzing the data, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")

## With the outlier of 600,000 removed prior to analyzing the data, the new range is
## Min:21241.61391985022
## Max:28690.87204218316

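In this simulated case, the high outlier skews the mean and standard deviation so much that 0 falls within one standard deviation of the mean. Once the data is cleaned before processing, the 0 is removed as well. Relatedly, this question may be a better fit for Cross Validated if a more complete dataset is provided.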
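To make the arithmetic concrete, here is a minimal sketch that computes the one-standard-deviation band with and without the high value. It assumes a simulated setup similar to the one above (normally distributed prices plus one 600,000 value and one 0); the helper name one_sigma_band and the variable names are illustrative and do not come from the OP's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(25000, 3000, 100))

# Append one extreme high value and one zero, mirroring the simulation above.
contaminated = pd.concat([prices, pd.Series([600000.0, 0.0])], ignore_index=True)

def one_sigma_band(s):
    """Return the (lower, upper) bounds of mean +/- 1 standard deviation."""
    return s.mean() - s.std(), s.mean() + s.std()

low_with, high_with = one_sigma_band(contaminated)

# Drop only the high value, keep the zero, and recompute the band.
without_high = contaminated[contaminated < 600000]
low_without, high_without = one_sigma_band(without_high)

print(f"band with 600000:    [{low_with:.0f}, {high_with:.0f}]")
print(f"band without 600000: [{low_without:.0f}, {high_without:.0f}]")
```

With the 600,000 value present, the lower bound of the band drops below zero, so a price of 0 passes the filter. Once that value is gone, the lower bound moves well above 0 and the zero is rejected.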


nhn9ugyo2#

@SlimPun, this is what I mean:

import pandas as pd
import numpy as np

df=pd.DataFrame(np.random.normal(loc=10,scale=5,size=1000))  ## 1000 items in price column
df.columns=["Price"]

Replace the outliers with NaN:

df[(df.Price>(np.mean(df.Price)+2*np.std(df.Price))) | (df.Price<(np.mean(df.Price)-2*np.std(df.Price)))]=np.nan

Drop the outliers:

df=df.dropna(how='all')
df.shape ## (951,1) - without outliers ** this can change according to your distribution given by numpy
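The same two-standard-deviation rule can also be written as a single boolean mask, without the intermediate NaN step. This is a minimal sketch, assuming a DataFrame with one numeric column named Price; the helper name drop_sigma_outliers and its n_sigma parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_sigma_outliers(df, column, n_sigma=2.0):
    """Keep rows whose value lies within n_sigma standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # pandas defaults to the sample standard deviation (ddof=1)
    keep = (df[column] - mean).abs() <= n_sigma * std
    return df[keep]

rng = np.random.default_rng(0)
df = pd.DataFrame({"Price": rng.normal(loc=10, scale=5, size=1000)})

filtered = drop_sigma_outliers(df, "Price")
print(len(df), len(filtered))
```

Because the function returns a filtered copy, the original df is left untouched and the row counts before and after can be compared directly. Rows where the column is NaN fail the comparison and are dropped too.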


vuktfyat3#

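This cleans outliers by filtering on each listed numeric column in turn: data points that fall beyond the upper or lower cap are treated as outliers and removed.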

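import numpy as np

column_list = ['col1', 'col2']

def outlier_clean(df, column_list):
    """Drop rows whose value in any listed column falls outside median +/- 1.5 * IQR."""
    for col in column_list:
        q1 = np.quantile(df[col], 0.25)
        q3 = np.quantile(df[col], 0.75)
        median = np.median(df[col])
        iqr = q3 - q1
        # Caps are centred on the median, as in the original answer.
        upper_cap = median + (1.5 * iqr)
        lower_cap = median - (1.5 * iqr)
        below_upper = df[col] < upper_cap
        above_lower = df[col] > lower_cap
        # Both conditions must hold; combining them with | would keep every row.
        df = df[below_upper & above_lower]
    return df

df = outlier_clean(df, column_list)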
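For comparison, the conventional IQR rule (Tukey's fences) places the bounds at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rather than around the median. The following is a minimal sketch for a single column, applied to a simulated scenario like the one in the first answer; the function name iqr_filter and the factor parameter k are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.normal(25000, 3000, 100)})

# Add one extreme high value and one zero to the simulated prices.
df.loc[len(df)] = 600000.0
df.loc[len(df)] = 0.0

cleaned = iqr_filter(df, "price")
print(cleaned["price"].min(), cleaned["price"].max())
```

Quartiles barely move when a single extreme value is added, so in this setup both the 600,000 value and the 0 are removed in one pass, without having to clean the high value first.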

