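I am trying to find outliers in my own way.

How? Plot the histogram and search for isolated edges that have only a few counts and whose neighboring edges have zero counts. They will usually sit at the far ends of the histogram. These are probably outliers: find them and discard them.

What kind of data is it? Time series collected in the field. Occasionally a sensor fails to transmit in time and the data logger stores an odd number instead (sensor readings lie between 50 and 100, but an outlier might be -10000 or 1000). These glitches are momentary, occur a handful of times in a year of data, and make up less than 1% of all samples.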
My code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# The result obtained was:
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])
# Repeat the last count so that vals is as long as edges
# (np.histogram always returns one fewer count than edges).
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace all zero counts with NaN, so that empty bins are never flagged.
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Identify the isolated edges by looking at the number of samples, say, < 50
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# Plot the histogram counts against the bin edges
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()
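Current output:
This is not the correct output. Why? There is only one isolated edge, at the start, at value 0. My code, however, also flags the edges at 43 and 46 as isolated, simply because their counts are low.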
vedf =
edges vals IsolatedEdge?
0 0.000000 38.0 True
1 2.911165 NaN False
2 5.822330 NaN False
3 8.733495 NaN False
4 11.644660 NaN False
5 14.555825 NaN False
6 17.466990 NaN False
7 20.378155 NaN False
8 23.289320 NaN False
9 26.200485 NaN False
10 29.111650 NaN False
11 32.022815 NaN False
12 34.933980 NaN False
13 37.845145 NaN False
14 40.756310 NaN False
15 43.667475 1.0 True
16 46.578640 11.0 True
17 49.489805 126664.0 False
18 52.400970 13853.0 False
19 55.312135 4536.0 False
20 58.223300 4536.0 False
Expected output:
vedf =
edges vals IsolatedEdge?
0 0.000000 38.0 True
1 2.911165 NaN False
2 5.822330 NaN False
3 8.733495 NaN False
4 11.644660 NaN False
5 14.555825 NaN False
6 17.466990 NaN False
7 20.378155 NaN False
8 23.289320 NaN False
9 26.200485 NaN False
10 29.111650 NaN False
11 32.022815 NaN False
12 34.933980 NaN False
13 37.845145 NaN False
14 40.756310 NaN False
15 43.667475 1.0 False
16 46.578640 11.0 False
17 49.489805 126664.0 False
18 52.400970 13853.0 False
19 55.312135 4536.0 False
20 58.223300 4536.0 False
Once I know that a particular edge is isolated, I can drop all the samples that fall on that edge.
2 Answers
Answer #1 (bqucvtff)
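This approach uses a for loop. For each bin, it checks three criteria: (1) the current bin has a count > 0 and < 50, (2) the bin to its left is empty (or there is no left bin), and (3) the bin to its right is also empty (or there is no right bin). If all three hold, the current bin is marked as isolated.
Answer #2 (tquggr8v)
Following @Mark's suggestion, I am posting my complete solution here: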
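A minimal sketch of the three-criteria check, not the poster's actual code. It assumes the raw 20 counts from `np.histogram` (without the repeated last value used in the question); the function name `find_isolated_bins` and the `max_count` parameter are hypothetical.

```python
import numpy as np

def find_isolated_bins(vals, max_count=50):
    """Flag bins with 0 < count < max_count whose neighbours are empty or absent."""
    vals = np.asarray(vals)
    isolated = np.zeros(len(vals), dtype=bool)
    for i, count in enumerate(vals):
        few = 0 < count < max_count                            # criterion (1)
        left_empty = i == 0 or vals[i - 1] == 0                # criterion (2)
        right_empty = i == len(vals) - 1 or vals[i + 1] == 0   # criterion (3)
        isolated[i] = few and left_empty and right_empty
    return isolated

# Counts taken from the question
vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
isolated = find_isolated_bins(vals)
print(np.flatnonzero(isolated))  # [0]
```

Bins 15 and 16 (counts 1 and 11) are no longer flagged, because each has a non-empty neighbour; only bin 0 remains.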
[Figure: data before outlier detection and filtering]
[Figure: data after outlier detection and filtering]
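The filtering step, dropping every sample that falls in a flagged bin, could look roughly like the sketch below. The helper name `drop_samples_in_bins`, the synthetic data, and the hand-built `isolated` mask are assumptions for illustration; the sketch also assumes the series contains no missing values and that every sample lies within the range of `edges`.

```python
import numpy as np
import pandas as pd

def drop_samples_in_bins(series, edges, isolated):
    """Return the series without samples that fall in bins flagged as isolated."""
    edges = np.asarray(edges)
    isolated = np.asarray(isolated, dtype=bool)
    # Map each sample to its bin index; like np.histogram, the last bin
    # also includes the right-most edge.
    idx = np.searchsorted(edges, series.to_numpy(), side='right') - 1
    idx = np.clip(idx, 0, len(edges) - 2)
    return series[~isolated[idx]]

# Synthetic example: 20 bins of width 3, with the first bin flagged by hand
edges = np.linspace(0, 60, 21)
isolated = np.zeros(20, dtype=bool)
isolated[0] = True
data = pd.Series([1.0, 49.5, 50.2, 51.0, 57.9, 60.0])
cleaned = drop_samples_in_bins(data, edges, isolated)
print(cleaned.tolist())  # [49.5, 50.2, 51.0, 57.9, 60.0]
```

In practice the mask would come from whichever isolated-bin check is used, and `edges` from the same `np.histogram` call that produced the counts.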