numpy Python: detecting isolated edges in a histogram to find outliers in time series data

Posted by utugiqy6 on 2023-10-19 in Python

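I am trying to find outliers in my own way. How? Plot a histogram and search for isolated edges: edges that hold only a few counts and whose neighbouring edges have zero counts (or that have no neighbour at all). They usually sit at the far ends of the histogram. These are probably outliers, so find them and drop them. What kind of data is it? Time series collected in the field. Occasionally a sensor fails to transmit its data in time and the data logger stores odd numbers instead (the sensor data lies between 50 and 100, but an outlier may be -10000 or 1000). These values are momentary, occur only a few times in a year of data, and make up less than 1% of the total samples.
My code: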

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# obtained result is
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])

# Repeat the last count so that vals has the same length as edges
# (np.histogram returns one more edge than counts)
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace zero counts with NaN so that empty bins are never flagged
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Flag edges with few samples, say < 50, as isolated
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# Plot the counts against the bin edges
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()

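Current output:
This is not the correct output. Why? There is only one isolated edge, at the very start (value 0). My code, however, also flags the edges at 43 and 46 as isolated, simply because their counts are low.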

vedf = 

    edges       vals      IsolatedEdge?
0   0.000000    38.0      True
1   2.911165    NaN       False
2   5.822330    NaN       False
3   8.733495    NaN       False
4   11.644660   NaN       False
5   14.555825   NaN       False
6   17.466990   NaN       False
7   20.378155   NaN       False
8   23.289320   NaN       False
9   26.200485   NaN       False
10  29.111650   NaN       False
11  32.022815   NaN       False
12  34.933980   NaN       False
13  37.845145   NaN       False
14  40.756310   NaN       False
15  43.667475   1.0       True
16  46.578640   11.0      True
17  49.489805   126664.0  False
18  52.400970   13853.0   False
19  55.312135   4536.0    False
20  58.223300   4536.0    False

Expected output:

vedf = 

    edges       vals      IsolatedEdge?
0   0.000000    38.0      True
1   2.911165    NaN       False
2   5.822330    NaN       False
3   8.733495    NaN       False
4   11.644660   NaN       False
5   14.555825   NaN       False
6   17.466990   NaN       False
7   20.378155   NaN       False
8   23.289320   NaN       False
9   26.200485   NaN       False
10  29.111650   NaN       False
11  32.022815   NaN       False
12  34.933980   NaN       False
13  37.845145   NaN       False
14  40.756310   NaN       False
15  43.667475   1.0       False
16  46.578640   11.0      False
17  49.489805   126664.0  False
18  52.400970   13853.0   False
19  55.312135   4536.0    False
20  58.223300   4536.0    False

Once I know that a particular edge is isolated, I can drop all the samples that fall in that bin.


bqucvtff1#

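This approach uses a for loop. For each bin it checks three criteria: (1) the current bin has a count > 0 and < 50, (2) the bin to its left is empty (or there is no left bin), and (3) the bin to its right is empty (or there is no right bin). If all three hold, the current bin is flagged as isolated.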

import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# obtained result is
vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]

edges = [0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
         20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
         40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233]

# One stem per bin, drawn at the bin's left edge
plt.stem(edges[:-1], vals)
plt.show()

is_isolated = []
for bin_idx in range(len(vals)):
    has_left_bin = bin_idx > 0
    has_right_bin = bin_idx < len(vals) - 1

    # A missing neighbour (first or last bin) counts as empty
    left_empty = not has_left_bin or vals[bin_idx - 1] == 0
    right_empty = not has_right_bin or vals[bin_idx + 1] == 0

    # Isolated: a few samples in the bin and nothing on either side
    is_isolated.append(0 < vals[bin_idx] < 50 and left_empty and right_empty)

vdef = pd.DataFrame({'vals': vals, 'edges': edges[:-1], 'is_isolated': is_isolated})
print(vdef)
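The same three criteria can also be checked without an explicit loop. The following is a minimal sketch, not part of the original answer: the function name `find_isolated_bins` and the `max_count` parameter are made up for illustration. It pads the counts with a zero on each side so that the first and last bins see an empty neighbour.

```python
import numpy as np

def find_isolated_bins(vals, max_count=50):
    """Boolean mask of bins with 0 < count < max_count and empty neighbours."""
    vals = np.asarray(vals)
    # Pad with one zero on each side: a missing neighbour counts as empty
    padded = np.pad(vals, 1, constant_values=0)
    left_empty = padded[:-2] == 0   # padded[i] is the left neighbour of vals[i]
    right_empty = padded[2:] == 0   # padded[i + 2] is the right neighbour of vals[i]
    sparse = (vals > 0) & (vals < max_count)
    return sparse & left_empty & right_empty

vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
mask = find_isolated_bins(vals)
print(np.flatnonzero(mask))  # indices of the isolated bins
```

With the counts from the question, only bin 0 is flagged. Bins 15 and 16 have low counts but each has a non-empty neighbour, so they are kept.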

tquggr8v2#

Following @Mark's suggestion, I am posting my complete solution here:

# df1 (the DataFrame being cleaned) and col (the column name) are assumed to be defined elsewhere
# Index of normal edges: non-NaN rows that are not surrounded by NaN on both sides
normal_edge_idx = vedf[~vedf['vals'].isna() & ~(vedf['vals'].shift(1).isna() & vedf['vals'].shift(-1).isna())].index
# Index of outlier edges: non-NaN rows that are not normal edges
out_edge_idx = vedf[(~vedf.index.isin(normal_edge_idx)) & (~vedf['vals'].isna())].index
# For each outlier edge, drop the samples that lie between that edge and the next one
for iso_idx in out_edge_idx:
    df1 = df1[~((df1[col] >= vedf['edges'].iloc[iso_idx]) & (df1[col] <= vedf['edges'].iloc[iso_idx + 1]))]

Impact of this solution, before and after dropping the outliers:

Before detecting and filtering outliers:

After detecting and filtering outliers:
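A rough way to reproduce the before/after effect is to run the whole idea on synthetic data. Everything in the sketch below is an assumption made for illustration: the data is randomly generated (readings in 50-100 plus three glitches at -10000 and 1000), and 50 bins are used instead of 20. It uses `np.digitize` on the interior edges to map each sample to its histogram bin, rather than the edge-by-edge comparison above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the sensor data: normal readings plus a few glitches
readings = rng.uniform(50, 100, size=5000)
glitches = np.array([-10000.0, -10000.0, 1000.0])
data = np.concatenate([readings, glitches])

# Bin count is an assumption; it decides whether a glitch lands in its own bin
vals, edges = np.histogram(data, bins=50)

# Isolated bin: a few samples, and empty (or missing) neighbours on both sides
padded = np.pad(vals, 1, constant_values=0)
isolated = (vals > 0) & (vals < 50) & (padded[:-2] == 0) & (padded[2:] == 0)

# Map every sample to its bin; using only the interior edges sends the
# maximum value to the last bin, as np.histogram does
bin_of_sample = np.digitize(data, edges[1:-1])
cleaned = data[~isolated[bin_of_sample]]

print("before:", data.size, data.min(), data.max())
print("after: ", cleaned.size, cleaned.min(), cleaned.max())
```

The bin count matters here. With 20 bins over this range, each bin is 550 wide, so the glitch at 1000 falls in the bin right next to the one holding the normal readings and is not flagged as isolated. With 50 bins there are empty bins in between and it is flagged.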
