numpy: How can I find the number of times a set of ranges occurs in a dataset?

ymzxtsji  posted on 2023-08-05 in Other

I would appreciate some help/guidance on finding how many times each of a set of ranges occurs within an array of min/max values.

import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.rand(1000, 2),
                  columns=["Min", "Max"]) \
    .mul(6).query("Min <= Max").reset_index(drop=True)

lb = np.arange(1, 5, 0.1)
ub = np.arange(1, 5, 0.1)    

a = df[['Min','Max']].to_numpy()

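My first attempt was to convert the min/max columns to a numpy array and iterate over every row and every (lower bound, upper bound) pair to generate the output.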

factor = 1

arry = []

for row in a:    
    for x in lb:
        for y in ub:
            lbExists = 0
            rangeExists = 0
            if y <= x:
                continue
            if x >= row[0] and x <= row[1]:
                lbExists = 1            
            if x >= row[0] and y <= row[1]:
                rangeExists = 1
            project = 0
            if lbExists == 1 and rangeExists == 0:
                project -= factor
            if lbExists == 1 and rangeExists == 1:
                project = (y - x) * factor

            arry.append([round(x, 2), round(y, 2), row[0], row[1], lbExists, rangeExists , round(project,2)])


Output (truncated):

[[1.0, 1.1, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.2, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.3, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.4, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.5, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.6, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.7, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.8, 1.34, 5.8, 0, 0, 0],
 [1.0, 1.9, 1.34, 5.8, 0, 0, 0],
 [1.0, 2.0, 1.34, 5.8, 0, 0, 0],
...


After that, I want to run some analysis to find out how many times each range occurs in the set.
Is there a better way to do this?

3duebb1j  (answer #1)

A faster way to compute this is to replace the nested loops with vectorized numpy operations (broadcasting and np.where).

# array of all valid permutations of x and y
arr_xy = np.array(np.meshgrid(lb.round(2), ub.round(2))).T.reshape(-1, 2)
arr_xy = arr_xy[arr_xy[:, 0] < arr_xy[:, 1]]

# lbExists - (x >= row[0] and x <= row[1])
lbExists = (arr_xy[:, 0][:, np.newaxis] >= a[:, 0]) * \
    (arr_xy[:, 0][:, np.newaxis] <= a[:, 1])
# rangeExists - (x >= row[0] and y <= row[1])
rangeExists = lbExists * (arr_xy[:, 1][:, np.newaxis] <= a[:, 1])

# project based off of lbExists initially (else 0)...
project = np.where(lbExists,
                   # then on value of rangeExists
                   np.where(rangeExists,
                            # (y - x) * factor if true
                            np.tile(np.diff(arr_xy) * factor, a.shape[0]),
                            # else -1 * factor
                            np.full(rangeExists.shape, -factor)),
                   0)

# combine variables
arr_out = np.hstack([
    # permutations of upper and lower bound
    np.vstack([arr_xy] * a.shape[0]),
    # repeated values of Min and Max
    np.repeat(a, arr_xy.shape[0], axis=0),
    # lbExists 2d -> 1d
    lbExists.T.reshape(-1)[:, np.newaxis],
    # rangeExists 2d -> 1d
    rangeExists.T.reshape(-1)[:, np.newaxis],
    # project 2d -> 1d
    project.T.reshape(-1)[:, np.newaxis].round(2)])

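The comparisons above rely on numpy broadcasting: adding a new axis to the column of x values lets every candidate be compared against every (Min, Max) row at once. Here is a minimal sketch of that single step, using small made-up arrays (`x` and `rows` are hypothetical stand-ins for `arr_xy[:, 0]` and `a`), and `&` in place of `*` for the boolean AND:

```python
import numpy as np

# Hypothetical toy data: 3 lower-bound candidates and 2 (Min, Max) rows
x = np.array([1.0, 2.0, 3.0])
rows = np.array([[0.5, 2.5],
                 [1.5, 4.0]])

# x[:, np.newaxis] has shape (3, 1); rows[:, 0] and rows[:, 1] have shape (2,).
# Broadcasting compares every x against every row, giving shape (3, 2).
lb_exists = (x[:, np.newaxis] >= rows[:, 0]) & (x[:, np.newaxis] <= rows[:, 1])

print(lb_exists.shape)  # (3, 2)
print(lb_exists)
```

Each row of the result corresponds to one candidate x and each column to one (Min, Max) row, which is why the answer transposes before flattening to match the loop's row-major output order.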
I've added comments throughout the code, but please ask if anything needs clarifying.
The difference in timing (roughly 100x faster):

# your loop in the question
5.84 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this method
50.2 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


And to confirm that the two results are identical:

(arr_out==np.array(arry)).all()
#Out: True
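
For the follow-up analysis mentioned in the question (how many times each range occurs in the set), one possible sketch is to sum the boolean range matrix along the row axis. This assumes that "occurs" means the (x, y) range lies entirely inside a row's [Min, Max] interval, i.e. the rangeExists condition. The small arrays below are hypothetical stand-ins for `a`, `lb` and `ub`:

```python
import numpy as np

# Hypothetical toy inputs standing in for `a`, `lb` and `ub` from the question
a = np.array([[0.5, 2.5],
              [1.5, 4.0],
              [1.0, 3.0]])
lb = np.array([1.0, 2.0])
ub = np.array([2.0, 3.0])

# All (x, y) pairs with x < y, built the same way as in the answer
arr_xy = np.array(np.meshgrid(lb, ub)).T.reshape(-1, 2)
arr_xy = arr_xy[arr_xy[:, 0] < arr_xy[:, 1]]

x = arr_xy[:, 0][:, np.newaxis]
y = arr_xy[:, 1][:, np.newaxis]

# Shape (n_pairs, n_rows): True where the row's [Min, Max] contains [x, y]
range_exists = (x >= a[:, 0]) & (y <= a[:, 1])

# Number of rows that contain each (x, y) range
counts = range_exists.sum(axis=1)

for (lo, hi), n in zip(arr_xy, counts):
    print(lo, hi, n)
```

With the full data, `counts` has one entry per valid (x, y) pair, so it can be placed next to `arr_xy` in a DataFrame for further analysis.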
