numpy 在Python中查找大型数据集一部分周围的最优(最小)船体

kfgdxczn  于 2022-11-24  发布在  Python
关注(0)|答案(1)|浏览(152)

我有许多(x,y)点,我想在这些点的可变百分比周围拟合最小(或最小的近似或估计)船体。
我可以找到的现有实现得到了所有点周围的船体。我怎么能有这样的函数:

def get_convec_hull(points, percentage):
    # some code here
    return smallest_hull

我可以使用下面的代码来强行得到我想要的结果。这正是我想要的,只是它真的很慢。在40个数据点的25%周围找到最小的船体已经需要几个部分。我的真实数据有好几百万个点。

# A brute force solution..

from scipy.spatial import ConvexHull
import itertools

import numpy as np
import math

np.random.seed(11111)

# Simulating my data..
n_points = 20
target_percentage = 0.25
points = np.random.random((n_points, 2))

def get_convec_hull(points, percentage):
    n_points = points.shape[0]
    target_points = int(n_points * percentage)
    print("Looking for smallest polygon covering {} of the {} datapoints!".format(target_points, n_points))
    print(target_points)

    subsets = itertools.combinations(list(range(n_points)), target_points)

    optimal = ConvexHull(points[next(subsets),:]) # first one designated as optimal at the start
    for subset in subsets:
        hull = ConvexHull(points[subset,:])
        if hull.area < optimal.area:
            optimal = hull

    return optimal

optimal = get_convec_hull(points, target_percentage) 
optimal.area # returns 0.85234...

此外,仅为了说明,这是“最优的”,即,对于模拟数据集,大约25%(5个点)的最小船体。

如何将此应用于更大的数据集?

ljsrvy3e

ljsrvy3e1#

请意识到我冒着生命危险发布这个;我相当肯定实际的计算机科学家已经在追踪我的IP。
我以下列方式重写了您的函数:

def get_suboptimal_hull(points,p,magic_number):
    n_points = points.shape[0]
    tp = int(n_points * p)
    
    d_mat = distance_matrix(points,points)
    d_mat_s = d_mat.copy()
    d_mat_s.sort(axis=1)
    
    dist = np.sum(d_mat_s[:,0:5],axis=1)
    dens = d_mat_s[:,4]/(dist**2)
    densstr = [str(x)[0:4] for x in dens]
    dens_s = dens.copy()
    dens_s.sort()

    mnp = -1 * int(magic_number*points.shape[0]) - 1
    suboptimal = points[dens > dens_s[mnp]]
        
    n_points = len(suboptimal)
    
    subsets = itertools.combinations(list(range(n_points)), tp)

    optimal = ConvexHull(suboptimal[next(subsets),:])
    for subset in subsets:
        hull = ConvexHull(suboptimal[subset,:])
        if hull.area < optimal.area:
            optimal = hull

    return optimal

我实现加速的方法是基于密度参数"巧妙地"修剪点树。通过修剪除数为magic_number的树,可以实现magic_number^2除数加速(生成的树为N^2)。需要注意两件事:

  • 我还是用你的函数来逼近最优值
  • 加速比和性能在很大程度上取决于magic_number的值。

目前,magic_number是点的百分比,但可以很容易地修改为固定的数字以稳定性能。可能对您的应用程序有用。
初始化:

from scipy.spatial import ConvexHull
from scipy.spatial import distance_matrix
import itertools
import numpy as np
import math
import matplotlib.pyplot as plt

np.random.seed(11111)

n_points = 20
target_percentage = 0.25
points = np.random.random((n_points, 2))
magic_number = 0.5

面积是相同的(我认为,可能有例外,即is不为真):

optimal = get_convec_hull(points, target_percentage) 
suboptimal = get_suboptimal_hull(points, target_percentage,0.5)
print((optimal.area, suboptimal.area))

退货:

(0.8523426939615691, 0.8523426939615691)

(我注意到在这个配置中,这是周长。还是一样的艰难。)
加速:

yours = %timeit -r 3 -o get_convec_hull(points, target_percentage)
mine = %timeit -r 3 -o get_suboptimal_hull(points, target_percentage,magic_number)
your_t = np.mean(yours.all_runs)
my_t = np.mean(mine.all_runs)
print(f'For magic number {magic_number}, expected {int(100*(1/magic_number**2))} % speedup, got a {int(100* (your_t/my_t))} % speedup.')

退货:

8.44 s ± 162 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
143 ms ± 1.67 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
For magic number 0.5, expected 400 % speedup, got a 590 % speedup.

相关问题