A faster way to implement Gaussian smoothing? (Python 3.10, NumPy)

igsr9ssn asked on 2023-01-24 in Python

I am trying to implement a Gaussian smoothing/flattening function in my Python 3.10 script to flatten a set of XY points. For each data point, I create a Y buffer and a Gaussian kernel, which I use to flatten that Y point based on its neighbours.
Here are some references on the Gaussian smoothing method:

I am using the NumPy module for the data arrays and the Matplotlib module for plotting the data.
I wrote a minimal reproducible example with some randomly generated data; every parameter the Gaussian function needs is listed at the top of the main function:

import numpy as np
import matplotlib.pyplot as plt
import time

def main():
    dataSize = 1000
    yDataRange = [-4, 4]
    reachPercentage = 0.1
    sigma = 10
    phi = 0
    amplitude = 1

    testXData = np.arange(stop = dataSize)
    testYData = np.random.uniform(low = yDataRange[0], high = yDataRange[1], size = dataSize)

    print("Flattening...")
    startTime = time.time()

    flattenedYData = GaussianFlattenData(testXData, testYData, reachPercentage, sigma, phi, amplitude)

    totalTime = round(time.time() - startTime, 2)
    print("Flattened! (" + str(totalTime) + " sec)")

    plt.title(str(totalTime) + " sec")
    plt.plot(testXData, testYData, label = "Original Data")
    plt.plot(testXData, flattenedYData, label = "Flattened Data")
    plt.legend()

    plt.show()

    plt.close()

def GaussianFlattenData(xData, yData, reachPercentage, sigma, phi, amplitude):
    flattenedYData = np.empty(shape = len(xData), dtype = float)

    # For each data point, create a Y buffer and a Gaussian kernel, and flatten it based on its neighbours
    for i in range(len(xData)):
        gaussianCenter = xData[i]
        baseReachEdges = GetGaussianValueX((GetGaussianValueY(0, 0, sigma, phi, amplitude) * reachPercentage), 0, sigma, phi, amplitude)
        reachEdgeIndices = [FindInArray(xData, GetClosestNum((gaussianCenter + baseReachEdges[0]), xData)), 
                            FindInArray(xData, GetClosestNum((gaussianCenter + baseReachEdges[1]), xData))]
        currDataScanNum = reachEdgeIndices[0] - reachEdgeIndices[1]

        # Creating Y buffer and Gaussian kernel...
        currYPoints = np.empty(shape = currDataScanNum, dtype = float)
        kernel = np.empty(shape = currDataScanNum, dtype = float)

        for j in range(currDataScanNum):
            currYPoints[j] = yData[j + reachEdgeIndices[1]]
            kernel[j] = GetGaussianValueY(j, (i - reachEdgeIndices[1]), sigma, phi, amplitude)

        # Dividing kernel by its sum...
        kernelSum = np.sum(kernel)

        for j in range(len(kernel)):
            kernel[j] = (kernel[j] / kernelSum)

        # Acquiring the current flattened Y point...
        newCurrYPoints = np.empty(shape = len(currYPoints), dtype = float)

        for j in range(len(currYPoints)):
            newCurrYPoints[j] = currYPoints[j] * kernel[j]

        flattenedYData[i] = np.sum(newCurrYPoints)

    return flattenedYData

def GetGaussianValueX(y, mu, sigma, phi, amplitude):
    # Inverse of GetGaussianValueY: the two x offsets (either side of mu) where the Gaussian equals y
    x = ((sigma * np.sqrt(-2 * np.log(y / (amplitude * np.cos(phi))))) + mu)

    return [x, (mu - (x - mu))]

def GetGaussianValueY(x, mu, sigma, phi, amplitude):
    # Gaussian centred on mu with standard deviation sigma and peak amplitude * cos(phi)
    y = ((amplitude * np.cos(phi)) * np.exp(-np.power(((x - mu) / sigma), 2) / 2))

    return y

def GetClosestNum(base, nums):
    # Linear scan for the element of nums closest in value to base
    closestIdx = 0
    closestDiff = np.abs(base - nums[0])
    idx = 1

    while (idx < len(nums)):
        currDiff = np.abs(base - nums[idx])

        if (currDiff < closestDiff):
            closestDiff = currDiff
            closestIdx = idx

        idx += 1

    return nums[closestIdx]

def FindInArray(arr, value):
    # Linear scan for the index of value in arr; returns -1 if not found
    for i in range(len(arr)):
        if (arr[i] == value):
            return i

    return -1

if (__name__ == "__main__"):
    main()

In the example above, I generate 1,000 random data points between -4 and 4. The reachPercentage variable is the fraction of the Gaussian's amplitude above which Gaussian values are included in the kernel. The phi and amplitude variables are inputs to the Gaussian function that actually generates the Gaussian used to smooth each Y data point.
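
To make the reach concrete, here is a small illustration (not part of the script above) of what reachPercentage = 0.1 and sigma = 10 work out to: the kernel extends to where the Gaussian has dropped to 10% of its peak, which is roughly 21 samples on either side of each point.

import numpy as np

sigma = 10
reachPercentage = 0.1

# Distance from the centre at which the Gaussian falls to reachPercentage of its peak
# (same formula as GetGaussianValueX with phi = 0 and amplitude = 1)
reachEdge = sigma * np.sqrt(-2 * np.log(reachPercentage))
print(reachEdge)  # ~21.46, so each kernel spans about 43 data points
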
I also wrote a few additional utility functions that I needed.
Running the script above on the generated data produced the plot below:

The original data is shown in blue and the flattened data in orange.
However, smoothing even a fairly small amount of data takes a surprisingly long time. In the example above, I generated 1,000 data points, and flattening them takes around 8 seconds. With datasets of more than 10,000 points, it can easily take over 10 minutes.
Since this is such a popular method for smoothing data, I was wondering why this script runs so slowly. I originally implemented it with standard Python lists and calls to append, but it was very slow. I hoped that switching to NumPy arrays and avoiding the append calls would make it faster, but it did not.
Is there a way to speed this up? Is there an existing Gaussian smoothing function that takes the same parameters and does the job faster?
Thanks for reading my post; any guidance is appreciated.

l7wslrjt 1#

You have a lot of loops -- and those tend to slow you down.
Here are a couple of examples: refactoring GetClosestNum to:

def GetClosestNum(base, nums):
    nums = np.array(nums)
    diffs = np.abs(nums - base)
    return nums[np.argmin(diffs)]

and refactoring FindInArray to:

def FindInArray(arr, value):
    res = np.where(np.array(arr) - value == 0)[0]
    if res.size > 0:
        return res[0]
    else:
        return -1

gets me 5,000 data points processed in 1.5 seconds, versus 54 seconds with your original code.
NumPy lets you do a lot of powerful things without loops -- Jake Vanderplas has some very good videos (old, but still good) on using NumPy constructs in place of loops for big speed-ups -- https://www.youtube.com/watch?v=EEUXKG97YRw
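
Since the x values in your example are already sorted (np.arange), you could go a step further and collapse the GetClosestNum + FindInArray pair into a single np.searchsorted lookup. A rough sketch (the FindClosestIndex name is just illustrative, and it assumes xData is sorted):

import numpy as np

def FindClosestIndex(sortedArr, value):
    # Binary search for the insertion point, then take whichever neighbour is nearer
    idx = int(np.clip(np.searchsorted(sortedArr, value), 1, len(sortedArr) - 1))
    left, right = sortedArr[idx - 1], sortedArr[idx]
    return idx - 1 if (value - left) <= (right - value) else idx

That turns the two linear scans per data point into one binary search.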

um6iljoc 2#

After asking the folks on the Python forums and doing some more searching online, I managed to find faster replacements for most of the functions in my loop.
To get a better idea of which parts of the smoothing function were taking the most time, I split the code into 4 sections and timed each one to see how much it contributed to the total runtime. To my surprise, the part taking over 90% of the time was the first section of the loop:

        gaussianCenter = xData[i]
        baseReachEdges = GetGaussianValueX((GetGaussianValueY(0, 0, sigma, phi, amplitude) * reachPercentage), 0, sigma, phi, amplitude)
        reachEdgeIndices = [FindInArray(xData, GetClosestNum((gaussianCenter + baseReachEdges[0]), xData)), 
                            FindInArray(xData, GetClosestNum((gaussianCenter + baseReachEdges[1]), xData))]
        currDataScanNum = reachEdgeIndices[0] - reachEdgeIndices[1]

Thankfully, people on the Python forums and here were able to help me out: I found a faster replacement for the GetClosestNum function (thanks Vin) and was able to drop the FindInArray function entirely.
The second half of the loop was also replaced: instead of three for loops, everything was swapped for vectorized NumPy expressions.
The whole script now looks like this:

import numpy as np
import matplotlib.pyplot as plt
import time

def main():
    dataSize = 3073
    yDataRange = [-4, 4]
    reachPercentage = 0.001
    sigma = 100
    phi = 0
    amplitude = 1

    testXData = np.arange(stop = dataSize)
    testYData = np.random.uniform(low = yDataRange[0], high = yDataRange[1], size = dataSize)

    print("Flattening...")
    startTime = time.time()

    flattenedYData = GaussianFlattenData(testXData, testYData, reachPercentage, sigma, phi, amplitude)

    totalTime = round(time.time() - startTime, 2)
    print("Flattened! (" + str(totalTime) + " sec)")

    plt.title(str(totalTime) + " sec")
    plt.plot(testXData, testYData, label = "Original Data")
    plt.plot(testXData, flattenedYData, label = "Flattened Data")
    plt.legend()

    plt.show()

    plt.close()

def GaussianFlattenData(xData, yData, reachPercentage, sigma, phi, amplitude):
    flattenedYData = np.empty(shape = len(xData), dtype = float)

    # For each data point, create a Y buffer and a Gaussian kernel, and flatten it based on its neighbours
    for i in range(len(xData)):
        gaussianCenter = xData[i]
        baseReachEdges = GetGaussianValueX((GetGaussianValueY(0, 0, sigma, phi, amplitude) * reachPercentage), 0, sigma, phi, amplitude)
        reachEdgeIndices = [np.where(xData == GetClosestNum((gaussianCenter + baseReachEdges[0]), xData))[0][0], 
                            np.where(xData == GetClosestNum((gaussianCenter + baseReachEdges[1]), xData))[0][0]]
        currDataScanNum = reachEdgeIndices[0] - reachEdgeIndices[1]

        # Creating Y buffer and Gaussian kernel...
        currYPoints = yData[reachEdgeIndices[1] : reachEdgeIndices[1] + currDataScanNum]
        kernel = GetGaussianValueY(np.arange(currDataScanNum), (i - reachEdgeIndices[1]), sigma, phi, amplitude)

        # Acquiring the current flattened Y point...
        flattenedYData[i] = np.sum(currYPoints * (kernel / np.sum(kernel)))

    return flattenedYData

def GetGaussianValueX(y, mu, sigma, phi, amplitude):
    x = ((sigma * np.sqrt(-2 * np.log(y / (amplitude * np.cos(phi))))) + mu)

    return [x, (mu - (x - mu))]

def GetGaussianValueY(x, mu, sigma, phi, amplitude):
    y = ((amplitude * np.cos(phi)) * np.exp(-np.power(((x - mu) / sigma), 2) / 2))

    return y

def GetClosestNum(base, nums):
    # Vectorised nearest-value lookup: element of nums with the smallest absolute difference from base
    nums = np.asarray(nums)

    return nums[(np.abs(nums - base)).argmin()]

if (__name__ == "__main__"):
    main()

Processing 1,000 data points no longer takes ~8 seconds; it now takes only about 0.15 seconds!

Processing 10,000 points now takes about 1.75 seconds.
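
If pulling in SciPy is an option, scipy.ndimage.gaussian_filter1d does this same kind of uniformly spaced Gaussian smoothing in a single call. A sketch, assuming sigma is given in samples; its truncate argument plays a role similar to reachPercentage by cutting the kernel off at a multiple of sigma:

import numpy as np
from scipy.ndimage import gaussian_filter1d

testYData = np.random.uniform(low = -4, high = 4, size = 1000)

# truncate ~ sqrt(-2 * ln(reachPercentage)), about 2.15 for reachPercentage = 0.1
flattenedYData = gaussian_filter1d(testYData, sigma = 10, truncate = 2.15)
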
Thanks everyone for the feedback, cheers!
