pandas: a way to improve the run time of a Python function

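I have a function that relies on a dictionary to identify the files it needs to read and the variable names they are linked to (mean values in this case), and returns the results as columns in an existing DataFrame. This works well when the files are all different, but in several cases I have different variables linked to the same file. The function computes some zonal statistics for a GeoPandas DataFrame.
Is there a more efficient way to read the files? At the moment the function reads the same file several times and computes the same values, only saving them under different column names. This turns out to be very inefficient in terms of time.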

# The data dictionary has a structure like the one below. Sometimes there are 4-5 variables linked to the same file.
FILEPATH = {"variable_1": "path/commonfile.tif",
            "variable_2": "path/commonfile.tif",
            "variable_3": "path/commonfile.tif",
            "variable_4": "path/otherfile1.tif",
            "variable_5": "path/someotherfile1.tif"}

# My function is called in a loop like the one below
for variable, filename in FILEPATH.items():
    my_df.loc[:, variable] = myfunction(df=my_df, file=filename, stats_list=['mean'])

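One way to improve the run time of your Python code is to read the .tif files before calling myfunction, and to modify myfunction so that it expects the loaded .tif object rather than a string holding the file path. That way each file is read once instead of being re-read over and over.
Here is an example of how this could be implemented: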

import time
from typing import Dict

import numpy as np
from PIL import Image
import pandas as pd

start_time = time.time()

# The data dictionary has a structure like the one below.
# Sometimes there are 4-5 variables linked to the same file.
FILEPATHS = {
    "variable_0": "/some/path/example_tiff2.tiff",
    "variable_1": "/some/path/example_tiff3.tiff",
    "variable_2": "/some/path/example_tiff3.tiff",
    "variable_3": "/some/path/example_tiff2.tiff",
    "variable_4": "/some/path/example_tiff3.tiff",
    "variable_5": "/some/path/example_tiff3.tiff",
    "variable_6": "/some/path/example_tiff.tiff",
}

def transform_filepaths(filepaths: Dict[str, str]) -> Dict[str, np.ndarray]:
    """
    Transforms a dictionary of `filepaths` to a dictionary of numpy arrays.

    Function reads each distinct filepath once, and then replaces each
    variable's filepath with its corresponding numpy array.

    Parameters
    ----------
    filepaths : Dict[str, str]
        Dictionary that contains each variable's filepath.

    Returns
    -------
    Dict[str, np.ndarray]
        Dictionary that contains each variable's numpy array representation
        of the TIFF file.
    """
    # Read every distinct file exactly once, keyed by its path.
    tiff_arrays_dict = {
        filepath: np.array(Image.open(filepath)) for filepath in set(filepaths.values())
    }
    # Variables that share a path end up sharing the same array object.
    return {
        variable_name: tiff_arrays_dict[filepath]
        for variable_name, filepath in filepaths.items()
    }

def myfunction(df: pd.DataFrame, tiff_array: np.ndarray, stats_list=None):
    """Function with the same logic as before, but now it expects a numpy array
    instead of a filepath to then read the TIFF file.
    """
    # Placeholder: do stuff with the tiff_array (the real statistics go here)
    return [tiff_array] * df.shape[0]

# == Create some dummy data ====================================================
my_df = pd.DataFrame({"variable_1": [1, 2, 3], "variable_2": [4, 5, 6]})

# == Transform the filepaths to numpy arrays ===================================
filepaths = transform_filepaths(FILEPATHS)

# == Apply the function ========================================================
for variable, tiff_object in filepaths.items():
    # Instead of passing the filepath, we pass the numpy array
    # therefore we don't need to read the TIFF file multiple times.
    my_df.loc[:, f"{variable}"] = myfunction(my_df, tiff_object, stats_list=["mean"])

print(f"Took: {time.time() - start_time:.2f}s")
# Original implementation: 6.69s
# New implementation: 0.42s

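In the code above, we define a new function named transform_filepaths. It reads each distinct file listed in FILEPATHS once and builds a new dictionary that maps each variable name to its loaded object. To use this new dictionary, you also have to modify myfunction so that it expects the loaded object rather than a string containing the file location.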
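Reading each file once removes the repeated I/O, but the statistics themselves would still be recomputed for every variable that shares a file. Since the question notes that those values are identical, a further option is to group the variables by file, run the expensive call once per distinct file, and copy the result into every linked column. The following is only a minimal sketch of that idea: `group_variables_by_file`, `add_stat_columns` and `fake_zonal_mean` are hypothetical names introduced here, and `fake_zonal_mean` is a stand-in for the real zonal statistics function, whose signature is assumed.

```python
from collections import defaultdict
from typing import Callable, Dict, List

import pandas as pd


def group_variables_by_file(filepaths: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert {variable: path} into {path: [variables sharing that path]}."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for variable, path in filepaths.items():
        grouped[path].append(variable)
    return dict(grouped)


def add_stat_columns(
    df: pd.DataFrame,
    filepaths: Dict[str, str],
    compute_stat: Callable[[pd.DataFrame, str], list],
) -> pd.DataFrame:
    """Call compute_stat once per distinct file and reuse the result
    for every column that is linked to that file."""
    for path, variables in group_variables_by_file(filepaths).items():
        values = compute_stat(df, path)  # the expensive call, once per file
        for variable in variables:
            df[variable] = values
    return df


# Hypothetical stand-in for the real zonal statistics function.
# It records each call so the number of computations can be inspected.
calls: List[str] = []


def fake_zonal_mean(df: pd.DataFrame, path: str) -> list:
    calls.append(path)
    return [float(len(path))] * len(df)


FILEPATH = {
    "variable_1": "path/commonfile.tif",
    "variable_2": "path/commonfile.tif",
    "variable_3": "path/commonfile.tif",
    "variable_4": "path/otherfile1.tif",
    "variable_5": "path/someotherfile1.tif",
}

my_df = pd.DataFrame({"zone": ["a", "b", "c"]})
my_df = add_stat_columns(my_df, FILEPATH, fake_zonal_mean)
```

With the five variables above, the stand-in function is called three times (once per distinct file) instead of five. This only applies when the variables that share a file really do need the same statistic; if they need different statistics, caching the loaded array as in the answer above is the safer choice.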
