Pandas - improving performance when grouping and applying a custom function

flmtquvp · asked 2022-11-27

I have a DataFrame like the one below; the data is a little over 100,000 rows.
| Category | val1 | val2 | val3 | val4 |
| --- | --- | --- | --- | --- |
| A | 1 | 2 | 3 | 4 |
| A | 4 | 3 | 2 | 1 |
| B | 1 | 2 | 3 | 4 |
| B | 3 | 4 | 1 | 2 |
| B | 1 | 5 | 3 | 1 |
I want to group by the Category column first, and then run my own calculation on each group.
The custom method returns a float value, cal.
The desired output is a dictionary with the results, in this form:

{ 
    'A': { 'cal': a },
    'B': { 'cal': b },
    ...
}

I tried using pandas `groupby` + `apply`:

def my_cal(df):
    ret = ...
    return {'cal': ret}

df.groupby('Category').apply(lambda grp: my_cal(grp)).to_dict()

When I measure the time with timeit in a Jupyter notebook, it takes more than 1 second, which is too long for me.
Is there a way to optimize this and shorten the execution time?

  • ------------- EDIT -------------
    Changed the parameters of my_cal from a DataFrame to arrays.
def my_cal(val1: float, val2: float, val3: float, val4: float):
    ret = inner_cal(val1, val2, val3, val4) # inner_cal is in external library
    return {'cal': ret}

df.groupby('Category').apply(lambda grp: my_cal(grp['val1'].to_numpy(),
                                                grp['val2'].to_numpy(),
                                                grp['val3'].to_numpy(),
                                                grp['val4'].to_numpy())).to_dict()
4xrmg8kj (answer #1)

Here are a few things you can try:

  • If possible, reduce the number of rows before the groupby is applied by dropping elements with invalid values (see the sketch after this list).
  • Reduce the DataFrame's memory footprint by shrinking its column data types.
  • Use numba to generate an optimized machine-code version of the my_cal function.
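
For the first point, here is a minimal sketch, assuming "invalid" means NaN or non-positive values (adjust the mask to your own validity rule):

value_cols = ["val1", "val2", "val3", "val4"]

# Keep only rows where all four value columns are present and positive
# (hypothetical validity rule; replace it with your own).
mask = df[value_cols].notna().all(axis=1) & (df[value_cols] > 0).all(axis=1)
df = df[mask]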

You can also find other strategies you might consider trying here: https://pandas.pydata.org/docs/user_guide/enhancingperf.html#

Shrinking column data types

The following code converts each column's data type to the smallest representation that can hold its values, reducing the DataFrame's memory usage. For example, if a column's values are stored as int64, the code checks whether the column's value range can instead be represented as int8, int16, or int32. In addition, it can convert columns of type object to category, and int to uint.

import numpy as np
import pandas as pd

def df_shrink_dtypes(df, skip=None, obj2cat=True, int2uint=False):
    """
    Try to shrink data types for ``DataFrame`` columns.

    Allows ``object`` -> ``category``, ``int`` -> ``uint``, and exclusion.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to shrink.
    skip : list, default=[]
        The names of the columns to skip.
    obj2cat : bool, default=True
        Whether to cast ``object`` columns to ``category``.
    int2uint : bool, default=False
        Whether to cast ``int`` columns to ``uint``.

    Returns
    -------
    new_dtypes : dict
        The new data types for the columns.
    """
    if skip is None:
        skip = []
    # 1: Build column filter and type-map
    excl_types, skip = {"category", "datetime64[ns]", "bool"}, set(skip)

    typemap = {
        "int": [
            (np.dtype(x), np.iinfo(x).min, np.iinfo(x).max)
            for x in (np.int8, np.int16, np.int32, np.int64)
        ],
        "uint": [
            (np.dtype(x), np.iinfo(x).min, np.iinfo(x).max)
            for x in (np.uint8, np.uint16, np.uint32, np.uint64)
        ],
        "float": [
            (np.dtype(x), np.finfo(x).min, np.finfo(x).max)
            for x in (np.float32, np.float64, np.longdouble)
        ],
    }
    if obj2cat:
        # User wants to "categorify" dtype('Object'),
        # which may not always save space.
        typemap["object"] = "category"
    else:
        excl_types.add("object")

    new_dtypes = {}
    # Predicate: True for columns we should consider shrinking
    # (dtype not excluded and column name not in `skip`).
    exclude = lambda dt: dt[1].name not in excl_types and dt[0] not in skip

    for c, old_t in filter(exclude, df.dtypes.items()):
        t = next((v for k, v in typemap.items() if old_t.name.startswith(k)), None)

        # Find the smallest type that fits
        if isinstance(t, list):
            if int2uint and t == typemap["int"] and df[c].min() >= 0:
                t = typemap["uint"]
            new_t = next(
                (r[0] for r in t if r[1] <= df[c].min() and r[2] >= df[c].max()), None
            )
            if new_t and new_t == old_t:
                new_t = None
        else:
            new_t = t if isinstance(t, str) else None
        if new_t:
            new_dtypes[c] = new_t
    return new_dtypes

def df_shrink(df, skip=None, obj2cat=True, int2uint=False):
    """Reduce memory usage, shrinking data types for ``DataFrame`` columns.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to shrink.
    skip : list, default=[]
        The names of the columns to skip.
    obj2cat : bool, default=True
        Whether to cast ``object`` columns to ``category``.
    int2uint : bool, default=False
        Whether to cast ``int`` columns to ``uint``.

    Returns
    -------
    df : pandas.DataFrame
        The dataframe with the new data types.

    See Also
    --------
    - :func:`df_shrink_dtypes`: function that determines the new data types to
      use for each column.
    """
    if skip is None:
        skip = []
    dt = df_shrink_dtypes(df, skip, obj2cat=obj2cat, int2uint=int2uint)
    return df.astype(dt)

Example:

# Generating dataframe with 100,000 rows, and 5 columns:

nrows = 100_000
cats = ["A", "B", "C", "D", "E", "F", "G"]

df = pd.DataFrame(
    {"Category": np.random.choice(cats, size=nrows),
     "val1": np.random.randint(1, 8, nrows),
     "val2": np.random.randint(1, 8, nrows),
     "val3": np.random.randint(1, 8, nrows),
     "val4": np.random.randint(1, 8, nrows)}
)

df.dtypes
#
# Category    object
# val1         int64
# val2         int64
# val3         int64
# val4         int64
# dtype: object

# Applying `df_shrink` to `df` columns:
_df = df_shrink(df)

_df.dtypes
#
# Category    category
# val1            int8
# val2            int8
# val3            int8
# val4            int8
# dtype: object

# Comparing memory usage of `df` vs. `_df`:

df.info(memory_usage=True)
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100000 entries, 0 to 99999
# Data columns (total 5 columns):
#  #   Column    Non-Null Count   Dtype 
# ---  ------    --------------   ----- 
#  0   Category  100000 non-null  object
#  1   val1      100000 non-null  int64 
#  2   val2      100000 non-null  int64 
#  3   val3      100000 non-null  int64 
#  4   val4      100000 non-null  int64 
# dtypes: int64(4), object(1)
# memory usage: 3.8+ MB     <---- Original memory footprint

_df.info(memory_usage=True)
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100000 entries, 0 to 99999
# Data columns (total 5 columns):
#  #   Column    Non-Null Count   Dtype   
# ---  ------    --------------   -----   
#  0   Category  100000 non-null  category
#  1   val1      100000 non-null  int8    
#  2   val2      100000 non-null  int8    
#  3   val3      100000 non-null  int8    
#  4   val4      100000 non-null  int8    
# dtypes: category(1), int8(4)
# memory usage: 488.8 KB     <---- Almost 8x reduction!

Using numba to generate an optimized machine-code version of the my_cal function

To install numba in your Python environment, run the following command:

pip install -U numba

To use numba with pandas, you must define my_cal and decorate it with @jit. You also need to pass the underlying grp values as NumPy arrays, which you can do with the to_numpy() method. Here is an example:

import numpy as np
import pandas as pd
import numba

# NOTE: pass each column as a separate argument, with explicit types, to improve performance.
@numba.jit
def my_cal(val1: int, val2: int, val3: int, val4: int):
    return val1 + val2 + val3 + val4

%%timeit
# Using the numba-optimized version of `my_cal`:
_df.groupby('Category').apply(
    lambda grp: my_cal(
        grp['val1'].to_numpy(),
        grp['val2'].to_numpy(),
        grp['val3'].to_numpy(),
        grp['val4'].to_numpy(),
    )
).to_dict()
# 6.33 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
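
Note that the timing above reflects steady-state performance: the very first call to a @numba.jit-decorated function also pays a one-off compilation cost. A minimal warm-up sketch (assuming the int8 columns produced by df_shrink):

# Warm-up call: triggers JIT compilation once, so subsequent calls
# (and the %%timeit runs) measure execution time only.
z = np.zeros(1, dtype=np.int8)
_ = my_cal(z, z, z, z)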

Execution time comparison

The following code compares different ways of implementing the DataFrame.groupby/apply operation:

# OPTION 1: original implementation
df.groupby('Category').apply(lambda grp: grp.sum(numeric_only=True)).to_dict()
# 18.9 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# OPTION 2: original implementation with memory optimized dataframe
_df.groupby('Category').apply(lambda grp: grp.sum(numeric_only=True)).to_dict()
# 9.96 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# OPTION 3: Using numba optimized `my_cal` function, with memory optimized dataframe
_df.groupby('Category').apply(
    lambda grp: my_cal(
        grp['val1'].to_numpy(),
        grp['val2'].to_numpy(),
        grp['val3'].to_numpy(),
        grp['val4'].to_numpy(),
    )
).to_dict()
# 6.33 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary of the results:

| Implementation | Execution time per loop |
| --- | --- |
| Option 1 | 18.9 ms ± 500 µs |
| Option 2 | 9.96 ms ± 140 µs |
| Option 3 | 6.33 ms ± 221 µs |
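
For completeness: when the per-group calculation can be expressed with pandas' built-in (Cython-backed) aggregations, skipping apply entirely is usually the fastest option of all, because it avoids one Python-level function call per group. A sketch for the sum example used above (a real my_cal may not decompose this way):

# Sketch: built-in groupby aggregation; one vectorized pass
# instead of a Python call per group.
sums = _df.groupby('Category')[['val1', 'val2', 'val3', 'val4']].sum()
result = {k: {'cal': v} for k, v in sums.sum(axis=1).items()}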

Edit: optimizing the my_cal function with numba

Warning

Numba works best at accelerating functions that apply numerical operations to NumPy arrays. If you try to @jit a function that contains unsupported Python or NumPy code, compilation falls back to object mode, which will most likely not speed your function up. The warning you received appears because my_cal calls an inner function that has not been optimized with @jit, so numba cannot optimize your code. If you can access and modify inner_cal, you can try adding the @jit decorator to it and specifying type hints for its parameters. The catch is that if inner_cal itself calls other functions, those would also have to be numba-compatible. Before deciding to convert all the inner functions to numba, I strongly recommend profiling the code to determine whether those inner functions actually operate on NumPy arrays; otherwise it is a waste of time. A minimal profiling sketch is shown below.
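
A minimal profiling sketch using the standard library's cProfile (df, my_cal, and the column names are the ones from the example above):

import cProfile

# Profile one full groupby/apply pass; sorting by cumulative time shows
# whether the time is spent inside my_cal/inner_cal or in pandas overhead.
cProfile.run(
    "df.groupby('Category').apply(lambda g: my_cal("
    "g['val1'].to_numpy(), g['val2'].to_numpy(), "
    "g['val3'].to_numpy(), g['val4'].to_numpy()))",
    sort='cumtime',
)
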
Here is an example of what the inner_cal function should look like if you use numba:

@numba.jit
def inner_cal(val1: float, val2: float, val3: float, val4: float) -> float:
    return val1 + val2 + val3 + val4

@numba.jit
def my_cal(val1: float, val2: float, val3: float, val4: float) -> dict:
    ret = inner_cal(val1, val2, val3, val4) # inner_cal is in external library
    return {'cal': ret}
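
Note that returning a plain Python dict from nopython-compiled code is limited (numba uses its own typed dictionaries), so the dict-returning version above may not compile on all numba versions. A workaround sketch: keep only the numerical work inside the jitted function and build the dict in ordinary Python (my_cal_core is a hypothetical helper name):

@numba.jit
def my_cal_core(val1, val2, val3, val4):
    # Numerical work only; this part stays inside compiled code.
    return inner_cal(val1, val2, val3, val4)

def my_cal(val1, val2, val3, val4):
    # Plain Python wrapper: build the result dict outside numba.
    return {'cal': my_cal_core(val1, val2, val3, val4)}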
