pandas apply(): applying a custom function to all columns more efficiently

Posted by xuo3flqw on 2023-01-01 in Other


I apply this function

def calculate_recency_for_one_column(column: pd.Series) -> int:
    """Returns the inverse position of the last non-zero value in a pd.Series of numerics.
    If the last value is non-zero, returns 1. If all values are zero, returns 0."""
    non_zero_values_of_col = column[column.astype(bool)]
    if non_zero_values_of_col.empty:
        return 0
    return len(column) - non_zero_values_of_col.index[-1]

to all columns of this example DataFrame

df = pd.DataFrame(np.random.binomial(n=1, p=0.001, size=[1000000]).reshape((1000,1000)))

by using

df.apply(lambda column: calculate_recency_for_one_column(column),axis=0)

The result is:

0      436
1        0
2      624
3        0
      ... 
996    155
997    715
998    442
999    163
Length: 1000, dtype: int64

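Everything works, but my program has to perform this operation very often, so I need a more efficient alternative. Does anyone know how to make this faster? I think calculate_recency_for_one_column() itself is efficient enough and that df.apply() has the most room for improvement. Here is a benchmark (100 repetitions):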

>> timeit.timeit(lambda: df.apply(lambda column: calculate_recency_for_one_column(column),axis=0), number=100)
14.700050864834338

Update:

Mustafa's answer:

>> timeit.timeit(lambda: pd.Series(np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())), number=100)
0.8847485752776265

padu's answer:

>> timeit.timeit(lambda: df.apply(calculate_recency_for_one_column_numpy, raw=True, axis=0), number=100)
0.8892530500888824

pandas

Source: https://stackoverflow.com/questions/74940265/apply-custom-function-on-all-columns-increase-efficiency

2 answers


x0fgdtte1#

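Instead of treating each column as a Series object, you can treat it as a NumPy array. To do that, just pass raw=True to the apply method. The original function also needs a small change.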

import time

import numpy as np
import pandas as pd

def calculate_recency_for_one_column(column: np.ndarray) -> int:
    """Returns the inverse position of the last non-zero value in a np.ndarray of numerics.
    If the last value is non-zero, returns 1. If all values are zero, returns 0."""
    # Positions (not values) of the non-zero entries.
    non_zero_values_of_col = np.nonzero(column)[0]
    # Check emptiness with .size: .any() would treat a lone non-zero entry at position 0 as "nothing found".
    if non_zero_values_of_col.size == 0:
        return 0
    return len(column) - non_zero_values_of_col[-1]

df = pd.DataFrame(np.random.binomial(n=1, p=0.001, size=[1000000]).reshape((1000,1000)))

start = time.perf_counter()
res = df.apply(calculate_recency_for_one_column, raw=True)
print(f'time took {time.perf_counter() - start:.3f} s.')

Out:
    0.005 s.
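
As a small sketch (not part of the answer), the raw=True approach can be checked against the Series-based function on a few hand-written edge cases: an all-zero column, a column whose only non-zero value is in the first row, and columns whose last non-zero value is at the end or in the middle. The names recency_numpy, recency_series and small are made up for this illustration, and np.flatnonzero is used here in place of np.nonzero(...)[0].

```python
import numpy as np
import pandas as pd


def recency_numpy(column: np.ndarray) -> int:
    """Hypothetical helper: works on a raw ndarray, as passed by apply(raw=True)."""
    nonzero_positions = np.flatnonzero(column)
    if nonzero_positions.size == 0:  # no non-zero entry at all
        return 0
    return len(column) - int(nonzero_positions[-1])


def recency_series(column: pd.Series) -> int:
    """Series-based version from the question, used as the reference."""
    non_zero = column[column.astype(bool)]
    if non_zero.empty:
        return 0
    return len(column) - non_zero.index[-1]


# Illustrative data with a default 0..n-1 index.
small = pd.DataFrame({
    "all_zero": [0, 0, 0, 0],
    "first_only": [1, 0, 0, 0],
    "last": [0, 0, 0, 1],
    "middle": [0, 1, 1, 0],
})

raw_result = small.apply(recency_numpy, raw=True)   # columns arrive as ndarrays
series_result = small.apply(recency_series)         # columns arrive as Series

print(raw_result.to_dict())
print((raw_result == series_result).all())
```

The "first_only" column is the case worth watching: its only non-zero position is 0, so any emptiness test based on the truthiness of the positions themselves would misclassify it.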

2023-01-01

sqougxex2#

np.where is a vectorized if-else, so:

np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())

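For a given column, are *all* of its values equal to 0?
If so, put 0 in the result.
Otherwise, get the index of the *last* 1 (hence the reversal with [::-1]) and subtract it from len(df) (a reversed subtraction, rsub).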
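
The three steps can be traced on a tiny frame. This is only a sketch with made-up data (small and the intermediate names are illustrative). It also assumes one adjustment that is not in the answer: comparing against 0 with ne(0) before calling idxmax, so that the "last non-zero" lookup also holds for columns that are not strictly 0/1. On the binary data from the question, df[::-1].idxmax() alone gives the same result. Both forms rely on the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Illustrative data: one all-zero column, one 0/1 column, one non-binary column.
small = pd.DataFrame({
    "all_zero": [0, 0, 0, 0],
    "binary": [1, 0, 1, 0],
    "non_binary": [0, 5, 0, 2],
})

# Step 1: is every value in the column equal to 0?
is_all_zero = small.eq(0).all()

# Step 2: reverse the rows so that idxmax (first maximum) lands on the
# last non-zero row; ne(0) turns any non-zero value into True.
last_nonzero_label = small.ne(0)[::-1].idxmax()

# Step 3: vectorized if-else over the columns.
recency = pd.Series(
    np.where(is_all_zero, 0, len(small) - last_nonzero_label),
    index=small.columns,
)
print(recency.to_dict())
```

Without ne(0), the "non_binary" column would report the position of its maximum (5, in row 1) rather than of its last non-zero value (2, in row 3).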

Timing comparison:

In [261]: %timeit np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())
10.6 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [262]: %timeit df.apply(calculate_recency_for_one_column)
180 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And a sanity check:

In [263]: (np.where(df.eq(0).all(), 0, len(df) - df[::-1].idxmax())
                 == df.apply(calculate_recency_for_one_column)).all()
Out[263]: True
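
A further variant, which comes from neither answer and is offered only as a sketch, works on positions in the underlying NumPy array, so it does not depend on the index labels being 0..n-1. The function name recency_all_columns, the seed, and the smaller frame size are choices made for this illustration. The same kind of sanity check is run against the apply-based function from the question.

```python
import numpy as np
import pandas as pd


def recency_all_columns(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: recency of every column in one vectorized pass."""
    mask = frame.to_numpy() != 0
    # In the row-reversed mask, argmax gives the number of rows between
    # the bottom of the column and its last non-zero entry.
    rows_from_bottom = mask[::-1].argmax(axis=0)
    result = np.where(mask.any(axis=0), rows_from_bottom + 1, 0)
    return pd.Series(result, index=frame.columns)


def recency_one_column(column: pd.Series) -> int:
    """Series-based function from the question, used as the baseline."""
    non_zero = column[column.astype(bool)]
    if non_zero.empty:
        return 0
    return len(column) - non_zero.index[-1]


rng = np.random.default_rng(0)  # fixed seed, for repeatability only
df = pd.DataFrame(rng.binomial(n=1, p=0.01, size=(200, 50)))

vectorized = recency_all_columns(df)
baseline = df.apply(recency_one_column)
print((vectorized == baseline).all())
```

No timing is claimed for this variant; it would need to be benchmarked on the real data alongside the two answers.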

2023-01-01
