pandas: using Numba instead of itertuples() to iterate over DataFrames and speed up code

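w41d8nur, posted 2022-11-27 in Other

"My problem"
I have three DataFrames that I loop over with itertuples.
itertuples used to work fine, but I now run so many iterations that it is no longer efficient enough.
I'd like to use vectorization or Numba, since I've heard both are very fast. I've tried to get them working, but I can't figure it out.
All three DataFrames hold Open, High, Low, Close candlestick data, plus a few other columns such as "FG_Top".
The DataFrames are: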

  1. dflong - 15-minute candlestick data
  2. dfshort - 5-minute candlestick data
  3. dfshorter - 1-minute candlestick data

DataFrame creation code, as requested in the comments:
import numpy as np
import pandas as pd

idx15m = ['2022-10-29 06:59:59.999', '2022-10-29 07:14:59.999', '2022-10-29 07:29:59.999', '2022-10-29 07:44:59.999',
         '2022-10-29 07:59:59.999', '2022-10-29 08:14:59.999', '2022-10-29 08:29:59.999']

opn15m = [19010, 19204, 19283, 19839, 19892, 20000, 20192]
hgh15m = [19230, 19520, 19921, 19909, 20001, 20203, 21065]
low15m = [18782, 19090, 19245, 19809, 19256, 19998, 20016]
cls15m = [19204, 19283, 19839, 19892, 20000, 20192, 20157]

FG_Bottom = [np.nan, np.nan, np.nan, np.nan, np.nan, 19909, np.nan]
FG_Top = [np.nan, np.nan, np.nan, np.nan, np.nan, 19998, np.nan]

dflong = pd.DataFrame({'Open': opn15m, 'High': hgh15m, 'Low': low15m, 'Close': cls15m, 'FG_Bottom': FG_Bottom, 'FG_Top': FG_Top},
                  index=idx15m)

idx5m = ['2022-10-29 06:59:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:10:59.999', '2022-10-29 07:15:59.999',
         '2022-10-29 07:20:59.999', '2022-10-29 07:25:59.999', '2022-10-29 07:30:59.999']

opn5m = [19012, 19102, 19165, 19747, 19781, 20009, 20082]
hgh5m = [19132, 19423, 19817, 19875, 20014, 20433, 21068]
low5m = [18683, 19093, 19157, 19758, 19362, 19893, 20018]
cls5m = [19102, 19165, 19747, 19781, 20009, 20082, 20154]

price_end5m = [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]

dfshort = pd.DataFrame({'Open': opn5m, 'High': hgh5m, 'Low': low5m, 'Close': cls5m, 'price_end': price_end5m},
                  index=idx5m)

idx1m = ['2022-10-29 06:59:59.999', '2022-10-29 07:01:59.999', '2022-10-29 07:02:59.999', '2022-10-29 07:03:59.999',
         '2022-10-29 07:04:59.999', '2022-10-29 07:05:59.999', '2022-10-29 07:06:59.999']

opn1m = [19010, 19104, 19163, 19748, 19783, 20000, 20087]
hgh1m = [19130, 19420, 19811, 19878, 20011, 20434, 21065]
low1m = [18682, 19090, 19154, 19754, 19365, 19899, 20016]
cls1m = [19104, 19163, 19748, 19783, 20000, 20087, 20157]

price_end1m = [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]

dfshorter = pd.DataFrame({'Open': opn1m, 'High': hgh1m, 'Low': low1m, 'Close': cls1m, 'price_end': price_end1m},
                  index=idx1m)
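
The loops further down add a `timedelta` to each index value and compare timestamps, which needs a `DatetimeIndex` rather than the plain strings used above. A minimal sketch of that conversion, assuming the same timestamp format; `df_demo` is a stand-in frame made up for illustration, not one of the three frames in the question:

```python
import pandas as pd
from datetime import timedelta

# Stand-in frame with string timestamps, mirroring the layout above.
df_demo = pd.DataFrame(
    {"Close": [19204, 19283, 19839]},
    index=["2022-10-29 06:59:59.999", "2022-10-29 07:14:59.999", "2022-10-29 07:29:59.999"],
)

# Parse the strings once so the index holds real timestamps.
df_demo.index = pd.to_datetime(df_demo.index)

# Timestamp arithmetic now works the way the loops expect.
first_time = df_demo.index[0]
next_time = first_time + timedelta(minutes=15)
print(next_time == df_demo.index[1])
```

The same one-line conversion would apply to each of the three frames before looping.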

Given 3 DataFrames similar to the following one

**DataFrame example**

                             Open      High  ...          FG_Top          FG_Bottom
2022-10-29 06:59:59.999  20687.83  20700.46  ...             NaN                NaN
2022-10-29 07:14:59.999  20686.82  20695.74  ...             NaN                NaN
2022-10-29 07:29:59.999  20733.62  20745.30  ...        20733.62           20700.46
2022-10-29 07:44:59.999  20741.42  20762.75  ...             NaN                NaN
2022-10-29 07:59:59.999  20723.86  20777.00  ...             NaN                NaN
...                           ...       ...  ...             ...                ...
2022-11-10 02:14:59.999  16140.29  16167.09  ...             NaN                NaN
2022-11-10 02:29:59.999  16119.99  16195.19  ...             NaN                NaN
2022-11-10 02:44:59.999  16136.63  16263.15  ...             NaN                NaN
2022-11-10 02:59:59.999  16238.91  16238.91  ...             NaN                NaN
2022-11-10 03:14:59.999  16210.23  16499.00  ...             NaN                NaN

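Code explanation:

I loop over my first DataFrame with a first loop, then loop over it again with a second, nested loop. Inside, if statements check certain conditions on every iteration, and when those conditions are met I set some values in the first DataFrame to np.nan.
One of the conditions checked in the second loop calls a function that contains a third loop, which checks certain conditions in the other 2 DataFrames.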

from datetime import timedelta

# `minutes` is not defined in this snippet; presumably the candle length in minutes.

# First loop
for fg_candle_idx, row in enumerate(dflong.itertuples()):
    top = row.FG_Top
    bottom = row.FG_Bottom
    fg_candle_time = row.Index
    if pd.notnull(top):

        # Second loop
        for future_candle_idx, r in enumerate(dflong.itertuples()):
            future_candle_time = r.Index
            next_future_candle = future_candle_time + timedelta(minutes=minutes)
            future_candle_high = r.High
            future_candle_low = r.Low
            future_candle_close = r.Close
            future_candle_open = r.Open
            if future_candle_idx > fg_candle_idx:
                div = r.price_end

                # Check conditions, call function check_no_divs
                if (pd.isnull(check_no_divs(dfshort, future_candle_time, next_future_candle))
                        and pd.isnull(check_no_divs(dfshorter, future_candle_time, next_future_candle))
                        and pd.isnull(div)):

                    if future_candle_high < bottom:
                        continue

                    elif future_candle_low > top:
                        continue

                    elif future_candle_close < bottom and future_candle_open > top:
                        dflong.loc[fg_candle_time, 'FG_Bottom'] = np.nan
                        dflong.loc[fg_candle_time, 'FG_Top'] = np.nan
                        continue

                    # Many additional conditions checked...

The code below is the function check_no_divs

def check_no_divs(df, candle_time, next_candle):
    no_divs = []

    # Third loop
    for idx, row in enumerate(df.itertuples()):
        compare_candle_time = row.Index
        div = row.price_end
        if (compare_candle_time >= candle_time) & (compare_candle_time <= next_candle):
            if pd.notnull(div):
                no_divs.append(True)
            else:
                no_divs.append(False)

        elif compare_candle_time < candle_time:
            continue

        elif compare_candle_time > next_candle:
            break

    if not all(no_divs):
        return np.nan

    elif any(no_divs):
        return 1
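
Since the question asks about vectorization, here is a minimal sketch of the same check done with a label slice instead of a Python-level loop. It assumes the frame has a sorted `DatetimeIndex`; `check_no_divs_sliced` and the `demo` frame are hypothetical names introduced for illustration only.

```python
import numpy as np
import pandas as pd

def check_no_divs_sliced(df, candle_time, next_candle):
    """Return the same values as the loop-based check_no_divs.

    Assumes df has a sorted DatetimeIndex, so .loc slicing is inclusive
    on both ends, like the >= / <= comparisons in the loop version.
    """
    window = df.loc[candle_time:next_candle, "price_end"]
    if window.empty:
        return None      # the loop version also falls through without a return
    if not window.notna().all():
        return np.nan    # at least one candle in the window has no price_end
    return 1             # every candle in the window has a price_end

idx = pd.to_datetime(["2022-10-29 07:00", "2022-10-29 07:05", "2022-10-29 07:10"])
demo = pd.DataFrame({"price_end": [np.nan, 19500.0, 19600.0]}, index=idx)

all_set = check_no_divs_sliced(demo, idx[1], idx[2])
some_missing = check_no_divs_sliced(demo, idx[0], idx[2])
```

This removes the third loop but is still called once per candle pair, so on its own it is unlikely to match a fully compiled solution.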

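Ideal solution

Clearly, itertuples is too inefficient for this problem. I think there should be a faster solution using efficient vectorization or Numba.
Does anyone know how to do this?
P.S. I'm still fairly new to coding. I think my current code could be made more efficient while still using itertuples, but probably not efficient enough. If anyone knows how to greatly speed up this code, I'd really appreciate it.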

vlf7wbxs  #1

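I spent a lot of time researching and testing different code and came up with this solution using Numba, which gives a significant speedup.
First import the required libraries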

import numpy as np
import pandas as pd
from numba import njit, prange

Then define the function using Numba's njit decorator

@njit
def filled_fg(fg_top, fg_bottom, dflongindex, Open, High, Low, Close, dflongprice_end,
              dfshortprice_end, shortindex, dfshorterprice_end, shorterindex):
    n = len(fg_top)
    # First loop
    for i in range(n):
        top = fg_top[i]
        bottom = fg_bottom[i]
        if not np.isnan(top):  # NaN has to be tested with np.isnan, not `is` or `==`
            if (bottom - top) > 0:
                fg_top[i] = np.nan
                fg_bottom[i] = np.nan
                # Second loop: candles after i; stops at n - 1 because index j + 1 is read
                for j in range(i + 1, n - 1):
                    future_candle_time = dflongindex[j]
                    next_future_candle = dflongindex[j + 1]
                    future_candle_high = High[j]
                    future_candle_low = Low[j]
                    future_candle_close = Close[j]
                    future_candle_open = Open[j]
                    long_div = dflongprice_end[j]
                    # Check conditions
                    if (np.isnan(check_no_divs(dfshortprice_end, shortindex,
                                               future_candle_time, next_future_candle))
                            and np.isnan(check_no_divs(dfshorterprice_end, shorterindex,
                                                       future_candle_time, next_future_candle))
                            and np.isnan(long_div)):

                        if future_candle_high < bottom:
                            continue

                        elif future_candle_low > top:
                            continue

                        # Do something when conditions are met...
                        elif future_candle_close < bottom and future_candle_open > top:
                            fg_bottom[i] = np.nan
                            fg_top[i] = np.nan
                            continue

    # The arrays are modified in place; returning them matches the call further down
    return fg_bottom, fg_top
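
NaN never compares equal to anything, including itself, so NaN checks in a kernel like this one need `np.isnan` rather than `==` or `is`. A minimal sketch in plain NumPy (the `values` array is made up for illustration):

```python
import numpy as np

values = np.array([np.nan, 19998.0])

# Equality against NaN is always False, even for an element that is NaN.
eq_result = values[0] == np.nan

# np.isnan is the reliable test and also works element-wise.
nan_mask = np.isnan(values)

print(eq_result, nan_mask)
```

`np.isnan` is among the NumPy functions Numba supports in nopython mode, so the same test can be used inside an `@njit` function.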

Then define the second function, also using Numba's njit decorator

@njit
def check_no_divs(div_data, div_candle_time, first_future_candle, second_future_candle):
    for i in range(len(div_data)):
        candle_time = div_candle_time[i]
        if candle_time > second_future_candle:
            break  # past the window; relies on timestamps being sorted ascending
        if candle_time >= first_future_candle and not np.isnan(div_data[i]):
            return 1.0  # found a price_end value inside the window
    # No price_end value inside the window
    return np.nan
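
If only a yes/no answer per window is needed, the inner scan can also be done for all windows at once in plain NumPy, using a prefix sum over the non-NaN flags and `np.searchsorted` for the window bounds. This is a sketch of an alternative technique, not the method used in this answer; `any_div_in_windows` and the sample arrays are made up for illustration, and the timestamps are assumed to be sorted ascending.

```python
import numpy as np

def any_div_in_windows(div_times, div_values, starts, ends):
    """For each inclusive window [starts[k], ends[k]], report whether any
    non-NaN entry of div_values falls inside it. div_times must be sorted."""
    has_div = ~np.isnan(div_values)
    # prefix[i] = number of non-NaN entries among div_values[:i]
    prefix = np.concatenate(([0], np.cumsum(has_div)))
    lo = np.searchsorted(div_times, starts, side="left")   # first index with time >= start
    hi = np.searchsorted(div_times, ends, side="right")    # first index with time > end
    return (prefix[hi] - prefix[lo]) > 0

short_times = np.array(
    ["2022-10-29T07:00", "2022-10-29T07:05", "2022-10-29T07:10",
     "2022-10-29T07:15", "2022-10-29T07:20", "2022-10-29T07:25"],
    dtype="datetime64[m]",
)
short_price_end = np.array([np.nan, 19500.0, np.nan, np.nan, np.nan, np.nan])
long_times = np.array(
    ["2022-10-29T07:00", "2022-10-29T07:15", "2022-10-29T07:30"],
    dtype="datetime64[m]",
)

# One window per pair of consecutive long candles.
flags = any_div_in_windows(short_times, short_price_end, long_times[:-1], long_times[1:])
print(flags)
```

The design trades the per-window loop for two binary searches and a subtraction, so the cost per window no longer depends on how many short candles it contains.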

Before calling the function, the index of each DataFrame needs to be reset

dflong = dflong.reset_index()
dfshort = dfshort.reset_index()
dfshorter = dfshorter.reset_index()
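
A short sketch of what the reset does, assuming the index is an unnamed `DatetimeIndex`: it becomes an ordinary column called `index`, whose `.values` is a NumPy `datetime64` array. The `demo` frame is made up for illustration, and the integer conversion at the end is an optional extra rather than something this answer does.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2022-10-29 06:59:59.999", "2022-10-29 07:14:59.999"])
demo = pd.DataFrame({"FG_Top": [np.nan, 19998.0]}, index=idx)

demo = demo.reset_index()      # the unnamed index becomes a column called "index"
times = demo["index"].values   # NumPy datetime64 array

# Optional: plain integers (nanoseconds since the epoch) compare the same way.
times_ns = times.astype("datetime64[ns]").astype("int64")

print(list(demo.columns), times.dtype, times_ns)
```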

Now call the function, using .values to pass each DataFrame column as a NumPy array.

fg_bottom, fg_top = filled_fg(dflong['FG_Top'].values,
                              dflong['FG_Bottom'].values,
                              dflong['index'].values,
                              dflong['Open'].values,
                              dflong['High'].values,
                              dflong['Low'].values,
                              dflong['Close'].values,
                              dflong['price_end'].values,
                              dfshort['price_end'].values,
                              dfshort['index'].values,
                              dfshorter['price_end'].values,
                              dfshorter['index'].values)

Finally, the returned data needs to be written back to the original DataFrame dflong

dflong['FG_Bottom'] = fg_bottom
dflong['FG_Top'] = fg_top

Speed test results:
Original itertuples solution = 7.641393423080444 seconds
New Numba solution = 0.5985264778137207 seconds
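
The answer does not say how these numbers were measured. As a hedged sketch of one way to time such a comparison: an `@njit` function compiles on its first call, so a warm-up call keeps compilation time out of the measurement. `time_call` and `slow_sum` are hypothetical helpers, and `slow_sum` merely stands in for the function being timed.

```python
import time

def time_call(fn, *args, repeats=5):
    """Return the best wall-clock time of fn(*args) over several runs."""
    fn(*args)  # warm-up call; for an @njit function this triggers compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def slow_sum(n):
    # Stand-in workload: a plain Python loop.
    total = 0
    for k in range(n):
        total += k
    return total

elapsed = time_call(slow_sum, 100_000)
print(f"best of 5: {elapsed:.6f} s")
```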
