An efficient, performant way to compute percentage-change columns in pandas

Posted by polhcujo on 2022-12-28 in Other

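I am using yfinance through the pandas data reader to download several years of data for multiple symbols, and I am trying to compute 'MTDChg' and 'YTDChg', which I believe is one of the slowest parts at runtime.
Below is the code snippet. I have reservations about how I pick the end of the previous period (that is, whether that date is actually available in the index). The data is a DataFrame with multiple columns.
I am curious whether there is a better way to solve this. Using X11M1N1X looks attractive, but I am worried that I would not be able to use the actual reference start or end of a period, whose data may or may not be present in the index. I am also considering applymap, but I am not sure how to write the function for it, or whether it would perform any better.
Any ideas or suggestions on how to go about this?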

    import yfinance as yf
    import datetime as dt
    from pandas_datareader import data as pdr
    yf.pdr_override()
 
    y_symbols = ['GOOG', 'MSFT', 'TSLA']
    price_feed = pdr.get_data_yahoo(y_symbols, 
                                      start = dt.datetime(2020,1,1),
                                      end = dt.datetime(2022,12,1),
                                      interval = "1d")

    for dt in price_feed.index:
        dt_str = dt.strftime("%Y-%m-%d")
        current_month_str = f"{dt.year}-{dt.month}"
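        # In January, roll back to December of the prior year (dt.month - 1 would give month 0).
        previous_month_str = f"{dt.year - 1}-12" if dt.month == 1 else f"{dt.year}-{dt.month - 1:02d}"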
        current_year_str = f"{dt.year}"
        previous_year_str = f"{dt.year - 1}"
        
        
        if previous_month_str in price_feed.index:
            previous_month_last_day = price_feed.loc[previous_month_str].index[-1].strftime("%Y-%m-%d")
        else:
            previous_month_last_day = price_feed.loc[current_month_str].index[0].strftime("%Y-%m-%d")
            
            
        if previous_year_str in price_feed.index:
            previous_year_last_day = price_feed.loc[previous_year_str].index[-1].strftime("%Y-%m-%d")
        else:
            previous_year_last_day = price_feed.loc[current_year_str].index[0].strftime("%Y-%m-%d")
            
            
        if dt.month == 1 or dt.month == 2 or dt.month == 3:
            previous_qtr_str = f"{dt.year - 1}-12"
            current_qtr_str  = f"{dt.year}-01"
        elif dt.month == 4 or dt.month == 5 or dt.month == 6:
            previous_qtr_str = f"{dt.year}-03"
            current_qtr_str  = f"{dt.year}-04"
        elif dt.month == 7 or dt.month == 8 or dt.month == 9:
            previous_qtr_str = f"{dt.year}-06"
            current_qtr_str  = f"{dt.year}-07"
        elif dt.month == 10 or dt.month == 11 or dt.month == 12:
            previous_qtr_str = f"{dt.year}-09"
            current_qtr_str  = f"{dt.year}-10"
        else:
            previous_qtr_str = f"{dt.year}-09"
            current_qtr_str  = f"{dt.year}-10"
                    
        if previous_qtr_str in price_feed.index:
            #print("Previous quarter string is present in price feed for ", dt_str)
            previous_qtr_last_day = price_feed.loc[previous_qtr_str].index[-1].strftime("%Y-%m-%d")
            #print("Last quarter last day is", previous_qtr_last_day)
        elif current_qtr_str in price_feed.index:
            previous_qtr_last_day = price_feed.loc[current_qtr_str].index[0].strftime("%Y-%m-%d")
            #print("Previous quarter is not present in price feed")
            #print("Last quarter last day is", previous_qtr_last_day)
        else:
            previous_qtr_last_day = price_feed.loc[current_month_str].index[0].strftime("%Y-%m-%d")
            #print("Previous quarter string is NOT present in price feed")
            #print("Last quarter last day is", previous_qtr_last_day)
            
        #print(dt.day, current_month_str, previous_month_last_day)
        for symbol in y_symbols:
            #print(symbol, dt.day, previous_month_last_day, "<--->", pivot_calculations.loc[dt, ('Close', symbol)],  pivot_calculations.loc[previous_month_last_day, ('Close', symbol)])
            mtd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_month_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_month_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('MTDChg', symbol)] = round(mtd_perf, 2)
            # calculate the qtd performance values
            qtd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_qtr_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_qtr_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('QTDChg', symbol)] = round(qtd_perf, 2)
            ytd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_year_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_year_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('YTDChg', symbol)] = round(ytd_perf, 2)

Answer #1 (dgenwo3n)

IIUC, you are after period-to-date percentage changes, for example "month-to-date". The approach below computes all three period types in 11.6 ms.

First: definitions

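The formal definition of "month to date" is:
A period starting at the beginning of the current calendar month and ending at the current date. Month-to-date is used in many contexts, mainly for recording the results of an activity in the time between a date (exclusive, since this day may not yet be "complete") and the beginning of the current month.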

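More commonly, however, and consistent with the calculation you want, it means "from the close of the previous period to the close of the current date (inclusive)". For example, the (usual, informal) MTD change for '2020-01-07' would be the change between close('2019-12-31') and close('2020-01-07').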
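
To make the informal definition concrete, here is a minimal sketch on made-up closing prices (the numbers and the helper name `mtd_change` are hypothetical, not part of the question's data). It looks up the last close strictly before the current month and compares it with the close on the requested day.

```python
import pandas as pd

# Made-up closing prices for illustration only (not real market data).
close = pd.Series(
    [100.0, 102.0, 105.0],
    index=pd.to_datetime(["2019-12-31", "2020-01-02", "2020-01-07"]),
    name="Close",
)


def mtd_change(close: pd.Series, day: str) -> float:
    """Percent change from the last close before `day`'s month to the close on `day`."""
    ts = pd.Timestamp(day)
    month_start = ts.to_period("M").start_time
    # Basis: last available close strictly before the first day of the month.
    basis = close.loc[close.index < month_start].iloc[-1]
    return 100 * (close.loc[ts] - basis) / basis


print(mtd_change(close, "2020-01-07"))  # 5.0, i.e. 100 * (105 - 100) / 100
```

This per-day lookup is only meant to pin down the definition; it is not the fast path.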

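You introduce a twist that I find a bit dangerous: if the basis (the last day of the previous period) is not in the data, you use the first day of the current period instead (I assume you would prefer Open rather than Close as the basis for that initial period). I think it is safer, and more correct, to start the data fetch a few days earlier and to discard those extra days after the calculation (see the addendum below).
In any case, here is a way to get what you asked for. We first compute the basis for the desired period. For example, for month-to-date, the basis would be: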

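import pandas as pd
# Previous close (first row falls back to Open); first value per month, broadcast to all its rows.
basis = price_feed['Close'].shift().fillna(
    price_feed['Open']).groupby(pd.Grouper(freq='M')).transform('first')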

To check this basis:

>>> basis.loc['2020-01-30':'2020-02-04']
                 GOOG        MSFT       TSLA
Date                                        
2020-01-30  67.077499  158.779999  28.299999
2020-01-31  67.077499  158.779999  28.299999
2020-02-03  71.711502  170.229996  43.371334
2020-02-04  71.711502  170.229996  43.371334

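Note that every day of a month carries the previous month's closing price (when available). For the first month, where the previous month is not available, we use that month's opening price.
Now the percentage change is simple: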

>>> 100 * (price_feed['Close'] - basis) / basis
                GOOG      MSFT       TSLA
Date                                     
2020-01-02  1.924640  1.158834   1.356893
2020-01-03  1.424468 -0.100771   4.360428
2020-01-06  3.925315  0.157451   6.369850
...              ...       ...        ...
2022-11-28  1.679692  4.148533 -19.609737
2022-11-29  0.824000  3.532502 -20.528256
2022-11-30  7.173033  9.912546 -14.432626
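
The same steps can be checked by hand on a tiny, made-up single-symbol frame. This is a sketch under assumptions: the prices are invented, and it groups on `index.to_period("M")` instead of `pd.Grouper(freq='M')`. That substitution is mine; it gives the same calendar-month groups and does not depend on the `'M'` offset alias, which pandas 2.2 and later deprecate in favor of `'ME'`.

```python
import pandas as pd

# Made-up prices spanning a month boundary (not real quotes).
idx = pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-03", "2020-02-04"])
prices = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0, 13.0], "Close": [11.0, 12.0, 15.0, 18.0]},
    index=idx,
)

# Previous day's close; the very first row has none, so it falls back to its Open.
prev_close = prices["Close"].shift().fillna(prices["Open"])  # 10, 11, 12, 15

# First value of each calendar month, broadcast to every row of that month.
basis = prev_close.groupby(prices.index.to_period("M")).transform("first")  # 10, 10, 12, 12

mtd = 100 * (prices["Close"] - basis) / basis
print(mtd.tolist())  # [10.0, 20.0, 25.0, 50.0]
```

The February rows use the January 31 close (12.0) as basis, while the January rows fall back to the first Open (10.0), which mirrors the pattern in the real output above.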

Putting it all together for every period of interest:

gb = price_feed['Close'].shift().fillna(price_feed['Open']).groupby
out = {
    name: 100 * (price_feed['Close'] - basis) / basis
    for name, freq in [
        ('MTD', 'M'),
        ('QTD', 'Q'),
        ('YTD', 'Y')
    ]
    for basis in [gb(pd.Grouper(freq=freq)).transform('first')]
}

>>> out['YTD']
                 GOOG       MSFT       TSLA
Date                                       
2020-01-02   1.924640   1.158834   1.356893
2020-01-03   1.424468  -0.100771   4.360428
2020-01-06   3.925315   0.157451   6.369850
...               ...        ...        ...
2022-11-28 -33.473645 -28.116083 -48.072448
2022-11-29 -34.033502 -28.541271 -48.665759
2022-11-30 -29.879496 -24.137728 -44.728328
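
If you want the results back in the question's column layout, i.e. `('MTDChg', symbol)` pairs, one option is to concatenate the dict along the columns. The sketch below runs on synthetic random-walk prices with hypothetical tickers `AAA` and `BBB`, and again groups on `to_period` rather than `pd.Grouper`; treat it as an illustration, not as the exact code used for the numbers above.

```python
import numpy as np
import pandas as pd

# Synthetic data: business days in Q1 2020, two hypothetical tickers.
idx = pd.bdate_range("2020-01-01", "2020-03-31")
symbols = ["AAA", "BBB"]
rng = np.random.default_rng(0)
close = pd.DataFrame(
    100 + rng.normal(0, 1, (len(idx), len(symbols))).cumsum(axis=0),
    index=idx,
    columns=symbols,
)
open_ = close.shift().fillna(100.0)  # made-up opens, used only for the first row

prev_close = close.shift().fillna(open_)
out = {}
for name, freq in [("MTDChg", "M"), ("QTDChg", "Q"), ("YTDChg", "Y")]:
    basis = prev_close.groupby(close.index.to_period(freq)).transform("first")
    out[name] = (100 * (close - basis) / basis).round(2)

# The dict keys become the outer column level: ('MTDChg', 'AAA'), ('MTDChg', 'BBB'), ...
result = pd.concat(out, axis=1)
print(result.tail())
```

Because the synthetic data covers a single quarter of a single year, the QTD and YTD blocks are identical here, which is a quick sanity check on the grouping.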

Addendum: a safer approach (load a few extra days before the start)

As mentioned above, it is safer (and more correct) to load a few extra days up front and then truncate:

y_symbols = ['GOOG', 'MSFT', 'TSLA']
s, e = pd.Timestamp('2020-01-01'), pd.Timestamp('2022-12-01')
price_feed = pdr.get_data_yahoo(y_symbols, start=s - pd.Timedelta('7 days'), end=e, interval='1d')

def pctchg(price_feed, s, periods=(('MTD', 'M'), ('QTD', 'Q'), ('YTD', 'Y'))):
    gb = price_feed['Close'].shift().truncate(before=s).groupby
    return {
        name: 100 * (price_feed['Close'] - basis).dropna() / basis
        for name, freq in periods
        for basis in [gb(pd.Grouper(freq=freq)).transform('first')]
    }

>>> pctchg(price_feed, s)['YTD']
                 GOOG       MSFT       TSLA
Date                                       
2020-01-02   2.269976   1.851616   2.851818
2020-01-03   1.768110   0.583385   5.899652
2020-01-06   4.277430   0.843375   7.938711
...               ...        ...        ...
2022-11-28 -33.473645 -28.116083 -48.072448
2022-11-29 -34.033502 -28.541271 -48.665759
2022-11-30 -29.879496 -24.137728 -44.728328
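
Why the order "shift, then truncate" matters can be seen on a four-row example. This is a sketch with invented prices for one symbol; the first two rows stand in for the extra days loaded before the start date.

```python
import pandas as pd

# Made-up closes; the 2019 rows play the role of the padding days.
close = pd.Series(
    [50.0, 40.0, 44.0, 46.0],
    index=pd.to_datetime(["2019-12-30", "2019-12-31", "2020-01-02", "2020-01-03"]),
)
start = pd.Timestamp("2020-01-01")

# Shift first, so 2020-01-02 sees the 2019-12-31 close, then drop the padding rows.
prev_close = close.shift().truncate(before=start)
basis = prev_close.groupby(prev_close.index.to_period("Y")).transform("first")

# Index alignment leaves NaN on the padding rows; dropna() removes them.
ytd = (100 * (close - basis) / basis).dropna()
print(ytd.tolist())  # [10.0, 15.0]
```

Both 2020 rows are measured against the true prior-year close (40.0), with no fallback to Open needed.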

Addendum 2: performance

All the calculations are vectorized, so we expect this to be fast. Let's check (using the "safer" version above):

%timeit pctchg(price_feed, s)
# 11.6 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
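
`%timeit` is an IPython magic. Outside a notebook, a rough equivalent can be put together with the standard-library `timeit` module. The sketch below times a simplified variant on synthetic data of similar shape (three hypothetical tickers, business days from 2020 to late 2022); the function name `pct_to_date` is mine, and the timing you get will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic geometric random walk, roughly the size of the question's data.
idx = pd.bdate_range("2020-01-01", "2022-11-30")
rng = np.random.default_rng(0)
close = pd.DataFrame(
    100 * np.exp(rng.normal(0, 0.01, (len(idx), 3)).cumsum(axis=0)),
    index=idx,
    columns=["AAA", "BBB", "CCC"],
)


def pct_to_date(close, freqs=(("MTD", "M"), ("QTD", "Q"), ("YTD", "Y"))):
    # The first row has no previous close (NaN). 'first' skips NaN, so the
    # very first period falls back to the first available close as its basis.
    prev_close = close.shift()
    out = {}
    for name, freq in freqs:
        basis = prev_close.groupby(close.index.to_period(freq)).transform("first")
        out[name] = 100 * (close - basis) / basis
    return out


runs = 20
seconds = timeit.timeit(lambda: pct_to_date(close), number=runs)
print(f"{1000 * seconds / runs:.2f} ms per call")
```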
