How can pandas use np.array to speed up this calculation instead of a loop (given some preconditions)?

omvjsjqw · posted 2023-03-06 · in: Other
Follow (0) | Answers (2) | Views (115)

I need to run the code below; is it possible to use np.array to perform exactly the same calculation and get the same result faster?

data['daily_change'] = data.groupby('title',group_keys=False)['return'].pct_change()
for title in data['title'].unique(): # Iterate through each title
     
    temp_df = data[data['title'] == title].tail(252) # Select the data for a specific title
    if len(temp_df) < 252:
        print(f"{title} has less than 1 year of data, ignore\n")
        continue

    sections = [temp_df.iloc[i:i+63] for i in range(0, 252, 63)] # Divide the data into 4 sections

    if method1:
        result = sum([(section['return'].iloc[-1] / section['return'].iloc[0]) * weight for section, weight in zip(sections, [0.2]*3 + [0.4])]) # Calculate the weighted return
    else:
        # Calculate the weighted return using the daily changes
        result = sum([(1 + section['daily_change']).prod() * weight for section, weight in zip(sections, [0.2]*3 + [0.4])]) - 1

    df_new = pd.concat([df_new, pd.DataFrame({'title': [title], 'result': [result]})], ignore_index=True)

Additional information:

Sample data is here: https://www.dropbox.com/s/ehawttyt2rhrkx5/sample.csv?dl=0

Expected results for method1:

A: 1.00105
B: 1.03288
C: 1.13492
D: 0.966295
E: 1.06095
F: 1.02021

Expected results for the else branch:

A: 0.00526707    
B: 0.0433293
C: 0.14446
D: -0.0129632
E: 0.0601407
F: 0.0263727

A short description of what I want to do:
1. Calculate the daily change of the return for each title separately.
2. Take only the most recent 252 data points for each title.
3. Divide each title's data points into four sections.
4. Run the method1 and the else calculation for each title.

Method 1 takes the ratio between the last and the first data point in each section, multiplies it by the respective weight, and sums the results. The else branch takes the product of the daily changes in each section, multiplies it by the respective weight, and adds everything up.
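On made-up numbers, the two weighting schemes described above can be sketched as follows (the price series, `rng` seed, and variable names are all synthetic, purely for illustration):

```python
import numpy as np

# Synthetic example: 252 fake data points, split into 4 sections of 63
rng = np.random.default_rng(0)
prices = np.cumprod(1 + rng.normal(0, 0.01, 252))
sections = prices.reshape(4, 63)           # four sections of 63 points each
weights = np.array([0.2, 0.2, 0.2, 0.4])   # last section weighted highest

# Method 1: ratio of last to first point in each section, weighted sum
method1 = np.sum(sections[:, -1] / sections[:, 0] * weights)

# Else branch: compound the within-section daily changes, weighted sum, minus 1
ratios = sections[:, 1:] / sections[:, :-1]  # day-over-day ratios per section
method2 = np.sum(ratios.prod(axis=1) * weights) - 1

print(method1, method2)
```

Note that because the day-over-day ratios here are computed strictly within each section, their product telescopes to last/first, so `method2` equals `method1 - 1` in this sketch. In the real code, `pct_change` also produces a change across each section boundary, which is why the two branches give genuinely different numbers.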

Answer 1 (by of1yzvn4)

With a couple of assumptions (in particular about `method1` and `df_new`), your current code runs in...

%%timeit
# `method1` is not defined in the question, so set it to `True`
method1 = True
# `df_new` is not defined either, so make an
# empty DataFrame to replicate your code
df_new = pd.DataFrame()

data['daily_change'] = data.groupby('title',group_keys=False)['return'].pct_change()
for title in data['title'].unique(): # Iterate through each title
     
    temp_df = data[data['title'] == title].tail(252) # Select the data for a specific title
    if len(temp_df) < 252:
        print(f"{title} has less than 1 year of data, ignore\n")
        continue

    sections = [temp_df.iloc[i:i+63] for i in range(0, 252, 63)] # Divide the data into 4 sections

    if method1:
        result = sum([(section['return'].iloc[-1] / section['return'].iloc[0]) * weight for section, weight in zip(sections, [0.2]*3 + [0.4])]) # Calculate the weighted return
    else:
        # Calculate the weighted return using the daily changes
        result = sum([(1 + section['daily_change']).prod() * weight for section, weight in zip(sections, [0.2]*3 + [0.4])]) - 1

    df_new = pd.concat([df_new, pd.DataFrame({'title': [title], 'result': [result]})], ignore_index=True)
7.48 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Some (possibly over-optimized) improvements:

import numpy as np
import pandas as pd

# define a function that can be applied over groups
def func(data: pd.DataFrame, method1: bool = True) -> float:
    # take the last 252 rows
    df = data.tail(252)
    # return nothing if less than 252 rows
    if len(df) < 252:
        return
    # store `weights` as a (4,) array
    weights = np.array([*[0.2]*3, 0.4])
    # `method1` is assumed to be a boolean
    if method1:
        # split `df["return"]` into quarters
        sections = np.array(np.array_split(
            ary=df["return"].to_list(), indices_or_sections=4
        ))
        # shave off a few ms...
        result = np.multiply(
            np.divide(sections[:, -1], sections[:, 0]), weights
        ).sum()
    else:
        # split `df["daily_change"]` into quarters
        sections = np.array(np.array_split(
            ary=df["daily_change"].to_list(), indices_or_sections=4
        ))
        # shave off a few ms...
        result = np.multiply(
            np.add(sections, 1).prod(axis=1), weights
        ).sum() - 1
    
    return result
%%timeit
# `method1` is True
data.groupby("title").apply(func, method1=True)
3.68 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
# `method1` is False
data.groupby("title").apply(func, method1=False)
3.87 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With the given data, I was able to cut the runtime by roughly 50%.
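The main vectorization trick in the function above is `np.array_split` plus stacking; a minimal sketch of what it does (the array here is a stand-in for one title's 252 rows):

```python
import numpy as np

arr = np.arange(252)                  # stand-in for one title's 252 rows
parts = np.array_split(arr, 4)        # a list of four arrays
print([len(p) for p in parts])        # [63, 63, 63, 63]
stacked = np.array(parts)             # equal-length chunks stack into a 2-D array
print(stacked.shape)                  # (4, 63)
# column slicing then works across all sections at once:
print(stacked[:, 0], stacked[:, -1])  # first and last element of every section
```

Stacking is only safe here because 252 divides evenly by 4; `np.array_split` itself also accepts lengths that do not divide evenly, in which case the chunks have unequal lengths and cannot be stacked into a rectangular array.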

Notes

If a "title" group does not have enough rows, the function returns nothing (None), so the result can look like this:

dfc = data.sample(1_500, random_state=0)

# these numbers will not be correct due to sampling
print(dfc.groupby("title").apply(func))
title
A         NaN
B         NaN
C         NaN
D         NaN
E    1.282753
F    0.928689
dtype: float64
Answer 2 (by vngu2lb8)

import numpy as np
import pandas as pd

# `method1`, `df_new`, and the `daily_change` column are assumed
# to be defined as in the question
for title in data['title'].unique():  # Iterate through each title

    temp_df = data[data['title'] == title].tail(252)  # Select the data for a specific title
    if len(temp_df) < 252:
        print(f"{title} has less than 1 year of data, ignore\n")
        continue

    sections = [temp_df.iloc[i:i+63] for i in range(0, 252, 63)]  # Divide the data into 4 sections
    sections_returns = np.array([section['return'] for section in sections])
    sections_daily_change = np.array([section['daily_change'] for section in sections])
    weights = np.array([0.2]*3 + [0.4])

    if method1:
        result = np.sum((sections_returns[:, -1] / sections_returns[:, 0]) * weights)  # Calculate the weighted return
    else:
        # Calculate the weighted return using the daily changes
        result = np.sum((1 + sections_daily_change).prod(axis=1) * weights) - 1

    df_new = pd.concat([df_new, pd.DataFrame({'title': [title], 'result': [result]})], ignore_index=True)

The main changes are:
- Use np.array to store each section's return and daily-change data.
- Use np.sum to compute the weighted sum of the returns or daily changes.
- Use axis=1 in the prod method to compute the product of the daily changes along the row axis.

This should speed up computation on large datasets.
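The effect of `prod(axis=1)` can be seen on a tiny made-up array: each row (section) collapses to one compounded value, and the weighted sum then reduces the four rows to a single number. The daily changes below are invented toy values (3 days per section instead of 63):

```python
import numpy as np

# made-up daily changes: 4 sections x 3 days each
daily = np.array([
    [0.01, -0.02, 0.03],
    [0.00,  0.01, 0.01],
    [-0.01, 0.02, 0.00],
    [0.02,  0.02, -0.01],
])
weights = np.array([0.2, 0.2, 0.2, 0.4])

per_section = (1 + daily).prod(axis=1)  # shape (4,): one compounded value per row
result = np.sum(per_section * weights) - 1
print(per_section.shape, result)
```

Without `axis=1`, `prod` would multiply every element into one scalar; `axis=1` is what keeps the four sections separate so the per-section weights can be applied.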
