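Happy New Year.
I am looking for a way to compute the correlation between a rolling window and a fixed window (a 'patch') with pandas. The ultimate goal is pattern matching.
From what I can see in the docs (and I hope I am missing something), neither corr() nor corrwith() lets you hold one of the Series/DataFrames fixed.
At the moment, the best solution I can come up with, poor as it is, is listed below. When run on 50K rows with a 30-sample patch, the processing time enters Ctrl+C range.
I would greatly appreciate your suggestions and alternatives. Thank you.
Please run the code below and it will be clear what I am trying to do: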
import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame
# Create test DataFrame df and a patch to be found.
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)
n = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=n, freq='5min')
patch = DataFrame(np.arange(n), columns=['a'], index=rng)
print()
print(' ***Start corr example***')
# To avoid the automatic alignment between df and patch,
# I need to reset the index.
patch.reset_index(inplace=True, drop=True)
# Cannot do:
# df.reset_index(inplace=True, drop=True)
df['corr'] = np.nan
for i in range(df.shape[0]):
    window = df.iloc[i : i + patch.shape[0]]
    # If the slice has only two rows, I have a line between two points.
    # When I corr with two points in patch, I start getting
    # misleading values like 1 or -1.
    if window.shape[0] != patch.shape[0]:
        break
    else:
        # I need to reset_index for the window,
        # which is less efficient than doing it outside the
        # for loop, where the patch has its reset_index done.
        # If I did the df.reset_index up there,
        # I would still have automatic realignment, but
        # by index.
        window = window.reset_index(drop=True)
        # On top of the obvious inefficiency
        # of this method, I cannot just corrwith()
        # between specific columns in the dataframe;
        # corrwith() runs for all.
        # Alternatively I could create a new DataFrame
        # only with the needed columns:
        # df_col = DataFrame(df.a)
        # patch_col = DataFrame(patch.a)
        # Alternatively I could join the patch to
        # the df and shift it.
        corr = window.corrwith(patch)
        print()
        print('===========================')
        print('window:')
        print(window)
        print('---------------------------')
        print('patch:')
        print(patch)
        print('---------------------------')
        print('Corr for this window')
        print(corr)
        print('============================')
        # Write by position to avoid chained assignment.
        df.iloc[i, df.columns.get_loc('corr')] = corr['a']
print()
print(' ***End corr example***')
print(" Please inspect var 'df'")
print()
2 Answers
col17t5w #1
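Clearly, the heavy use of reset_index is a signal that we are fighting against pandas' indexing and automatic alignment. Oh, how much easier things would be if we could just forget about the index! In fact, that is what NumPy is for. (Generally speaking, use pandas when you need alignment or grouping by index, and use NumPy when doing computations on N-dimensional arrays.)
Using NumPy will also make the computation much faster, because we can remove the for-loop and handle all the calculations done inside it as a single calculation on a NumPy array of rolling windows.
We can look inside pandas/core/frame.py at DataFrame.corrwith to see how the calculation is done, then translate it into the corresponding code on NumPy arrays, adjusting as needed so that the calculation runs over a whole array full of rolling windows while patch stays constant. (Note: the pandas corrwith method handles NaNs. To keep the code simpler, I have assumed there are no NaNs in the inputs.)
A comparison confirms that orig and using_numpy generate the same values.
Technical note: To create the array full of rolling windows in a memory-friendly way, I used a striding trick I learned here.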
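For readers who want something concrete, the approach described above might be sketched roughly as follows. This is not the answer's original using_numpy code: the function name rolling_corr_with_patch is hypothetical, and it uses numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later) in place of the striding trick mentioned above. Like the answer, it assumes there are no NaNs in the inputs.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_corr_with_patch(values, patch):
    """Pearson correlation of every length-m window of `values` with `patch`.

    Hypothetical helper; assumes 1-D inputs with no NaNs.
    Returns an array of length len(values) - len(patch) + 1.
    """
    values = np.asarray(values, dtype=float)
    patch = np.asarray(patch, dtype=float)
    m = patch.size

    # View of shape (n - m + 1, m); no window data is copied here.
    windows = sliding_window_view(values, m)

    # Center each window and the patch around their own means.
    w_centered = windows - windows.mean(axis=1, keepdims=True)
    p_centered = patch - patch.mean()

    # Covariance term and the product of the two norms, one entry per window.
    numerator = w_centered @ p_centered
    denominator = np.sqrt((w_centered ** 2).sum(axis=1) * (p_centered ** 2).sum())

    # A constant window (zero variance) gives 0 / 0, which is NaN.
    return numerator / denominator


rng = np.random.default_rng(0)
values = rng.random(10)
patch = np.arange(4)
print(rolling_corr_with_patch(values, patch))
```

Each result corresponds to the window starting at that position, so the array can be written back into the first len(values) - len(patch) + 1 rows of a 'corr' column.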
Here is a benchmark using n, m = 1000, 4 (lots of rows and a small patch, to generate lots of windows): a 2600x speedup.
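uoifb46i #2
There is a bug here, because len(df) - len(patch) does not equal len(correl).
len(df) = 10, len(patch) = 4
So technically we should have 6 correlation values, but len(correl) = 7.
I am not sure where this problem comes from.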
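As a quick check of the counting, assuming correl holds one value per full window: a series of length n contains n - m + 1 windows of length m, because the window can start at positions 0 through n - m inclusive. With n = 10 and m = 4 that gives 7, which would make len(correl) = 7 the expected result rather than a bug. A minimal sketch (count_full_windows is a hypothetical helper, not part of either answer):

```python
def count_full_windows(n, m):
    """Count start positions i for which the slice [i, i + m) fits within n rows."""
    return sum(1 for i in range(n) if i + m <= n)


print(count_full_windows(10, 4))  # 7
print(10 - 4 + 1)                 # 7
```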