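I would like someone to explain how the interpolate method works together with groupby. My problem is functionally solved, but the solution is rather slow, and I still don't understand the method's behavior.
A snapshot of the data: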
print(df)
archiveDatefut vol
0 6/16/2014 1.4950
1 6/16/2014 1.1775
2 6/16/2014 1.0275
3 6/16/2014 NaN
4 6/16/2014 NaN
5 6/16/2014 NaN
6 6/16/2014 1.0125
7 6/16/2014 0.7050
8 6/16/2014 0.5550
9 6/16/2014 0.4938
10 6/16/2014 NaN
11 6/16/2014 NaN
12 6/16/2014 0.4700
13 6/16/2014 0.3919
14 6/16/2014 0.3694
15 6/16/2014 0.3488
16 6/16/2014 NaN
17 6/16/2014 NaN
18 6/16/2014 NaN
19 6/17/2014 1.4950
20 6/17/2014 1.1800
21 6/17/2014 1.0300
22 6/17/2014 NaN
23 6/17/2014 NaN
24 6/17/2014 NaN
25 6/17/2014 1.0138
26 6/17/2014 0.7050
27 6/17/2014 0.5550
28 6/17/2014 0.4938
29 6/17/2014 NaN
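So I have the data above, and I want to group by archiveDatefut and fill in the missing values of the vol column by interpolating within each day. I tried the following.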
import pandas as pd

# set index
df.index = pd.to_datetime(df.archiveDatefut)
# method 1 (I don't expect or want this version)
df['vol1'] = df['vol'].groupby(df.index.date).apply(lambda x: x).interpolate(method='linear').reset_index()['vol'].values
# method 2 (this is what I want)
vols = []
for _, day in df.groupby(df.index.date):
    # interpolate one day at a time
    volday = day.vol.interpolate(method='linear')
    vols.append(volday)
df['vol2'] = pd.concat(vols).values
print(df)
archiveDatefut vol vol1 vol2
archiveDatefut
2014-06-16 6/16/2014 1.4950 1.495000 1.495000
2014-06-16 6/16/2014 1.1775 1.177500 1.177500
2014-06-16 6/16/2014 1.0275 1.027500 1.027500
2014-06-16 6/16/2014 NaN 1.023750 1.023750
2014-06-16 6/16/2014 NaN 1.020000 1.020000
2014-06-16 6/16/2014 NaN 1.016250 1.016250
2014-06-16 6/16/2014 1.0125 1.012500 1.012500
2014-06-16 6/16/2014 0.7050 0.705000 0.705000
2014-06-16 6/16/2014 0.5550 0.555000 0.555000
2014-06-16 6/16/2014 0.4938 0.493800 0.493800
2014-06-16 6/16/2014 NaN 0.485867 0.485867
2014-06-16 6/16/2014 NaN 0.477933 0.477933
2014-06-16 6/16/2014 0.4700 0.470000 0.470000
2014-06-16 6/16/2014 0.3919 0.391900 0.391900
2014-06-16 6/16/2014 0.3694 0.369400 0.369400
2014-06-16 6/16/2014 0.3488 0.348800 0.348800
2014-06-16 6/16/2014 NaN 0.635350 0.348800
2014-06-16 6/16/2014 NaN 0.921900 0.348800
2014-06-16 6/16/2014 NaN 1.208450 0.348800
2014-06-17 6/17/2014 1.4950 1.495000 1.495000
2014-06-17 6/17/2014 1.1800 1.180000 1.180000
2014-06-17 6/17/2014 1.0300 1.030000 1.030000
2014-06-17 6/17/2014 NaN 1.025950 1.025950
2014-06-17 6/17/2014 NaN 1.021900 1.021900
2014-06-17 6/17/2014 NaN 1.017850 1.017850
2014-06-17 6/17/2014 1.0138 1.013800 1.013800
2014-06-17 6/17/2014 0.7050 0.705000 0.705000
2014-06-17 6/17/2014 0.5550 0.555000 0.555000
2014-06-17 6/17/2014 0.4938 0.493800 0.493800
2014-06-17 6/17/2014 NaN 0.493800 0.493800
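At first glance everything looks the same, but where a day ends with missing values, so that there is no later value within that day to interpolate toward, the results differ considerably from what I expect. Note the following: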
2014-06-16 6/16/2014 NaN 0.635350 0.348800
2014-06-16 6/16/2014 NaN 0.921900 0.348800
2014-06-16 6/16/2014 NaN 1.208450 0.348800
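I can't figure out what the difference between the two methods is, or why it occurs. The deviation from the expected values happens consistently whenever a day ends without a value, so the values just before the start of a new day show this unexpected behavior.
Method 2 gives me exactly what I want, but it is prohibitively slow.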
1 Answer
You can try using the transform function instead of the for loop.
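As a rough sketch of what that could look like, and of why method 1 behaves differently: in method 1, interpolate is chained after apply(lambda x: x), which hands back the groups unchanged as one ordinary Series, so the interpolation runs across the whole column and the trailing NaN values of one day are filled on a straight line toward the first value of the next day. Calling interpolate inside transform keeps it within each group, where trailing NaN values are filled with the last valid value of that day, as in method 2. The small DataFrame below is made-up stand-in data, not the question's data, and it groups by the archiveDatefut column directly instead of by a datetime index, which is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data (an assumption): two days, each ending in NaN.
df = pd.DataFrame({
    "archiveDatefut": ["6/16/2014"] * 5 + ["6/17/2014"] * 4,
    "vol": [1.0, np.nan, 3.0, np.nan, np.nan, 10.0, np.nan, 20.0, np.nan],
})

# Whole-column interpolation, which is what method 1 effectively does:
# the NaNs at the end of 6/16 are interpolated toward the first 6/17 value.
df["vol1"] = df["vol"].interpolate(method="linear")

# Per-group interpolation with transform: each day is handled on its own,
# and the result comes back aligned with the original rows.
df["vol2"] = df.groupby("archiveDatefut")["vol"].transform(
    lambda s: s.interpolate(method="linear")
)

print(df)
```

Here vol1 and vol2 agree everywhere except on the last two rows of 6/16, which is the same pattern as in the question's output. Whether transform is fast enough for the full data set would need to be measured.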