Pandas interpolation with groupby on date to fill missing values

roejwanj  posted on 2023-06-20  in Other

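I would like someone to explain how the interpolate method behaves together with groupby. The problem is functionally solved, but the solution is quite slow and I still don't understand why the two approaches below behave differently.
A snapshot of the data: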

print(df)
   archiveDatefut     vol
0       6/16/2014  1.4950
1       6/16/2014  1.1775
2       6/16/2014  1.0275
3       6/16/2014     NaN
4       6/16/2014     NaN
5       6/16/2014     NaN
6       6/16/2014  1.0125
7       6/16/2014  0.7050
8       6/16/2014  0.5550
9       6/16/2014  0.4938
10      6/16/2014     NaN
11      6/16/2014     NaN
12      6/16/2014  0.4700
13      6/16/2014  0.3919
14      6/16/2014  0.3694
15      6/16/2014  0.3488
16      6/16/2014     NaN
17      6/16/2014     NaN
18      6/16/2014     NaN
19      6/17/2014  1.4950
20      6/17/2014  1.1800
21      6/17/2014  1.0300
22      6/17/2014     NaN
23      6/17/2014     NaN
24      6/17/2014     NaN
25      6/17/2014  1.0138
26      6/17/2014  0.7050
27      6/17/2014  0.5550
28      6/17/2014  0.4938
29      6/17/2014     NaN

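So I have the data above, and I want to group by archiveDatefut and fill the missing values in the vol column by interpolating within each day. I tried the following.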

import pandas as pd

# Use the parsed dates as the index
df.index = pd.to_datetime(df.archiveDatefut)

# Method 1 (I don't expect or want this result)
df['vol1'] = df['vol'].groupby(df.index.date).apply(lambda x: x).interpolate(method='linear').reset_index()['vol'].values

# Method 2 (this is what I want): interpolate each day separately
vols = []

for _, day in df.groupby(df.index.date):
    volday = day.vol.interpolate(method='linear')
    vols.append(volday)

# Stitch the per-day results back together in row order
df['vol2'] = pd.concat(vols).values

print(df)
               archiveDatefut     vol      vol1      vol2
archiveDatefut                                           
2014-06-16          6/16/2014  1.4950  1.495000  1.495000
2014-06-16          6/16/2014  1.1775  1.177500  1.177500
2014-06-16          6/16/2014  1.0275  1.027500  1.027500
2014-06-16          6/16/2014     NaN  1.023750  1.023750
2014-06-16          6/16/2014     NaN  1.020000  1.020000
2014-06-16          6/16/2014     NaN  1.016250  1.016250
2014-06-16          6/16/2014  1.0125  1.012500  1.012500
2014-06-16          6/16/2014  0.7050  0.705000  0.705000
2014-06-16          6/16/2014  0.5550  0.555000  0.555000
2014-06-16          6/16/2014  0.4938  0.493800  0.493800
2014-06-16          6/16/2014     NaN  0.485867  0.485867
2014-06-16          6/16/2014     NaN  0.477933  0.477933
2014-06-16          6/16/2014  0.4700  0.470000  0.470000
2014-06-16          6/16/2014  0.3919  0.391900  0.391900
2014-06-16          6/16/2014  0.3694  0.369400  0.369400
2014-06-16          6/16/2014  0.3488  0.348800  0.348800
2014-06-16          6/16/2014     NaN  0.635350  0.348800
2014-06-16          6/16/2014     NaN  0.921900  0.348800
2014-06-16          6/16/2014     NaN  1.208450  0.348800
2014-06-17          6/17/2014  1.4950  1.495000  1.495000
2014-06-17          6/17/2014  1.1800  1.180000  1.180000
2014-06-17          6/17/2014  1.0300  1.030000  1.030000
2014-06-17          6/17/2014     NaN  1.025950  1.025950
2014-06-17          6/17/2014     NaN  1.021900  1.021900
2014-06-17          6/17/2014     NaN  1.017850  1.017850
2014-06-17          6/17/2014  1.0138  1.013800  1.013800
2014-06-17          6/17/2014  0.7050  0.705000  0.705000
2014-06-17          6/17/2014  0.5550  0.555000  0.555000
2014-06-17          6/17/2014  0.4938  0.493800  0.493800
2014-06-17          6/17/2014     NaN  0.493800  0.493800

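At first glance the two results look the same. However, they differ sharply from what I expect when the missing values have no later value to interpolate towards, that is, at the end of a day. Note the following rows: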

2014-06-16          6/16/2014     NaN  0.635350  0.348800
2014-06-16          6/16/2014     NaN  0.921900  0.348800
2014-06-16          6/16/2014     NaN  1.208450  0.348800

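I can't figure out what the difference between the two methods is or why this happens. The deviation from the expected values shows up consistently whenever a day ends with missing values, so the rows just before the start of a new day get this unexpected behavior.
Method 2 gives me exactly what I want, but it is prohibitively slow.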

qvk1mo1f  #1

You can try using the transform function instead of the for loop:

df['vol2'] = df.groupby(df.index.date)['vol'].transform(lambda x: x.interpolate())
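
To see why the two methods differ, here is a minimal sketch on made-up data. The values and the column names `across_days` and `per_day` are mine, not from the question, and I group on the `archiveDatefut` column directly rather than on `df.index.date` to keep it short. As far as I can tell, `.apply(lambda x: x)` in method 1 just hands the values back in their original order, so the `.interpolate()` chained after it runs over the whole column and bridges the gap between the last value of one day and the first value of the next. Interpolating inside `transform` keeps each day separate:

```python
import numpy as np
import pandas as pd

# Hypothetical two-day sample; the first day ends with a trailing NaN.
df = pd.DataFrame({
    "archiveDatefut": ["6/16/2014"] * 4 + ["6/17/2014"] * 3,
    "vol": [1.0, np.nan, 2.0, np.nan, 4.0, np.nan, 6.0],
})

# Whole-column interpolation: the trailing NaN of 6/16 is interpolated
# towards the first value of 6/17 (this mirrors what method 1 produces).
df["across_days"] = df["vol"].interpolate(method="linear")
# across_days: 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0

# Per-day interpolation: each group is interpolated on its own, so the
# trailing NaN of 6/16 repeats the last valid value of that day.
df["per_day"] = (
    df.groupby("archiveDatefut")["vol"]
      .transform(lambda s: s.interpolate(method="linear"))
)
# per_day: 1.0, 1.5, 2.0, 2.0, 4.0, 5.0, 6.0

# The explicit loop from method 2, for comparison with transform.
loop_result = pd.concat(
    [day["vol"].interpolate(method="linear")
     for _, day in df.groupby("archiveDatefut")]
)

print(df)
print(np.allclose(loop_result.values, df["per_day"].values))
```

The repeated value at the end of a day comes from the defaults of `interpolate`: with `method='linear'` it fills forward, so trailing NaNs take the last valid value, while NaNs at the very start of a group would stay NaN.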
