pandas 面板数据-创建提前和滞后变量时处理缺失年份

sycxhyv7  于 2022-12-16  发布在  其他
关注(0)|答案(2)|浏览(325)

我使用面板数据。通常,我的面板数据是不平衡的,即,有一些缺失的年份。面板数据的一般外观如下:

df = pd.DataFrame({'name': ['a']*4+['b']*3+['c']*4,
                   'year':[2001,2002,2004,2005]+[2000,2002,2003]+[2001,2002,2003,2005],
                   'val1':[1,2,3,4,5,6,7,8,9,10,11],
                   'val2':[2,5,7,11,13,17,19,23,29,31,37]})

   name  year  val1  val2
0     a  2001     1     2
1     a  2002     2     5
2     a  2004     3     7
3     a  2005     4    11
4     b  2000     5    13
5     b  2002     6    17
6     b  2003     7    19
7     c  2001     8    23
8     c  2002     9    29
9     c  2003    10    31
10    c  2005    11    37

现在我想创建leadlag变量,它们是groupbyname

df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)

这只是向上/向下移动一行,这不是我想要的。我想相对于year移入。我的预期输出:

name  year  val1  val2  val1_lag  val1_lead
0     a  2001     1     2       NaN        2.0
1     a  2002     2     5       1.0        NaN
2     a  2004     3     7       NaN        4.0
3     a  2005     4    11       3.0        NaN
4     b  2000     5    13       NaN        NaN
5     b  2002     6    17       NaN        7.0
6     b  2003     7    19       6.0        NaN
7     c  2001     8    23       NaN        9.0
8     c  2002     9    29       8.0       10.0
9     c  2003    10    31       9.0        NaN
10    c  2005    11    37       NaN        NaN

我当前的变通解决方案是通过以下方式填充missing year:

df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()

然后使用正常的shift。但是,因为我的数据大小相当大。使用这个经常x3的数据大小,这是不是很有效率在这里。
我正在寻找一种更好的方法来应对这种情况

d5vmydt9

d5vmydt91#

解决方案是,如果该年是连续的,则创建一个check列,将check列设置为1.0,然后将np.NaN乘以正常的groupby

df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None

df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None

要创建滞后提前量变量:

%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']

你可以用上面的merge方法检查一下,效率会高很多

%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
qco9c6ql

qco9c6ql2#

不要使用shift,而是使用年份为± 1的merge

df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']

输出:

name  year  val1  val2  val1_lag  val1_lead
0     a  2001     1     2       NaN        2.0
1     a  2002     2     5       1.0        NaN
2     a  2004     3     7       NaN        4.0
3     a  2005     4    11       3.0        NaN
4     b  2000     5    13       NaN        NaN
5     b  2002     6    17       NaN        7.0
6     b  2003     7    19       6.0        NaN
7     c  2001     8    23       NaN        9.0
8     c  2002     9    29       8.0       10.0
9     c  2003    10    31       9.0        NaN
10    c  2005    11    37       NaN        NaN

相关问题