Pandas: flagging consecutive values

Posted by jyztefdp on 2023-03-06 in Other

I have a pandas Series of the form [0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1].

0: indicates economic increase.
1: indicates economic decline.

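A recession is signaled by two consecutive declines (1).
Two consecutive increases (0) signal the end of the recession.
In the dataset above there are two recessions: one starts at index 3 and ends at index 5, the other starts at index 8 and ends at index 11.
I am at a loss for how to approach this with pandas. I would like to identify the indices at which each recession starts and ends. Any help would be appreciated.
Here is my Python attempt at a solution.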

import numpy as np

np_decline = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
recession_start_flag = 0  # 1 while inside a recession, 0 otherwise
recession_start = []
recession_end = []

for i in range(len(np_decline) - 1):
    if recession_start_flag == 0 and np_decline[i] == 1 and np_decline[i + 1] == 1:
        recession_start.append(i)
        recession_start_flag = 1
    if recession_start_flag == 1 and np_decline[i] == 0 and np_decline[i + 1] == 0:
        recession_end.append(i - 1)
        recession_start_flag = 0

print(recession_start)
print(recession_end)

Is there a more pandas-centric approach?

kxeu7u2r (answer #1)

The start of a run of 1s satisfies the condition

x_prev = x.shift(1)
x_next = x.shift(-1)
((x_prev != 1) & (x == 1) & (x_next == 1))

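That is, the value at the start of a run is 1, the previous value is not 1, and the next value is 1. Similarly, the end of a run satisfies the condition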

((x == 1) & (x_next == 0) & (x_next2 == 0))

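since the value at the end of a run is 1 and the next two values are 0 (here x_next2 is x.shift(-2)). We can use np.flatnonzero to find the indices where these conditions hold: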

import numpy as np
import pandas as pd

x = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0 , 0 , 1])
x_prev = x.shift(1)
x_next = x.shift(-1)
x_next2 = x.shift(-2)
df = pd.DataFrame(
    dict(start = np.flatnonzero((x_prev != 1) & (x == 1) & (x_next == 1)),
         end = np.flatnonzero((x == 1) & (x_next == 0) & (x_next2 == 0))))
print(df[['start', 'end']])

This yields

   start  end
0      3    5
1      8   11
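
To see how the conditions behave at the edges of the series, it can help to lay the shifted copies side by side. The following is only an illustrative sketch: the `view` frame and the `is_start` / `is_end` names are introduced here and are not part of the answer. `shift` fills the vacated positions with NaN, and NaN compares unequal to both 0 and 1, so `x_prev != 1` is True at index 0 while `x_next == 1` is False at the last index.

```python
import numpy as np
import pandas as pd

x = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Side-by-side view of the series and its shifted copies (column names are illustrative).
view = pd.DataFrame({
    "x": x,
    "x_prev": x.shift(1),    # value one row earlier; NaN at index 0
    "x_next": x.shift(-1),   # value one row later; NaN at the last index
    "x_next2": x.shift(-2),  # value two rows later; NaN at the last two indices
})

# Same conditions as in the answer, kept as Boolean masks.
is_start = (view["x_prev"] != 1) & (view["x"] == 1) & (view["x_next"] == 1)
is_end = (view["x"] == 1) & (view["x_next"] == 0) & (view["x_next2"] == 0)

starts = np.flatnonzero(is_start).tolist()
ends = np.flatnonzero(is_end).tolist()
print(view.assign(is_start=is_start, is_end=is_end))
print(starts, ends)
```

Note that the start and end conditions are evaluated independently, so on other inputs the two lists are not guaranteed to have the same length.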

67up9zun (answer #2)

You can use shift:

import pandas as pd

df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1], columns=['signal'])
df_prev = df.shift(1)['signal']
df_next = df.shift(-1)['signal']
df_next2 = df.shift(-2)['signal']
df.loc[(df_prev != 1) & (df['signal'] == 1) & (df_next == 1), 'start'] = 1
df.loc[(df['signal'] != 0) & (df_next == 0) & (df_next2 == 0), 'end'] = 1
df.fillna(0, inplace=True)
df = df.astype(int)

    signal  start  end
0        0      0    0
1        1      0    0
2        0      0    0
3        1      1    0
4        1      0    0
5        1      0    1
6        0      0    0
7        0      0    0
8        1      1    0
9        1      0    0
10       0      0    0
11       1      0    1
12       0      0    0
13       0      0    0
14       1      0    0
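
If the goal is the index positions rather than flag columns, they can be read back from the flags. A minimal sketch, assuming the same `signal` data; the names `sig`, `starts` and `ends` are introduced here for illustration, and the flags are assigned directly from the Boolean masks rather than through `df.loc`.

```python
import pandas as pd

df = pd.DataFrame({"signal": [0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]})
sig = df["signal"]

# Flag columns built from the same shift-based conditions as above.
df["start"] = ((sig.shift(1) != 1) & (sig == 1) & (sig.shift(-1) == 1)).astype(int)
df["end"] = ((sig != 0) & (sig.shift(-1) == 0) & (sig.shift(-2) == 0)).astype(int)

# Index labels of the rows where each flag is set.
starts = df.index[df["start"] == 1].tolist()
ends = df.index[df["end"] == 1].tolist()
print(list(zip(starts, ends)))
```

Casting the mask with `astype(int)` means no NaN is ever written into the new columns, so the `fillna` step is not needed in this variant.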

rekjcdws (answer #3)

A similar idea using shift, but writing the result as a single Boolean column:

# Boolean indexers for recession start and stops.
rec_start = (df['signal'] == 1) & (df['signal'].shift(-1) == 1)
rec_end = (df['signal'] == 0) & (df['signal'].shift(-1) == 0)

# Mark the recession start/stops as True/False.
df.loc[rec_start, 'recession'] = True
df.loc[rec_end, 'recession'] = False

# Forward fill the recession column with the last known Boolean.
# Fill any NaN's as False (i.e. locations before the first start/stop).
df['recession'] = df['recession'].ffill().fillna(False)

The resulting output:

    signal recession
0        0     False
1        1     False
2        0     False
3        1      True
4        1      True
5        1      True
6        0     False
7        0     False
8        1      True
9        1      True
10       0      True
11       1      True
12       0     False
13       0     False
14       1     False
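
The question asks for start and end indices, which can be recovered from a Boolean column like this one by looking for its edges. The sketch below is one possible way to do that, not part of the answer: it rebuilds the column with a float helper series (`state`, a name introduced here) instead of an object column, then compares each row with its neighbours.

```python
import numpy as np
import pandas as pd

signal = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Rebuild the recession column: 1.0 at start markers, 0.0 at end markers, NaN elsewhere.
state = pd.Series(np.nan, index=signal.index)
state[(signal == 1) & (signal.shift(-1) == 1)] = 1.0
state[(signal == 0) & (signal.shift(-1) == 0)] = 0.0
recession = state.ffill().fillna(0.0).astype(bool)

# Neighbouring rows; positions outside the series count as "not in recession".
prev_row = recession.shift(1, fill_value=False).astype(bool)
next_row = recession.shift(-1, fill_value=False).astype(bool)

# A spell starts where the flag is True and the previous row is False,
# and ends where the flag is True and the next row is False.
starts = recession[recession & ~prev_row].index.tolist()
ends = recession[recession & ~next_row].index.tolist()
print(starts, ends)
```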

jhdbpxl9 (answer #4)

Using rolling(2):

s = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0 , 0 , 1])

I subtract .5 so that the rolling sum is 1 when a recession starts and -1 when it stops.

s2 = s.sub(.5).rolling(2).sum()

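Since both 1 and -1 evaluate to True, I can mask the rolling signal down to just the starts and stops and then ffill. Finally, gt(0) gives a truth value for whether the carried-forward marker is positive or negative.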

pd.concat([s, s2.mask(~s2.astype(bool)).ffill().gt(0)], axis=1, keys=['signal', 'isRec'])
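
The three steps can also be traced one at a time. The sketch below is a step-by-step reading of the one-liner, not the answerer's own code: it writes the mask as `where(s2 != 0)`, which is intended to keep the same entries, and the names `marks`, `is_rec` and `aligned` are introduced here. Because `rolling(2)` labels each window by its last row, the flag switches on the second row of a pair; shifting it back by one row is one way to line it up with the first row of the pair.

```python
import pandas as pd

s = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Step 1: +1 where a pair of 1s ends, -1 where a pair of 0s ends, 0 for mixed pairs.
s2 = s.sub(.5).rolling(2).sum()

# Step 2: keep only the +1/-1 markers, then carry the last marker forward.
marks = s2.where(s2 != 0).ffill()

# Step 3: a positive marker means "in recession"; NaN before the first marker compares False.
is_rec = marks.gt(0)

# Optional: move the flag up one row so it turns on at the first row of each pair.
aligned = is_rec.shift(-1, fill_value=False).astype(bool)

print(pd.concat([s, is_rec, aligned], axis=1, keys=["signal", "isRec", "aligned"]))
```

On this data `isRec` is True for rows 4 to 6 and 9 to 12, while `aligned` is True for rows 3 to 5 and 8 to 11.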

sgtfey8w (answer #5)

You can use scipy.signal.find_peaks for this problem.

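import numpy as np
from scipy.signal import find_peaks

np_decline = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
# find_peaks returns (peak_indices, properties); width=2 requires a minimum peak width of 2 samples
peaks = find_peaks(np_decline, width=2)
recession_start_loc = peaks[1]['left_bases'][0]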

b5buobof (answer #6)

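import pandas as pd

# df1 is not defined in the answer; assumed here to hold the signal in a column named col1.
df1 = pd.DataFrame({'col1': [0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]})

def function2(dd: pd.DataFrame):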
    if dd.iat[0,1]>=2:
        if dd.query("col1==0").pipe(len)==1:
            return (dd.index.min(),dd.index.max()+1)
        else:
            dd1=dd.query("col1==1")
            return (dd1.index.min(),dd1.index.max())

col2=df1.col1.diff().eq(1).cumsum()
df1.groupby(col2).apply(lambda dd:dd.assign(col3=dd.col1.cumprod().sum()))\
    .groupby('col3',sort=False).apply(function2).dropna()

Output:

col3
3     (3, 5)
2    (8, 11)
dtype: object
