Pandas: how to flag rows between a start (1) and multiple ends (2 or 3)?

gstyhher, posted 2023-09-29 in Other

I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame([])
df['Date'] = ['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05',
              '2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10',
              '2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15',
              '2020-01-16','2020-01-17','2020-01-18','2020-01-19','2020-01-20']
df['Machine'] = ['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A']
df['Signal'] = [0,1,2,0,1,3,0,0,0,3,0,1,0,0,3,0,1,0,0,1]
df['Status'] =  0

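I also have the function below, which generates the Status column for machine A. In the Signal column, a 1 switches the machine on (Status 1). It stays on until it receives a 2 or a 3, which switch the status to 0 (off), and it then stays off until it receives a 1 again.
I have solved the problem of carrying the previous row's Status value (1 or 0) forward with the following function: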

def s_gen(dataset, Signal):
    _status = 0
    status0 = []
    for (i) in Signal:
        if _status == 0:
            if i == 1:
                _status = 1 
        elif _status == 1:
            if (i == 2 or i==3):
                _status = 0
        status0.append(_status)
        
    dataset['status0'] = status0

    return dataset['status0']

df['Status'] = s_gen(df,df['Signal'])
df.drop('status0',axis=1,inplace = True)
df

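This attaches the newly created column to the dataframe. However, I have a much larger dataframe with many different values in the Machine column (grouped in runs: A, A, A, B, B, B, and so on), and the function's results must not carry over from one machine to the next. Using groupby did not work for me. So I think the next step is to generate the Status for each machine as a separate list and concatenate them, as part of a larger outer loop, before attaching the whole series to the larger dataframe.
This is the expected result: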

df = pd.DataFrame([])
df['Date'] = ['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05',
              '2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10',
              '2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15',
              '2020-01-16','2020-01-17','2020-01-18','2020-01-19','2020-01-20',
              '2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05',
              '2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10',
              '2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15',
              '2020-01-16','2020-01-17','2020-01-18','2020-01-19','2020-01-20']

df['Machine'] = ['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
                'B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B',]
df['Signal'] = [0,1,2,0,1,3,0,0,0,3,0,1,0,0,3,0,1,0,0,1,0,1,2,0,1,3,0,0,0,3,0,1,0,0,3,0,1,0,0,1]
df['Status'] = [0,1,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,0,1,0,0,1,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1]
df

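The problem I am struggling with is that if the function processes each machine's data separately before attaching it to the dataframe, it has to loop over every machine, concatenate all the resulting Status series, and then attach the combined series to the dataframe.
This is what I have tried so far: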

dfList = df[df['Machine']]
dfListU = pd.DataFrame([])
dfListU = dfList['Machine'].unique()
dfListU.flatten()

def s_gen2(item, dataset, Signal):
   
    data = df[df.Machine==m]
    for m in dfListU:
        _status = 0
        status0 = []

        for (i) in Signal:
            if _status == 0:
                if i == 1:
                    _status = 1 
            elif _status == 1:
                if (i == 2 or i==3):
                    _status = 0
            #status0.append(_status)

        dataset['status0'] = status0

        return dataset['status0']
    for i in dfListU:
        df1 = pd.concat(i)
    status0.append(_status)
df['Status'] = s_gen(df,df['Signal'])
df.drop('status0',axis=1,inplace = True)
df

This results in an error (index values abbreviated here): KeyError: "None of [Index(['A', 'A', ..., 'B', 'B'], dtype='object')] are in the [columns]"
Would it be better to loop the function over dfListU (the list of unique machines) and then concatenate the results? I tried to avoid loops, but I could not find any other way to compare the previous Status row with the current row of the Signal column.
Any help is sincerely appreciated.

jobtbby3 #1

A simple approach is to map the known states and then groupby.ffill them:

df['Status'] = (df['Signal']
 .map({1:1, 2:0, 3:0})
 .groupby(df['Machine']).ffill()
 .fillna(0).astype(int)  # the downcast= argument of fillna is deprecated in recent pandas
 )

Output:

          Date Machine  Signal  Status
0   2020-01-01       A       0       0
1   2020-01-02       A       1       1
2   2020-01-03       A       2       0
3   2020-01-04       A       0       0
4   2020-01-05       A       1       1
5   2020-01-06       A       3       0
6   2020-01-07       A       0       0
7   2020-01-08       A       0       0
8   2020-01-09       A       0       0
9   2020-01-10       A       3       0
10  2020-01-11       A       0       0
11  2020-01-12       A       1       1
12  2020-01-13       A       0       1
13  2020-01-14       A       0       1
14  2020-01-15       A       3       0
15  2020-01-16       A       0       0
16  2020-01-17       A       1       1
17  2020-01-18       A       0       1
18  2020-01-19       A       0       1
19  2020-01-20       A       1       1
20  2020-01-01       B       0       0
21  2020-01-02       B       1       1
22  2020-01-03       B       2       0
23  2020-01-04       B       0       0
24  2020-01-05       B       1       1
25  2020-01-06       B       3       0
26  2020-01-07       B       0       0
27  2020-01-08       B       0       0
28  2020-01-09       B       0       0
29  2020-01-10       B       3       0
30  2020-01-11       B       0       0
31  2020-01-12       B       1       1
32  2020-01-13       B       0       1
33  2020-01-14       B       0       1
34  2020-01-15       B       3       0
35  2020-01-16       B       0       0
36  2020-01-17       B       1       1
37  2020-01-18       B       0       1
38  2020-01-19       B       0       1
39  2020-01-20       B       1       1
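
To see why the map-then-ffill idea works, the three steps can be traced on a short signal for two machines. This is only an illustrative sketch: the `demo` frame and its values are made up for the example and are not the question's data.

```python
import pandas as pd

# Hypothetical mini-example: two machines, three rows each.
demo = pd.DataFrame({
    "Machine": ["A", "A", "A", "B", "B", "B"],
    "Signal":  [1, 0, 3, 0, 1, 0],
})

# Step 1: map the command signals to a state; 0 (no command) becomes NaN.
mapped = demo["Signal"].map({1: 1, 2: 0, 3: 0})

# Step 2: forward-fill within each machine, so a state persists until the
# next command without leaking from machine A into machine B.
filled = mapped.groupby(demo["Machine"]).ffill()

# Step 3: rows before a machine's first command are still NaN; treat them as off.
demo["Status"] = filled.fillna(0).astype(int)
print(demo)
```

Grouping by `Machine` before the forward fill is what keeps machine B's first row at 0 even though machine A ended on a command.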

cxfofazt #2

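There is a better way. Use cumsum to label the blocks of rows that start at each 2/3 signal. Then mask the signal values that are not 1, group the masked column by machine and block, and forward-fill the values.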

m1 = df['Signal'].ne(1)
m2 = df['Signal'].isin([2, 3])

df['Status'] = df['Signal'].mask(m1).groupby([df['Machine'], m2.cumsum()]).ffill().fillna(0)

Output:

          Date Machine  Signal  Status
0   2020-01-01       A       0     0.0
1   2020-01-02       A       1     1.0
2   2020-01-03       A       2     0.0
3   2020-01-04       A       0     0.0
4   2020-01-05       A       1     1.0
5   2020-01-06       A       3     0.0
6   2020-01-07       A       0     0.0
7   2020-01-08       A       0     0.0
8   2020-01-09       A       0     0.0
9   2020-01-10       A       3     0.0
10  2020-01-11       A       0     0.0
11  2020-01-12       A       1     1.0
12  2020-01-13       A       0     1.0
13  2020-01-14       A       0     1.0
14  2020-01-15       A       3     0.0
15  2020-01-16       A       0     0.0
16  2020-01-17       A       1     1.0
17  2020-01-18       A       0     1.0
18  2020-01-19       A       0     1.0
19  2020-01-20       A       1     1.0
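
The block labelling may be easier to follow with the intermediate column made visible. The sketch below uses a made-up single-machine frame (`demo` and the `Block` column name are assumptions for illustration) and adds an `astype(int)` at the end, which the answer itself does not do.

```python
import pandas as pd

# Hypothetical mini-example with one machine.
demo = pd.DataFrame({
    "Machine": ["A"] * 6,
    "Signal":  [0, 1, 0, 2, 0, 1],
})

m1 = demo["Signal"].ne(1)          # True where the signal is not "on"
m2 = demo["Signal"].isin([2, 3])   # True where the signal is "off"

# Every "off" signal starts a new block, so a forward fill cannot cross it.
demo["Block"] = m2.cumsum()

# Keep only the 1s, then forward-fill inside each (machine, block) pair.
demo["Status"] = (
    demo["Signal"].mask(m1)
    .groupby([demo["Machine"], demo["Block"]]).ffill()
    .fillna(0).astype(int)
)
print(demo)
```

Because an "off" row is the first row of its block and is itself masked to NaN, it has nothing to inherit and ends up as 0 after `fillna`.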

aurhwmvo #3

You can use np.select as the state machine and keep the code vectorized:

import numpy as np

conds = [df['Signal'].eq(1), df['Signal'].isin([2, 3])]
vals = [1, 0]

# np.nan where df['Signal'] == 0 (the default value)
# index=df.index keeps the result aligned with df even if df has a non-default index
status = pd.Series(np.select(condlist=conds, choicelist=vals, default=np.nan), index=df.index)

# forward-fill the current status, then fill the remaining values with 0
df['Status'] = status.groupby(df['Machine']).ffill().fillna(0).astype(int)

Output:

>>> df
          Date Machine  Signal  Status
0   2020-01-01       A       0       0
1   2020-01-02       A       1       1
2   2020-01-03       A       2       0
3   2020-01-04       A       0       0
4   2020-01-05       A       1       1
5   2020-01-06       A       3       0
6   2020-01-07       A       0       0
7   2020-01-08       A       0       0
8   2020-01-09       A       0       0
9   2020-01-10       A       3       0
10  2020-01-11       A       0       0
11  2020-01-12       A       1       1
12  2020-01-13       A       0       1
13  2020-01-14       A       0       1
14  2020-01-15       A       3       0
15  2020-01-16       A       0       0
16  2020-01-17       A       1       1
17  2020-01-18       A       0       1
18  2020-01-19       A       0       1
19  2020-01-20       A       1       1
20  2020-01-01       B       0       0
21  2020-01-02       B       1       1
22  2020-01-03       B       2       0
23  2020-01-04       B       0       0
24  2020-01-05       B       1       1
25  2020-01-06       B       3       0
26  2020-01-07       B       0       0
27  2020-01-08       B       0       0
28  2020-01-09       B       0       0
29  2020-01-10       B       3       0
30  2020-01-11       B       0       0
31  2020-01-12       B       1       1
32  2020-01-13       B       0       1
33  2020-01-14       B       0       1
34  2020-01-15       B       3       0
35  2020-01-16       B       0       0
36  2020-01-17       B       1       1
37  2020-01-18       B       0       1
38  2020-01-19       B       0       1
39  2020-01-20       B       1       1

mbzjlibv #4

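There are some confusing lines in the s_gen2 block, and I doubt it would run as written. For example, in the lines below, m is used before it is assigned.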

data = df[df.Machine==m]
for m in dfListU:
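
A small sketch of what Python does in that situation, using a made-up two-row frame (`df_demo` and the function name are hypothetical). It also reproduces the kind of KeyError the question reports, which appears to come from indexing the frame with the values of the Machine column rather than with a boolean mask.

```python
import pandas as pd

df_demo = pd.DataFrame({"Machine": ["A", "B"], "Signal": [1, 0]})

def uses_m_too_early():
    # `m` is assigned by the loop below, so Python treats it as a local name
    # for the whole function; reading it first raises UnboundLocalError.
    data = df_demo[df_demo.Machine == m]
    for m in ["A", "B"]:
        pass
    return data

try:
    uses_m_too_early()
    early_use_error = None
except UnboundLocalError as exc:
    early_use_error = type(exc).__name__

# Indexing with the column's values asks pandas for columns named "A" and "B".
# They do not exist, hence the KeyError.
try:
    df_demo[df_demo["Machine"]]
    lookup_error = None
except KeyError as exc:
    lookup_error = type(exc).__name__

print(early_use_error, lookup_error)
```

Selecting one machine's rows needs a boolean mask instead, for example `df_demo[df_demo["Machine"] == "A"]`.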

Anyway, now that your machine list is grouped, s_gen is quite reusable; it only needs a small tweak so that it leaves the dataframe untouched.

df = pd.DataFrame([])
df['Date'] = ['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05',
            '2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10',
            '2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15',
            '2020-01-16','2020-01-17','2020-01-18','2020-01-19','2020-01-20',
            '2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05',
            '2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10',
            '2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15',
            '2020-01-16','2020-01-17','2020-01-18','2020-01-19','2020-01-20']

df['Machine'] = ['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B',]
df['Signal'] = [0,1,2,0,1,3,0,0,0,3,0,1,0,0,3,0,1,0,0,1,0,1,2,0,1,3,0,0,0,3,0,1,0,0,3,0,1,0,0,1] 

def s_gen(Signal):
    _status = 0
    status0 = []
    for (i) in Signal:
        if _status == 0:
            if i == 1:
                _status = 1 
        elif _status == 1:
            if (i == 2 or i==3):
                _status = 0
        status0.append(_status)

    return status0

unique_machines = df['Machine'].unique()

whole_status_list = []

for m in unique_machines:
    data = df[df.Machine==m]

    whole_status_list.extend(s_gen(data["Signal"]))

df["Status"] = whole_status_list

The code above should help.
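
One caveat worth spelling out: extending a flat list works only when each machine's rows are contiguous and appear in the same order as `unique()` returns them. If rows from different machines are interleaved, a variant that writes each result back through a boolean mask keeps the values on the right rows. The following is a sketch under that assumption, with a made-up `demo` frame and a condensed `s_gen`.

```python
import pandas as pd

def s_gen(signal):
    """1 switches on, 2 or 3 switch off, anything else keeps the state."""
    status, out = 0, []
    for s in signal:
        if s == 1:
            status = 1
        elif s in (2, 3):
            status = 0
        out.append(status)
    return out

# Hypothetical frame with interleaved machines.
demo = pd.DataFrame({
    "Machine": ["A", "B", "A", "B", "A", "B"],
    "Signal":  [1, 0, 0, 1, 3, 0],
})

demo["Status"] = 0
for m in demo["Machine"].unique():
    rows = demo["Machine"] == m
    # Assigning through the mask puts each value back on its own row.
    demo.loc[rows, "Status"] = s_gen(demo.loc[rows, "Signal"])

print(demo)
```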

camsedfj #5

You can simplify the s_gen function, since it only needs to know whether to switch the machine off or on:

def s_gen(Signal):
    _status = 0
    Status = []
    for sig in Signal:
        _status = 1 if sig == 1 else 0 if sig in [2, 3] else _status
        Status.append(_status)
    return Status

You can then use groupby.transform to get the status for each machine:

df['NewStatus'] = df.groupby('Machine')['Signal'].transform(s_gen)

For your sample data, we can check that the result is correct:

(df['Status'] == df['NewStatus']).all()
# True

nimxete2 #6

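You can use groupby to split the dataframe into groups keyed on Machine, then use transform to apply your function to each group's Signal column. Note that I have modified your s_gen function so that it fits the transform operation.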

def s_gen(signal):
    _status = 0
    status0 = []
    for (i) in signal:
        if _status == 0:
            if i == 1:
                _status = 1
        elif _status == 1:
            if i == 2 or i == 3:
                _status = 0
        status0.append(_status)

    return status0

df['Status'] = df.groupby('Machine')['Signal'].transform(s_gen)

If you want to take advantage of vectorization to speed things up, you can use the following transformation:

# divide dataframe into groups based on "Machine" key
machine_groups = df.groupby("Machine")

# find out when the signal is switched (signal != 0)
df['SignalSwitch'] = (machine_groups.Signal.diff().fillna(machine_groups.Signal.transform('first')).astype(bool) & (df['Signal'] != 0))

# initialize 'Status' column
df['Status'] = None

# fill the 'Status' column with 1 if signal is 1 at the time of 'SignalSwitch', 0 if signal is 2 or 3
df.loc[df['SignalSwitch'] & (df['Signal'] == 1), 'Status'] = 1
df.loc[df['SignalSwitch'] & (df['Signal'].isin([2, 3])), 'Status'] = 0

# forward fill the remaining timestamps with previous status, fill NA (for cases when dataframe starts with signal = 0)
df['Status'] = machine_groups.Status.ffill().fillna(0).astype(int)
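
When swapping the loop for a vectorized version, it can be reassuring to compare the two on random input. The sketch below is one way to do that. `vectorized_status` is a hypothetical helper that restates the diff/ffill steps above in compact form, and the random `check` frame is made up for the test.

```python
import numpy as np
import pandas as pd

def s_gen(signal):
    """Reference loop: 1 switches on, 2 or 3 switch off, 0 keeps the state."""
    status, out = 0, []
    for s in signal:
        if s == 1:
            status = 1
        elif s in (2, 3):
            status = 0
        out.append(status)
    return out

def vectorized_status(frame):
    """Compact restatement of the diff/ffill steps (helper name is hypothetical)."""
    signal = frame["Signal"]
    groups = frame.groupby("Machine")["Signal"]
    # A row counts as a switch when its signal differs from the previous row
    # of the same machine (or is the machine's first row) and is non-zero.
    changed = groups.diff().fillna(groups.transform("first")).astype(bool)
    switch = changed & signal.ne(0)
    status = pd.Series(np.nan, index=frame.index)
    status[switch & signal.eq(1)] = 1
    status[switch & signal.isin([2, 3])] = 0
    return status.groupby(frame["Machine"]).ffill().fillna(0).astype(int)

# Random, interleaved machines and signals with a fixed seed.
rng = np.random.default_rng(0)
check = pd.DataFrame({
    "Machine": rng.choice(["A", "B", "C"], size=300),
    "Signal": rng.choice([0, 1, 2, 3], size=300),
})

looped = check.groupby("Machine")["Signal"].transform(s_gen)
results_match = bool((looped == vectorized_status(check)).all())
print(results_match)
```

If `results_match` ever came out False on your own data, the loop version is the safer reference, since it encodes the state machine directly.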
