pandas: extract date values from a list column and put them in new columns

Posted by dgtucam1 on 2023-03-06 in Other

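I have a DataFrame with a column ABDAT whose values are lists. The number of elements in each list varies. Most elements are NaT, and each list contains at most three valid dates. If ABDAT has one date, I want to put it in a new column called ABDAT1. If ABDAT has two valid dates, I want the first in ABDAT1 and the second in ABDAT2. If ABDAT has three dates, I want the first in ABDAT1, the second in ABDAT2, and the third in ABDAT3. Here is a sample of the DataFrame, including the ABDAT column.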

    ID  FDAT        ABDAT
0   1   2004-08-17  [2004-08-17 00:00:00, NaT, NaT, NaT]
1   1   2005-07-10  [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
2   1   2006-06-12  [NaT, NaT]
3   3   2001-12-20  [NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT, NaT]
4   3   2003-07-14  [2001-02-17 00:00:00, NaT, NaT, NaT, NaT, NaT]
5   3   2004-06-01  [NaT, 2012-12-21 00:00:00, NaT, NaT, NaT, NaT, NaT, 2018-05-01 00:00:00, NaT]

I want to create:

    ID  FDAT        ABDAT1      ABDAT2      ABDAT3
0   1   2004-08-17  2004-08-17  NaT         NaT
1   1   2005-07-10  NaT         NaT         NaT
2   1   2006-06-12  NaT         NaT         NaT
3   3   2001-12-20  NaT         NaT         NaT
4   3   2003-07-14  2001-02-17  NaT         NaT
5   3   2004-06-01  2012-12-21  2018-05-01  NaT

The DataFrame is fairly large (800,000 rows).
Thanks.

s4n0splo (answer #1)

It's actually quite simple:

# Import the pandas library
import pandas as pd

# Create a sample dataframe
data = {'ID': [1, 1, 1, 3, 3, 3],
        'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
        'ABDAT': [['2004-08-17 00:00:00', pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  ['2001-02-17 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, '2012-12-21 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, '2018-05-01 00:00:00', pd.NaT]]}

df = pd.DataFrame(data)

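# Create the three target columns, initially all NaT
df['ABDAT1'] = pd.NaT
df['ABDAT2'] = pd.NaT
df['ABDAT3'] = pd.NaT

# Loop over the rows; this assumes the default RangeIndex (0..n-1)
for i in range(len(df)):
    # Keep only the valid (non-null) dates, preserving their order
    dates = [date for date in df.loc[i, 'ABDAT'] if not pd.isnull(date)]
    # Fill ABDAT1..ABDAT3 with the first, second and third valid date, if present
    if len(dates) >= 1:
        df.at[i, 'ABDAT1'] = dates[0]
    if len(dates) >= 2:
        df.at[i, 'ABDAT2'] = dates[1]
    if len(dates) >= 3:
        df.at[i, 'ABDAT3'] = dates[2]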

df = df.drop('ABDAT', axis = 1)
print(df)

This gives you:

   ID        FDAT     ABDAT1     ABDAT2 ABDAT3
0   1  2004-08-17 2004-08-17        NaT    NaT
1   1  2005-07-10        NaT        NaT    NaT
2   1  2006-06-12        NaT        NaT    NaT
3   3  2001-12-20        NaT        NaT    NaT
4   3  2003-07-14 2001-02-17        NaT    NaT
5   3  2004-06-01 2012-12-21 2018-05-01    NaT
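Since the question mentions roughly 800,000 rows, a per-row Python loop with `df.loc`/`df.at` may be slow. The following is only a sketch of a loop-free alternative built on `Series.explode` and `DataFrame.pivot`; the function name `spread_valid_dates` and its `max_dates` parameter are made up for illustration, and the shortened sample data is invented. It would be worth timing against the loop on the real data before relying on it.

```python
import pandas as pd

def spread_valid_dates(df, col="ABDAT", max_dates=3):
    """Return df without `col`, plus col1..col{max_dates} holding the valid dates."""
    # One row per list element; the original row label is repeated in the index
    flat = df[col].explode().dropna()
    # Long table: original row label, rank of the date within its row, the date
    tidy = pd.DataFrame({
        "row": flat.index.to_numpy(),
        "pos": flat.groupby(level=0).cumcount().to_numpy() + 1,
        "value": pd.to_datetime(flat.to_numpy()),
    })
    tidy = tidy[tidy["pos"] <= max_dates]
    # Turn ranks into columns, then restore rows and columns without any valid date
    wide = tidy.pivot(index="row", columns="pos", values="value")
    wide = wide.reindex(index=df.index, columns=range(1, max_dates + 1))
    wide = wide.apply(pd.to_datetime)  # all-missing columns become NaT as well
    wide.columns = [f"{col}{i}" for i in wide.columns]
    return df.drop(columns=col).join(wide)

# Hypothetical, shortened sample data in the shape used above
df = pd.DataFrame({
    "ID": [1, 1, 3],
    "FDAT": ["2004-08-17", "2005-07-10", "2004-06-01"],
    "ABDAT": [
        ["2004-08-17 00:00:00", pd.NaT, pd.NaT],
        [pd.NaT, pd.NaT],
        [pd.NaT, "2012-12-21 00:00:00", pd.NaT, "2018-05-01 00:00:00"],
    ],
})
out = spread_valid_dates(df)
print(out)
```

Because the columns are reindexed to `1..max_dates`, ABDAT1 to ABDAT3 always exist, even when no row has three valid dates.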

However, this assumes that no row has more than three valid dates. You can generalize it like this:

import pandas as pd

# Missing entries must be pd.NaT: the string 'NaT' is not treated as null by pd.isna
data = {'ID': [1, 1, 1, 3, 3, 3],
        'FDAT': ['2004-08-17', '2005-07-10', '2006-06-12', '2001-12-20', '2003-07-14', '2004-06-01'],
        'ABDAT': [['2004-08-17 00:00:00', pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT],
                  [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  ['2001-02-17 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
                  [pd.NaT, '2012-12-21 00:00:00', pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, '2018-05-01 00:00:00', pd.NaT]]
        }

df = pd.DataFrame(data)

def extract_dates(dates):
    # Keep only the valid (non-null) entries, in their original order
    return [d for d in dates if not pd.isna(d)]

# One list of valid dates per row; the DataFrame constructor pads shorter rows with None
valid = df['ABDAT'].apply(extract_dates)
new_cols = pd.DataFrame(valid.tolist(), index=df.index)

# Name the columns ABDAT1, ABDAT2, ... and convert their values to datetimes
new_cols.columns = [f'ABDAT{i+1}' for i in range(new_cols.shape[1])]
new_cols = new_cols.apply(pd.to_datetime)

df = df.drop('ABDAT', axis=1).join(new_cols)

print(df)

This gives:

   ID        FDAT     ABDAT1     ABDAT2
0   1  2004-08-17 2004-08-17        NaT
1   1  2005-07-10        NaT        NaT
2   1  2006-06-12        NaT        NaT
3   3  2001-12-20        NaT        NaT
4   3  2003-07-14 2001-02-17        NaT
5   3  2004-06-01 2012-12-21 2018-05-01

Only as many ABDAT columns are created as the row with the most valid dates needs, which is two for this sample.
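Which version fits depends on how many valid dates a row can actually contain. A quick way to check that assumption is to count the non-null entries per list before reshaping anything. This is a minimal sketch on made-up rows; the name `n_valid` is only illustrative.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the ABDAT column
df = pd.DataFrame({"ABDAT": [
    ["2004-08-17 00:00:00", pd.NaT],
    [pd.NaT, pd.NaT],
    [pd.NaT, "2012-12-21 00:00:00", pd.NaT, "2018-05-01 00:00:00"],
]})

# Number of valid (non-null) dates in each row's list
n_valid = df["ABDAT"].map(lambda dates: sum(not pd.isna(d) for d in dates))

print(n_valid.max())                        # largest number of valid dates in any row
print(n_valid.value_counts().sort_index())  # how many rows have 0, 1, 2, ... valid dates
```

If the maximum is three or fewer, the fixed ABDAT1 to ABDAT3 layout from the question is sufficient.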
