pandas: replace NaN values with the value from the closest date in a DataFrame

Posted by vq8itlhq on 2022-12-09 in Other

My current problem is as follows. Consider this DataFrame:

       name     score   date
0      Alice    22.0    2020-12-31
15     Alice    4.0     2005-12-31
5      Alice    1.0     2003-12-31
10     Alice    NaN     2000-12-31
6      Bob      8.0     2001-11-02
16     Bob      1.0     2004-11-02
11     Bob      5.0     2003-11-02
1      Bob      45.0    1980-11-02
12     Chuck    9.0     2003-12-03
2      Chuck    4.0     2015-12-03
7      Chuck    9.0     2001-12-03
17     Chuck    2.0     2004-12-03
18     Daren    3.0     2004-03-13
21     Daren    89.0    2015-08-13
3      Daren    NaN     2015-03-13
20     Daren    12.0    2015-05-13
8      Daren    5.0     2015-03-17
13     Daren    93.0    2003-03-13
14     Elisa    11.0    2003-01-24
9      Elisa    19.0    2001-01-24
19     Elisa    23.0    2004-01-24
4      Elisa    16.0    2010-01-24

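Some score values are NaN. I want to replace each of them with the same person's closest score in time, provided the time difference is within 1 year, i.e. 365 days. If no such score exists, the value should stay NaN.
In the DataFrame above, Daren's NaN score would be replaced with 5, because scores were recorded within one year of the NaN's date and that score has the smallest time difference among the candidates. For Alice, however, no score was recorded within one year of the NaN score, so her NaN score stays NaN.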
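To make the one-year rule concrete, the gaps for the two NaN rows can be checked by hand. This is only a small sketch using the dates shown in the table above; the variable names (`daren_nan`, `daren_known`, and so on) are illustrative and not part of the original post.

```python
import pandas as pd

# Daren's NaN row and his other recorded dates, taken from the table above.
daren_nan = pd.Timestamp("2015-03-13")
daren_known = pd.to_datetime(
    ["2003-03-13", "2004-03-13", "2015-03-17", "2015-05-13", "2015-08-13"]
)
daren_gaps = [abs((d - daren_nan).days) for d in daren_known]
print(min(daren_gaps))  # smallest gap in days for Daren

# Alice's NaN row and her other recorded dates.
alice_nan = pd.Timestamp("2000-12-31")
alice_known = pd.to_datetime(["2003-12-31", "2005-12-31", "2020-12-31"])
alice_gaps = [abs((d - alice_nan).days) for d in alice_known]
print(min(alice_gaps))  # smallest gap in days for Alice
```

Daren's smallest gap is 4 days (the score of 5 recorded on 2015-03-17), which is within 365 days. Alice's smallest gap is 1095 days, so nothing qualifies and her score stays NaN.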
To reproduce the DataFrame, use the following code:

import numpy as np
import pandas as pd
import datetime

data = {"name": ['Alice', 'Bob', 'Chuck', 'Daren', 'Elisa', 'Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Daren','Daren'],
        "score": [22,45,4,np.nan,16,1,8,9,5,19,np.nan,5,9,93,11,4,1,2,3,23,12,89],
        "date": ['31/12/2020','11/02/1980','12/03/2015','13/03/2015','24/01/2010','31/12/2003','11/02/2001','12/03/2001','17/03/2015','24/01/2001','31/12/2000','11/02/2003','12/03/2003','13/03/2003','24/01/2003','31/12/2005','11/02/2004','12/03/2004','13/03/2004','24/01/2004','13/05/2015','13/08/2015']}
df = pd.DataFrame(data = data)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='name')

axzmvihb #1

Assuming df is the same DataFrame given in the question above:

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y') # Create a datetime out of date column
df = df.sort_values(['name', 'date']).reset_index(drop=True) # Sort by person, then date, so adjacent rows are closest in time
scores = df['score'].copy() # Original scores, so a filled value is never reused as a source
nan_index = df.index[df['score'].isna()].to_list() # Get all index values where value is NaN
for idx in nan_index: # Loop over all NaN values and check condition
    best_gap = 365 # Only scores recorded within 365 days qualify
    for nb in (idx-1, idx+1): # Previous and next record in sorted order
        if nb < 0 or nb >= len(df) or df.at[nb, 'name'] != df.at[idx, 'name']:
            continue # No record for the same person on this side
        gap = abs((df.at[nb, 'date'] - df.at[idx, 'date']).days)
        if gap <= best_gap and pd.notna(scores.at[nb]):
            best_gap = gap
            df.at[idx, 'score'] = scores.at[nb] # Replace NaN with the closest score found
print(df)
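As an alternative to looping over rows, the same rule can probably be expressed with `pd.merge_asof`, which supports a nearest-key match per group with a tolerance. The following is a minimal sketch, not the answerer's method: the function name `fill_nearest_score`, the helper column `row`, and the small `sample` frame are assumptions made for illustration, and the DataFrame index is assumed to hold unique labels.

```python
import numpy as np
import pandas as pd

def fill_nearest_score(df, max_gap=pd.Timedelta(days=365)):
    """Fill NaN scores with the same person's closest-in-time score within max_gap."""
    out = df.copy()
    # merge_asof returns a fresh index, so keep the original labels in a column.
    missing = (out.loc[out["score"].isna(), ["name", "date"]]
                  .rename_axis("row").reset_index().sort_values("date"))
    known = out.loc[out["score"].notna(), ["name", "date", "score"]].sort_values("date")
    # For each missing row, take the nearest known date of the same name,
    # but only if it lies within max_gap; otherwise score stays NaN.
    matched = pd.merge_asof(missing, known, on="date", by="name",
                            direction="nearest", tolerance=max_gap)
    out.loc[matched["row"].to_numpy(), "score"] = matched["score"].to_numpy()
    return out

# Illustrative subset of the question's data (dates already parsed).
sample = pd.DataFrame({
    "name": ["Alice", "Alice", "Daren", "Daren", "Daren"],
    "score": [np.nan, 1.0, np.nan, 5.0, 12.0],
    "date": pd.to_datetime(["2000-12-31", "2003-12-31", "2015-03-13",
                            "2015-03-17", "2015-05-13"]),
})
result = fill_nearest_score(sample)
print(result)
```

Both inputs to `merge_asof` must be sorted by the `on` column, which is why `missing` and `known` are sorted by date rather than by name.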

3phpmpom #2

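First sort the values by name and date, and extract the year from the date column. Then group by name and year and back-fill the score:

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['name', 'date'])  # order each person's records by date
df['year'] = pd.DatetimeIndex(df['date']).year
# Note: this fills from the same person's next record in the same calendar year,
# which only approximates "closest score within 365 days".
df['score'] = df.groupby(['name', 'year'])['score'].bfill()
df.sort_index(inplace=True)  # restore the original row order
print(df.drop(columns='year'))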

This gives:

     name  score       date
0   Alice   22.0 2020-12-31
1     Bob   45.0 1980-11-02
2   Chuck    4.0 2015-12-03
3   Daren    5.0 2015-03-13
4   Elisa   16.0 2010-01-24
5   Alice    1.0 2003-12-31
6     Bob    8.0 2001-11-02
7   Chuck    9.0 2001-12-03
8   Daren    5.0 2015-03-17
9   Elisa   19.0 2001-01-24
10  Alice    NaN 2000-12-31
11    Bob    5.0 2003-11-02
12  Chuck    9.0 2003-12-03
13  Daren   93.0 2003-03-13
14  Elisa   11.0 2003-01-24
15  Alice    4.0 2005-12-31
16    Bob    1.0 2004-11-02
17  Chuck    2.0 2004-12-03
18  Daren    3.0 2004-03-13
19  Elisa   23.0 2004-01-24
20  Daren   12.0 2015-05-13
21  Daren   89.0 2015-08-13
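One caveat with grouping by calendar year is that two records can be only days apart and still fall in different years, so a year-based group would not pair them. The sketch below illustrates this with two hypothetical dates that are not taken from the question's data.

```python
import pandas as pd

# Hypothetical dates chosen to straddle a year boundary.
a = pd.Timestamp("2015-12-20")
b = pd.Timestamp("2016-01-05")

gap_days = abs((b - a).days)      # time difference in days
same_year = a.year == b.year      # would a year-based group pair them?
print(gap_days, same_year)
```

Here the gap is 16 days, well within 365, yet the two dates belong to different year groups. A comparison on the actual time difference avoids this.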

iyfamqjs #3

data = df.sort_values(by=['name', 'date'])
grouped = data.groupby('name')

# Attach each row's previous and next score/date within the same name.
data = pd.concat([data,
           grouped[['score', 'date']].shift(-1).rename(columns={'score': 'score_next', 'date': 'date_next'}),
           grouped[['score', 'date']].shift(1).rename(columns={'score': 'score_prev', 'date': 'date_prev'})
           ],
          axis=1)

def setter(x):
    # Distance in days to the previous and next record; inf if there is no usable one.
    r = x['date'] - x['date_prev']
    r = float('inf') if pd.isna(r) or pd.isna(x['score_prev']) else abs(r.days)
    v = x['date_next'] - x['date']
    v = float('inf') if pd.isna(v) or pd.isna(x['score_next']) else abs(v.days)
    if min(r, v) > 365:  # nothing within a year: keep NaN
        return float('nan')
    return x['score_prev'] if r <= v else x['score_next']

df['score'] = df['score'].fillna(data[data['score'].isna()].apply(setter, axis=1))
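Looking only at the adjacent rows can miss a match when a person has two NaN scores in a row. A possible variant, sketched below under that assumption, compares each NaN row against all of the same person's known scores. The helper names `nearest_known_score` and `fill_by_person` and the `sample` data are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

def nearest_known_score(group, max_days=365):
    """For one person's rows, fill each NaN score from the closest dated known score."""
    known = group.dropna(subset=["score"])
    filled = group["score"].copy()
    if known.empty:
        return filled
    for idx in group.index[group["score"].isna()]:
        # Absolute time difference to every known record of this person.
        gaps = (known["date"] - group.at[idx, "date"]).abs()
        closest = gaps.idxmin()
        if gaps.loc[closest] <= pd.Timedelta(days=max_days):
            filled.loc[idx] = known.at[closest, "score"]
    return filled

def fill_by_person(df, max_days=365):
    out = df.copy()
    # Iterate over the untouched input so filled values never act as sources.
    for _, group in df.groupby("name"):
        out.loc[group.index, "score"] = nearest_known_score(group, max_days)
    return out

# Hypothetical data with two consecutive NaN scores for Daren.
sample = pd.DataFrame({
    "name": ["Daren", "Daren", "Daren", "Alice", "Alice"],
    "score": [np.nan, np.nan, 5.0, np.nan, 1.0],
    "date": pd.to_datetime(["2015-03-10", "2015-03-13", "2015-03-17",
                            "2000-12-31", "2003-12-31"]),
})
result = fill_by_person(sample)
print(result)
```

This does more comparisons per NaN row than the shift-based version, which is unlikely to matter for small groups.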
