pandas: How to forward-fill a specific value using its neighboring values

cbjzeqam posted on 2023-01-07 in Other

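For example, I have a pandas Series like the one below (blank means a missing value). For simplicity I use an integer index here, but in reality it is a DatetimeIndex.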

0,
1,5
2,3
3,
4,5
5,
6,30
7,5
8,5
9,31
10,31
11,
12,5
13,5

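I want to overwrite the value 5 by forward filling, but only when its preceding neighbor belongs to a specific list of values, e.g. [30, 31, 32]. The output for the example above should be: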

0,
1,5
2,3
3,
4,5
5,
6,30
7,30
8,30
9,31
10,31
11,
12,5
13,5

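How can I do this?
This is part of a data-cleaning task I am working on. The goal is to correct weather-condition codes that were recorded wrongly because of a preceding event.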
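
For anyone who wants to try the answers below on something closer to the real data, here is a minimal sketch that builds the example on a DatetimeIndex together with the expected result. The daily timestamps starting at 2023-01-01 are made up for illustration; the question only says that the real index is a DatetimeIndex.

```python
import numpy as np
import pandas as pd

values = [np.nan, 5, 3, np.nan, 5, np.nan, 30, 5, 5, 31, 31, np.nan, 5, 5]
expected_values = [np.nan, 5, 3, np.nan, 5, np.nan, 30, 30, 30, 31, 31, np.nan, 5, 5]

# Hypothetical daily timestamps; the real dates are not given in the question
index = pd.date_range("2023-01-01", periods=len(values), freq="D")

s = pd.Series(values, index=index)
expected = pd.Series(expected_values, index=index)
print(s)
```

A candidate solution can then be checked with `result.equals(expected)`, which treats NaN values in the same positions as equal.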

xt0899hw #1

This can be done with a mask built from a shifted copy of the Series:

# Mask for the value of 5 and if the previous neighbor falls within a specific list of values
mask = (s == 5) & (s.shift().isin([30, 31, 32]))

# Replace the values with whatever you like
s = s.where(~mask, 0)
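
The mask above matches only a 5 that directly follows one of the listed values, so the second 5 in a run such as 30, 5, 5 is left untouched. Below is a minimal sketch of one way to cover whole runs, assuming missing values are stored as NaN. The names `allowed`, `is_five`, `block` and `anchor` are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 5, 3, np.nan, 5, np.nan, 30, 5, 5, 31, 31, np.nan, 5, 5])
allowed = [30, 31, 32]

is_five = s.eq(5)

# Every non-5 row starts a new block, so each run of 5s shares a block
# with the value (or missing value) right before it.
block = (~is_five).cumsum()

# The block's anchor is that preceding value; it stays NaN when the run
# follows a missing value or sits at the very start of the Series.
anchor = s.where(~is_five).groupby(block).transform("first")

# Overwrite a 5 only when its anchor is one of the allowed values.
result = s.mask(is_five & anchor.isin(allowed), anchor)
print(result)
```

Because the blocks are built from row order alone, the same steps should work unchanged on a DatetimeIndex.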

jrcvhitl #2

If I understand correctly, this should work:

import numpy as np
import pandas as pd

# Create the original Series with missing values represented as None
s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5],
            index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])

# Define the list of allowed preceding values
allowed_values = [30, 31, 32]

# Create an array of zeros with the same shape as s.values
modified_series = np.zeros_like(s.values)

# Replace all 5s in modified_series with np.nan
modified_series = np.where(s.values == 5, np.nan, modified_series)

# Replace all values in modified_series that are in allowed_values with the corresponding value in s
modified_series = np.where(s.isin(allowed_values), s, modified_series)

# Convert modified_series to a Pandas Series, preserve the original index, and forward fill np.nan values
modified_series = pd.Series(modified_series, index=s.index).ffill()

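# Replace a 5 in s only where the forward-filled value is an allowed one;
# elsewhere the fill carries the 0 placeholder, which must not overwrite the 5
modified_series = np.where((s == 5) & modified_series.isin(allowed_values), modified_series.values, s)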

# Convert modified_series to a Pandas Series and preserve the original index
modified_series = pd.Series(modified_series, index=s.index)

# Print the modified Series
print(modified_series)

This should return:

0      NaN
1      5.0
2      3.0
3      NaN
4      5.0
5      NaN
6     30.0
7     30.0
8     30.0
9     31.0
10    31.0
11     NaN
12     5.0
13     5.0

Edit: changed to remove the for loop and use vectorization.

wwtsj6pe #3

Here is a somewhat cumbersome solution that needs no loop (but creates a few intermediate columns):

import pandas as pd 
df = pd.DataFrame([
    None, 5, 3, None, 5, None, 30, 5, 5, 31, 
    31, None, 5, 5], columns=['val'])

target_values = [30, 31, 32]
df['target'] = df.val.isin(target_values) # create bool mask

# index each number w/o considering 5
df['seq_idx'] = (df.val!=5).cumsum() 

# tag indexes that contain values that will be replaced
df['to_replace'] = df.groupby(
    'seq_idx')['target'].transform('first')

# get replacement values (first of each 'sequence')
df['replace_val'] = df.groupby(
    'seq_idx')['val'].transform('first') 

# actually replace them
df.loc[df.to_replace,'val']  = df.loc[df.to_replace, 'replace_val']

After that, you only need to drop the helper columns :)
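
The cleanup step could look like the sketch below, which repeats the steps of this answer in compact form and then removes the helper columns with `DataFrame.drop`. The `.astype(bool)` call is a defensive addition that is not in the answer; it only makes sure the mask used with `.loc` is boolean.

```python
import pandas as pd

df = pd.DataFrame({"val": [None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5]})
target_values = [30, 31, 32]

# Same helper columns as in the answer
df["target"] = df["val"].isin(target_values)
df["seq_idx"] = (df["val"] != 5).cumsum()
df["to_replace"] = df.groupby("seq_idx")["target"].transform("first").astype(bool)
df["replace_val"] = df.groupby("seq_idx")["val"].transform("first")
df.loc[df["to_replace"], "val"] = df.loc[df["to_replace"], "replace_val"]

# Drop the helper columns once the replacement is done
df = df.drop(columns=["target", "seq_idx", "to_replace", "replace_val"])
print(df)
```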

nnt7mjpx #4

This is a tricky one.
Assuming your example Series can be built as:

import pandas as pd

s = pd.Series(['', 5, 3, '', 5, '', 30, 5, 5, 31, 31, '', 5, 5])

you can:

lst = [30, 31, 32]
#First you could get groups of continuous values like 5,5 at indexes 7,8.
g = (s.ne(s.shift())).cumsum()

#Then replace the values of those 5s whose previous value falls in the desired list
s.loc[(s.eq(5)) & (s.shift().isin(lst))] = s.shift()

#Then for each group do a transform with 'first' value so 
#each 5 next to each is replaced with first value which was set 
#in the second step.
s = s.groupby(g).transform('first')

print(s)
0       
1      5
2      3
3       
4      5
5       
6     30
7     30
8     30
9     31
10    31
11      
12     5
13     5
dtype: object
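
Since the blanks are modeled as empty strings here, the result has dtype object. If a numeric Series with NaN is preferred, one option (an assumption on my part, not a step from the answer) is to convert it with `pd.to_numeric` and `errors="coerce"`. The list below simply retypes the printed output above.

```python
import pandas as pd

# The result printed above, with '' standing in for missing values
result = pd.Series(['', 5, 3, '', 5, '', 30, 30, 30, 31, 31, '', 5, 5])

# Anything that cannot be parsed as a number, such as '', becomes NaN
numeric = pd.to_numeric(result, errors="coerce")
print(numeric.dtype)
print(numeric)
```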

p4tfgftt #5

Another possible solution:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 5, 3, np.nan, 5, np.nan, 30, 5, 5, 31, 31, np.nan, 5, 5])

mask = s.eq(5) & (s.shift().isin([5, 30, 31, 32]))
s1 = s.mask(mask, s.fillna('a').mask(s.eq(5)).ffill())
s = s1.mask(s1.eq('a'), s)

Output:

0      NaN
1      5.0
2      3.0
3      NaN
4      5.0
5      NaN
6     30.0
7     30.0
8     30.0
9     31.0
10    31.0
11     NaN
12     5.0
13     5.0

7rfyedvj #6

I came up with a solution of my own and would be glad to hear your thoughts. I have not yet tried the other solutions posted here.

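import numpy as np

# df is the data frame being cleaned; COL is the name of the column to fix

# pick out the rows of interest
x = df.loc[df[COL].isin([5, 30, 31, 32]), COL]

# reindex to the original df before forward fill
x = x.reindex(df.index)

# after reindexing, all other rows are missing in x;
# fill those with an arbitrary sentinel so the forward fill stops at them
y = x.fillna(999)

# turn all 5s into missing values
y[y == 5] = np.nan

# now we can forward fill those missing values
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent)
y = y.ffill()

# reverse the sentinel step above
y[y == 999] = np.nan

# a 5 that did not follow 30/31/32 is now missing; restore it from x
y = y.fillna(x)

# put the rows of interest back into the data frame
# (assign the result; inplace=True on df[COL] acts on a temporary object)
df[COL] = df[COL].where(~df[COL].isin([5, 30, 31, 32]), y)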
