pandas: building an auto-incrementing flag based on a string-ending condition

Posted by ilmyapht on 2023-01-11 in Other

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
  1: 'onboarding segment-confirmation-unexpected-input view',
  2: 'product-availability cpf-request-unexpected-input origin',
  3: 'product-availability postalcode-validation-true-unexpected-input origin',
  4: 'product-availability postalcode-validation-true-unexpected-input view'},
 'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772}})

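I want to build a flag that checks whether the part of the string other than the trailing word "view" or "origin" is equal to the previous row's value. If it is, the flag stays the same; if not, the flag value is incremented.
Expected result: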

df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
      1: 'onboarding segment-confirmation-unexpected-input view',
      2: 'product-availability cpf-request-unexpected-input origin',
      3: 'product-availability postalcode-validation-true-unexpected-input origin',
      4: 'product-availability postalcode-validation-true-unexpected-input view'},
     'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772},
'Flag':{0:'Flag_1',1:'Flag_1',2:'Flag_2',3:'Flag_3',4:'Flag_3'}})

How can I do this? I tried slicing the string and building a groupby, but I'm having some trouble with the incrementing part.

kulphzqa #1

Assuming you want to compare only the first two chunks of the string (chunks are separated by whitespace):

# get substrings, keep first 2 (can be changed)
df2 = df['Category'].str.split(expand=True).iloc[:, :2]

# start new group if any value is different from the previous row
group = df2.ne(df2.shift()).any(axis=1).cumsum()

# add flag
df['Flag'] = 'Flag_'+group.astype(str)

Output:

                                             Category  UserId    Flag
0  onboarding segment-confirmation-unexpected-inp...    9090  Flag_1
1  onboarding segment-confirmation-unexpected-inp...    4545  Flag_1
2  product-availability cpf-request-unexpected-in...    3266  Flag_2
3  product-availability postalcode-validation-tru...    2894  Flag_3
4  product-availability postalcode-validation-tru...    2772  Flag_3
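
Keeping the first two chunks works here because every string has exactly three words. As a possible variation (a sketch, not part of the original answer), the trailing "view" or "origin" can be stripped with a regular expression, so the comparison key is whatever remains regardless of how many chunks there are. The regex pattern and the `key` variable name are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': [
        'onboarding segment-confirmation-unexpected-input origin',
        'onboarding segment-confirmation-unexpected-input view',
        'product-availability cpf-request-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input origin',
        'product-availability postalcode-validation-true-unexpected-input view',
    ],
    'UserId': [9090, 4545, 3266, 2894, 2772],
})

# Drop a trailing " view" or " origin"; what is left is the comparison key.
key = df['Category'].str.replace(r'\s+(?:view|origin)$', '', regex=True)

# A new group starts wherever the key differs from the previous row.
# The first row is compared against a missing value, so it always starts group 1.
group = key.ne(key.shift()).cumsum()

df['Flag'] = 'Flag_' + group.astype(str)
print(df['Flag'].tolist())
# ['Flag_1', 'Flag_1', 'Flag_2', 'Flag_3', 'Flag_3']
```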

0x6upsns #2

This worked for me:

df = pd.DataFrame({'Category': {0: 'onboarding segment-confirmation-unexpected-input origin',
  1: 'onboarding segment-confirmation-unexpected-input view',
  2: 'product-availability cpf-request-unexpected-input origin',
  3: 'product-availability postalcode-validation-true-unexpected-input origin',
  4: 'product-availability postalcode-validation-true-unexpected-input view'},
 'UserId': {0: 9090, 1: 4545, 2: 3266, 3: 2894, 4: 2772}})

# I chose 40 characters, but you can change this to fit your data
df['temp'] = df['Category'].str[:40]

# number each distinct prefix in order of first appearance, starting at 1
df['Flag'] = df.groupby(['temp'], sort=False).ngroup() + 1
df['Flag'] = 'Flag_' + df['Flag'].astype(str)
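
One thing to be aware of: `groupby(...).ngroup()` numbers each distinct value once, while the question asks for a comparison with the previous row. The two agree on the sample data, but they can differ if a value reappears later. The following sketch uses hypothetical data (the `key` column and its values are made up for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical data: the key 'a' reappears after 'b'.
demo = pd.DataFrame({'key': ['a', 'a', 'b', 'a']})

# ngroup() assigns one number per distinct key, wherever it occurs.
by_value = (demo.groupby('key', sort=False).ngroup() + 1).tolist()

# shift/cumsum starts a new number at every change from the previous row.
by_run = demo['key'].ne(demo['key'].shift()).cumsum().tolist()

print(by_value)  # [1, 1, 2, 1]
print(by_run)    # [1, 1, 2, 3]
```

If repeated values should share a flag, the `ngroup()` approach is the one to use; if every change from the previous row should increment the flag, the `shift`/`cumsum` approach matches that more closely.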

iqxoj9l9 #3

df1=df.Category.str.split(' ',expand=True).iloc[:,:-1]
df.assign(flag=df1.ne(df1.shift()).any(axis=1).cumsum().map('Flag_{}'.format))

Output:

                                          Category  UserId    flag
0  onboarding segment-confirmation-unexpected-inp...    9090  Flag_1
1  onboarding segment-confirmation-unexpected-inp...    4545  Flag_1
2  product-availability cpf-request-unexpected-in...    3266  Flag_2
3  product-availability postalcode-validation-tru...    2894  Flag_3
4  product-availability postalcode-validation-tru...    2772  Flag_3
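
Dropping the last column with `iloc[:, :-1]` removes the last word only when every row has the same number of words, because `expand=True` pads shorter rows with missing values. A minimal sketch of an alternative for rows of varying length, assuming the last word is always the part to ignore, is to split once from the right with `str.rsplit(n=1)`. The sample strings below are hypothetical.

```python
import pandas as pd

# Hypothetical categories with different word counts.
s = pd.Series(['alpha origin', 'alpha view', 'beta gamma origin', 'beta gamma view'])

# rsplit(n=1) splits once from the right, so .str[0] is everything before the last word.
key = s.str.rsplit(n=1).str[0]

# Increment the counter whenever the key changes from the previous row.
flags = key.ne(key.shift()).cumsum().map('Flag_{}'.format)
print(flags.tolist())
# ['Flag_1', 'Flag_1', 'Flag_2', 'Flag_2']
```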
