numpy 基于外部条件添加多个框行

a0zr77ik  于 2023-10-19  发布在  其他
关注(0)|答案(6)|浏览(154)

我有这个dataframe:

Env location lob      grid row server        model        make          slot
Prod USA     Market   AB3 bc2  Server123     Hitachi        dcs           1
Prod USA     Market   AB3 bc2  Server123     Hitachi        dcs           2
Prod USA     Market   AB3 bc2  Server123     Hitachi        dcs           3
Prod USA     Market   AB3 bc2  Server123     Hitachi       dcs           4
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc           3
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc          3
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc           3
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc           4
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc           5
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc           5
Dev  EMEA    Ins     AB6 bc4  Serverabc     IBM            abc           6
UAT  PAC     Retail   AB6 bc4  Serverzzz     Cisco          ust           3
UAT  PAC     Retail   BB6 bc4  Serverzzz     Cisco          ust           4
UAT  PAC     Retail   BB6 bc4  Serverzzz     Cisco          ust           5
UAT  PAC     Retail   BB6 bc4  Serverzzz     Cisco          ust           6

在本示例中:

  • 如果型号为IBM,则必须有8个插槽;因为slot从slot=3开始,所以它必须从3到10。在这种情况下,仅存在插槽3至6。
  • 因此,我需要再添加4行(插槽7、8、9、10)。
  • 如果型号为Cisco,则Cisco的行数需要为6。仅插槽3至6存在。
  • 因此,我需要再添加2行

新行:

  • 必须重复模型的最后一行,同时递增插槽编号
  • 他们的“网格”单元格必须显示“可用”。

这需要通过编程来完成,在给定模型的情况下,我需要知道插槽的总数,如果插槽的数量很短,我需要创建新的行。
最后一个数组应该是这样的:

Env location lob    grid      row    server        model       make        slot
Prod USA     Market AB3       bc2  Server123     Hitachi        dcs           1
Prod USA     Market AB3       bc2  Server123     Hitachi        dcs           2
Prod USA     Market AB3       bc2  Server123     Hitachi        dcs           3
Prod USA     Market AB3       bc2  Server123     Hitachi.       dcs           4
Dev  EMEA    Ins.   AB6       bc4  Serverabc     IBM            abc           3
Dev  EMEA    Ins.   AB6       bc4  Serverabc     IBM            abc           4
Dev  EMEA    Ins.   AB6       bc4  Serverabc     IBM            abc           5
Dev  EMEA    Ins.   AB6       bc4  Serverabc     IBM            abc           6
Dev  EMEA    Ins.   available bc4  Serverabc     IBM            abc           7
Dev  EMEA    Ins.   available bc4  Serverabc     IBM            abc           8
Dev  EMEA    Ins.   available bc4  Serverabc     IBM            abc           9
Dev  EMEA    Ins.   available bc4  Serverabc     IBM            abc           10
UAT  PAC     Retail   AB6     bc4  Serverzzz     Cisco          ust           3
UAT  PAC     Retail   BB6     bc4  Serverzzz     Cisco          ust           4
UAT  PAC     Retail   BB6     bc4  Serverzzz     Cisco          ust           5
UAT  PAC     Retail   BB6     bc4  Serverzzz     Cisco          ust           6
UAT  PAC     Retail  available bc4  Serverzzz     Cisco          ust          7
UAT  PAC     Retail  available bc4  Serverzzz     Cisco          ust          8

我试过这样的东西:

def slots(row):
   if 'IBM' in row['model']:
      number_row=8
   if 'Cisco' in row['model']:
      number_row=6

我该怎么做?

ljo96ir5

ljo96ir51#

我将使用groupby.apply向每个 model 添加新行。一般工作流程如下。
1.从每个模型中删除重复的插槽。
1.按'model'列对该框架进行分组。
1.对于每个模型,只有在以下情况下才能做任何事情
a.它是IBM或Cisco(通过它是否是N_slots字典中的键来识别)
B.槽数未达到要求的槽数(N_slots字典中的值)
1.如果满足(3),则使用reindex()添加新的空行。
1.为'slots'列分配槽值(例如,对于IBM,则为3-10)
1.在grid列中用“available”填充新创建的空行。
1.用上一行的值(ffill())填充所有其他新创建的行。
1.重置索引以删除重复索引。

def add_slots(s):
    # get model name
    model = s['model'].iat[0]
    # get how many slots there should be for this model
    slots = N_slots.get(model, 0)
    # where to start reindexing
    start = s.index[0]
    low = s['slot'].dropna().astype(int).min()
    # for pandas>=1.1, remove the previous line and uncomment the next line
    # low = s['slot'].min()

    if len(s) < slots:
        # add new indices
        s = s.reindex(range(start, start + slots))
        # assign slots
        s['slot'] = range(low, low + slots)
        # assign grids at newly created slots to be 'available'
        s['grid'] = s['grid'].fillna('available')
        
    return s


N_slots = {'IBM': 8, 'Cisco': 6}

new_df = (
    df.drop_duplicates(['server', 'model', 'slot'], ignore_index=True)          # remove duplicate slots
    .groupby('model', sort=False, group_keys=False).apply(add_slots)  # add new slots
    .ffill()                                                          # fill rest of the columns
    .sort_values(by=['server', 'slot'])
    .reset_index(drop=True)                                           # reset index
)
Env location     lob       grid  row     server    model make  slot
0   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     1
1   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     2
2   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     3
3   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     4
4    Dev     EMEA     Ins        AB6  bc4  Serverabc      IBM  abc     3
5    Dev     EMEA     Ins        AB6  bc4  Serverabc      IBM  abc     4
6    Dev     EMEA     Ins        AB6  bc4  Serverabc      IBM  abc     5
7    Dev     EMEA     Ins        AB6  bc4  Serverabc      IBM  abc     6
8    Dev     EMEA     Ins  available  bc4  Serverabc      IBM  abc     7
9    Dev     EMEA     Ins  available  bc4  Serverabc      IBM  abc     8
10   Dev     EMEA     Ins  available  bc4  Serverabc      IBM  abc     9
11   Dev     EMEA     Ins  available  bc4  Serverabc      IBM  abc    10
12   UAT      PAC  Retail        AB6  bc4  Serverzzz    Cisco  ust     3
13   UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     4
14   UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     5
15   UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     6
16   UAT      PAC  Retail  available  bc4  Serverzzz    Cisco  ust     7
17   UAT      PAC  Retail  available  bc4  Serverzzz    Cisco  ust     8
zengzsys

zengzsys2#

所提出的函数arrange_model_slots为所需模型的组提供处理,应用注入错过的插槽号和标记底层的逻辑,可选地附加行:

def arrange_model_slots(g, m_slotsize_map):
    if g.name not in m_slotsize_map:  # skip other models
        return g

    sl_valid_range = m_slotsize_map[g.name]  # total valid number of slots
    slots = g['slot'].unique()
    
    if slots.size < sl_valid_range:  # is there missed slots?
        nslots_to_append = sl_valid_range - slots.size
        nrows_to_append = sl_valid_range - g.index.size
        if nrows_to_append:  # replicate the last row if needed
            g = g.reindex(range(g.index[0], g.index[0] + sl_valid_range),
                          axis=0, method='ffill')

        # reorder slot numbers
        g['slot'] = range(slots.min(), slots.max() + nslots_to_append + 1)
        # mark the rows with injected slots as 'available' grid
        g.loc[g.index[-nslots_to_append:], 'grid'] = 'available'
    return g

model_slotsize_map = {'IBM': 8, 'Cisco': 6}
df = (df.groupby('model').apply(arrange_model_slots, model_slotsize_map)
      .reset_index(drop=True))

print(df)
Env location     lob       grid  row     server    model make  slot
0    UAT      PAC  Retail        AB6  bc4  Serverzzz    Cisco  ust     3
1    UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     4
2    UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     5
3    UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     6
4    UAT      PAC  Retail  available  bc4  Serverzzz    Cisco  ust     7
5    UAT      PAC  Retail  available  bc4  Serverzzz    Cisco  ust     8
6   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     1
7   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     2
8   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     3
9   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     4
10   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     3
11   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     4
12   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     5
13   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     6
14   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc     7
15   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc     8
16   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc     9
17   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc    10
rsaldnfx

rsaldnfx3#

需要填充的记录数可以在字典的第一行输入:

#dictionary to enter values into
d = {'IBM':8,'Cisco':6}

#series is created populating models that werent in the above dictionary
s = (pd.Series(d).reindex(df['model'].unique())
     .fillna(df.groupby('model')['slot'].count()))

#convert slot to numeric
df['slot'] = pd.to_numeric(df['slot'],errors = 'coerce')

#remove duplicated values
cols = ['server','model','slot']
df2 = df.drop_duplicates(cols)

#create index with model names and slot numbers
idx = (pd.MultiIndex.from_frame(
    df2.groupby(['server','model'],sort=False)
    .apply(lambda x: np.arange(s[x.name[-1]]) + x['slot'].min())
    .explode()
    .reset_index(),names = cols))

#reindex using the index created
df2 = (df2.set_index(cols)
       .reindex(idx)
       .fillna({'grid':'available'})
       .ffill()
       .reset_index()
       .reindex(df.columns,axis=1)
       .astype({'slot':'int'}))

输出量:

Env location     lob       grid  row     server    model make  slot
0   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     1
1   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     2
2   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     3
3   Prod      USA  Market        AB3  bc2  Server123  Hitachi  dcs     4
4    Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     3
5    Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     4
6    Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     5
7    Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc     6
8    Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc     7
9    Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc     8
10   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc     9
11   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM  abc    10
12   UAT      PAC  Retail        AB6  bc4  Serverzzz    Cisco  ust     3
13   UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     4
14   UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     5
15   UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco  ust     6
16   UAT      PAC  Retail  available  bc4  Serverzzz    Cisco  ust     7
17   UAT      PAC  Retail  available  bc4  Serverzzz    Cisco  ust     8
3hvapo4f

3hvapo4f4#

必填模块

import pandas as pd
import numpy as np

已用输入

在您的注解之后,所有列都以“object”类型开始。该解决方案对于包含字符串和整数混合的'slot'列是健壮的,如您的相关问题how do you convert data frame column values to integer中所述。

df = pd.DataFrame(columns = ['Env','location','lob','grid','row','server','model','make','slot'],
                  data = [['Prod','USA','Market','AB3','bc2','Server123','Hitachi','dcs','1'],
                          ['Prod','USA','Market','AB3','bc2','Server123','Hitachi','dcs','2'],
                          ['Prod','USA','Market','AB3','bc2','Server123','Hitachi','dcs','3'],
                          ['Prod','USA','Market','AB3','bc2','Server123','Hitachi','dcs','4'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc','3'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc.','3'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc','3'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc','4'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc','5'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc','5'],
                          ['Dev','EMEA','Ins.','AB6','bc4','Serverabc','IBM','abc','6'],
                          ['UAT','PAC','Retail','AB6','bc4','Serverzzz','Cisco','ust',3],
                          ['UAT','PAC','Retail','BB6','bc4','Serverzzz','Cisco','ust',4],
                          ['UAT','PAC','Retail','BB6','bc4','Serverzzz','Cisco','ust',5],
                          ['UAT','PAC','Retail','BB6','bc4','Serverzzz','Cisco','ust',6]])

df
     Env location     lob grid  row     server    model  make slot
0   Prod      USA  Market  AB3  bc2  Server123  Hitachi   dcs    1
1   Prod      USA  Market  AB3  bc2  Server123  Hitachi   dcs    2
2   Prod      USA  Market  AB3  bc2  Server123  Hitachi   dcs    3
3   Prod      USA  Market  AB3  bc2  Server123  Hitachi   dcs    4
4    Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM   abc    3
5    Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM  abc.    3
6    Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM   abc    3
7    Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM   abc    4
8    Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM   abc    5
9    Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM   abc    5
10   Dev     EMEA    Ins.  AB6  bc4  Serverabc      IBM   abc    6
11   UAT      PAC  Retail  AB6  bc4  Serverzzz    Cisco   ust    3
12   UAT      PAC  Retail  BB6  bc4  Serverzzz    Cisco   ust    4
13   UAT      PAC  Retail  BB6  bc4  Serverzzz    Cisco   ust    5
14   UAT      PAC  Retail  BB6  bc4  Serverzzz    Cisco   ust    6

df.dtypes
Env         object
location    object
lob         object
grid        object
row         object
server      object
model       object
make        object
slot        object
dtype: object

设置外部条件

开始,一个字典看起来适合预先设置目标,即。每个型号的任意最大插槽数。我们将在上面迭代:

# Define requirements
dict_N_slots = {'IBM'    : 8,
                'Cisco'  : 6,
                'Hitachi': 3,
                'Spam'   : 1} 
# Passing robustness test:
# "Hitachi": requesting less rows than already present
# "Spam"   : model that does not appear in the input data

方法:

1.使用从[]解析内部调用的def missing_rows()生成缺少的插槽
1.使用pd.concat将它们集成到嵌入式框架中
1.按升序对插槽重新编号
1.使用 sortfill 完成从初始行复制内容。

# 1. Create required number of rows
def missing_rows(model, target_n_slots, df):
    '''For a given model, create a minimal dataframe containing the calculated number of missing rows.'''
    return pd.DataFrame(index = np.arange(target_n_slots - len(df.loc[df['model']==model, 'slot'])), 
                        data  = {'model': model,
                                 'grid' : 'available'})

# 2. Integrate new rows to the original dataframe
def add_rows(df, dict_N_slots):
    '''Run the comprehension on missing rows'''
    return pd.concat([df, 
                      pd.concat([missing_rows(model          = model_i, 
                                              target_n_slots = n_slots_i, 
                                              df             = df) 
                                 for model_i, n_slots_i in dict_N_slots.items() 
                                 if  model_i            in df['model'].values])])

# 3. Reallocate slot numbers
def reindex_slots(df):
    '''Ensure slot integer dtype, then for each model, renumber slots in ascending sequence'''
    df['slot'] = pd.to_numeric(df['slot']).astype('Int64')
    for model in df['model'].unique():
        slots = df.loc[df['model']==model,'slot']
        df.loc[df['model']==model,'slot'] = np.arange(len(slots)) + slots.min()    
    return df

# 4. Finalize
def sort_fill(df):
    '''Ensure the relevant cells are duplicated into the new rows.'''
    return df.sort_values(by=['model', 'slot']).reset_index(drop=True).ffill()

df_final = sort_fill(reindex_slots(add_rows(df, dict_N_slots)))

**最终输出:**所有行已排序,内容为必填项。

Env location     lob       grid  row     server    model  make  slot
0    UAT      PAC  Retail        AB6  bc4  Serverzzz    Cisco   ust     3
1    UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco   ust     4
2    UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco   ust     5
3    UAT      PAC  Retail        BB6  bc4  Serverzzz    Cisco   ust     6
4    UAT      PAC  Retail  available  bc4  Serverzzz    Cisco   ust     7
5    UAT      PAC  Retail  available  bc4  Serverzzz    Cisco   ust     8
6   Prod      USA  Market        AB3  bc2  Server123  Hitachi   dcs     1
7   Prod      USA  Market        AB3  bc2  Server123  Hitachi   dcs     2
8   Prod      USA  Market        AB3  bc2  Server123  Hitachi   dcs     3
9   Prod      USA  Market        AB3  bc2  Server123  Hitachi   dcs     4
10   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM   abc     3
11   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM  abc.     4
12   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM   abc     5
13   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM   abc     6
14   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM   abc     7
15   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM   abc     8
16   Dev     EMEA    Ins.        AB6  bc4  Serverabc      IBM   abc     9
17   Dev     EMEA    Ins.  available  bc4  Serverabc      IBM   abc    10

灵感来源:

yc0p9oo0

yc0p9oo05#

如果我对你的问题理解正确的话,你可以为IBM和Cisco添加新的行,如下所示:

# this is just for IBM similar approach for Cisco
ibm_slots = df[df["model"] == "IBM"]['slot'].to_numpy()
for i in range(max(ibm_slots)+1,11): #hardcoded to 11
    new_row = {'Env': ['Dev'],
    'location': ['EMEA'],
    'lob': ['Ins.'],
    'grid': ['available'],
    'row': ['bc4'],
    'server': ['Serverabc'],
    'model': ['IBM'],
    'make': ['abc'],
    'slot': [i]}
    # df = df.append(new_row, ignore_index=True) # gives a FutureWarning 
    df = pd.concat([df, pd.DataFrame(new_row)])
df = df.reset_index().drop(columns=['index'])

这将给给予一个输出(然后你可以根据你的需要排序):

Env location     lob       grid  row     server     model make  slot
0   Prod      USA  Market        AB3  bc2  Server123   Hitachi  dcs     1
1   Prod      USA  Market        AB3  bc2  Server123   Hitachi  dcs     2
2   Prod      USA  Market        AB3  bc2  Server123   Hitachi  dcs     3
3   Prod      USA  Market        AB3  bc2  Server123  Hitachi.  dcs     4
4    Dev     EMEA    Ins.        AB6  bc4  Serverabc       IBM  abc     3
5    Dev     EMEA    Ins.        AB6  bc4  Serverabc       IBM  abc     4
6    Dev     EMEA    Ins.        AB6  bc4  Serverabc       IBM  abc     5
7    Dev     EMEA    Ins.        AB6  bc4  Serverabc       IBM  abc     6
8    UAT      PAC  Retail        AB6  bc4  Serverzzz     Cisco  ust     3
9    UAT      PAC  Retail        BB6  bc4  Serverzzz     Cisco  ust     4
10   UAT      PAC  Retail        BB6  bc4  Serverzzz     Cisco  ust     5
11   UAT      PAC  Retail        BB6  bc4  Serverzzz     Cisco  ust     6
12   Dev     EMEA    Ins.  available  bc4  Serverabc       IBM  abc     7
13   Dev     EMEA    Ins.  available  bc4  Serverabc       IBM  abc     8
14   Dev     EMEA    Ins.  available  bc4  Serverabc       IBM  abc     9
15   Dev     EMEA    Ins.  available  bc4  Serverabc       IBM  abc    10
qvk1mo1f

qvk1mo1f6#

两个主要观点:
1.应用numpy.repeat复制记录
1.应用.idxmax()查找模型的最后一条记录(或.last()查找排序数据)

model_slots = pd.Series({
    'IBM': 8, 
    'Cisco': 6,
})

df = df.astype({'slot': int}).drop_duplicates(['server','model','slot'])
model_groups = df.groupby('model')
last_row = model_groups['slot'].idxmax()      # index of the record with max slot in the group
additional_slots = (model_slots - model_groups.size()).where(lambda x: x > 0).dropna()

def repeat(model):
    return (
        pd.DataFrame(
            np.repeat(df.loc[[last_row[model]]], additional_slots[model], axis=0),   # need double brackets [[...]] to preserve 2-dimensional structure
            columns = df.columns,
        ).assign(
            slot = lambda df: df.slot + df.index + 1,    # new index starts from 0
            grid = 'available',
        )
    )

answer = pd.concat(
    [df, *(repeat(model) for model in additional_slots.index)],
).sort_values(['model','slot'], ignore_index=True)

相关问题