pandas: split a row into separate rows if it contains at least two non-NaN values

ttygqcqt  asked on 2023-10-14  in Other

I'm trying to transform my data into the desired output format, following the requirements below.
Requirements:

  • Each row may keep at most one non-NaN value (except for *Trh1* and *Trh2*, which may stay on the same row)
  • For performance reasons, I want to avoid approaches that iterate over every row.
  • I have only included four Trh columns as an example; in the real scenario there are many more such columns.

  • Input:

| Index | Schema | Column | Trh1 | Trh2 | Trh3 | Trh4 |
| --|--|--|--|--|--|--|
| 0 | schema_1 | col_1 | NaN | 0.01 | NaN | NaN |
| 1 | schema_2 | col_2 | 0.02 | 0.03 | NaN | NaN |
| 2 | schema_3 | col_3 | 0.03 | 0.04 | 0.05 | NaN |
| 3 | schema_4 | col_4 | NaN | NaN | 0.06 | 0.07 |

  • Expected output:

| Index | Schema | Column | Trh1 | Trh2 | Trh3 | Trh4 |
| --|--|--|--|--|--|--|
| 0 | schema_1 | col_1 | NaN | 0.01 | NaN | NaN |
| 1 | schema_2 | col_2 | 0.02 | 0.03 | NaN | NaN |
| 2 | schema_3 | col_3 | 0.03 | 0.04 | NaN | NaN |
| 3 | schema_3 | col_3 | NaN | NaN | 0.05 | NaN |
| 4 | schema_4 | col_4 | NaN | NaN | 0.06 | NaN |
| 5 | schema_4 | col_4 | NaN | NaN | NaN | 0.07 |
I explored the approach in Split row into 2 based on condition pandas. However, it only handles splitting a row when neither of the two columns involved contains a NaN value.
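
For reference, the input frame above can be reconstructed roughly as follows (column names and values taken from the table; the blanks are assumed to be NaN):

import numpy as np
import pandas as pd

# sample input rebuilt from the table above
df = pd.DataFrame({
    'Index': [0, 1, 2, 3],
    'Schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
    'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
    'Trh1': [np.nan, 0.02, 0.03, np.nan],
    'Trh2': [0.01, 0.03, 0.04, np.nan],
    'Trh3': [np.nan, np.nan, 0.05, 0.06],
    'Trh4': [np.nan, np.nan, np.nan, 0.07],
})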

gev0vcfq

gev0vcfq1#

Handling the jumps

import numpy as np
import pandas as pd

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
special = ['Trh1', 'Trh2']
others = list(df.columns.difference(cols))

out = (df
   # flag rows whose split columns (Trh3, Trh4, ...) are all NaN
   .assign(init=lambda d: d[others].isna().all(axis=1))
   [cols+['init']+others]
   # stacking drops NaNs, leaving one value per stacked row
   .set_index(cols).stack().to_frame()
   # number the stacked values within each original row
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   .set_index('n', append=True)[0]
   .unstack(-2)
   .reset_index()
)

# blank out Trh1/Trh2 on the extra rows created by the split
out.loc[out['init'].isna(), special] = np.nan

out = out.drop(columns=['n', 'init'])

# drop helper rows that carry no value at all
out = out.dropna(subset=special+others, how='all')

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04   NaN   NaN
3      2  schema_3  col_3   NaN   NaN  0.05   NaN
5      3  schema_4  col_4   NaN   NaN  0.06   NaN
6      3  schema_4  col_4   NaN   NaN   NaN  0.07
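
If you also want the contiguous 0-5 row numbering shown in the expected output, an optional final reset of the index would do it:

out = out.reset_index(drop=True)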

Original answer

You can combine reshaping via stack/unstack with de-duplication:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df
   # stack and remove NaNs
   .set_index(cols).stack().to_frame()
   # deduplicate
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   # reshape to original shape
   .set_index('n', append=True)[0]
   .unstack(-2)
   # cleanup
   .reset_index()
   .drop(columns='n')
)

# add back the rows that were dropped entirely because they had no values
out = pd.concat([df[df[df.columns.difference(cols)].isna().all(axis=1)], out],
                ignore_index=True).sort_values(by='Index') # optional
  • *Note: this requires that there are no duplicates within the initial cols.*

Alternatively, use melt, which might consume more memory but is more robust if you do have duplicates:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df.melt(cols)
       # drop NAs, except first row per group
       .loc[lambda d: d['value'].notna() | ~d[cols].duplicated()]
       # de-duplicate
       .assign(n=lambda d: d.groupby(cols, dropna=False).cumcount())
       # reshape
       .pivot(index=cols+['n'], columns='variable', values='value')
       # cleanup
       .reset_index().rename_axis(index=None, columns=None)
      )

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04  0.05   NaN
3      3  schema_4  col_4   NaN   NaN  0.06   NaN
4      3  schema_4  col_4   NaN   NaN   NaN  0.07
20jt8wwn

20jt8wwn2#

This should also work.

c = ['Trh1', 'Trh2']

(df.set_index(['Schema', 'Column'], append=True)
   .stack()                 # long format, NaNs dropped
   .to_frame()
   # number each stacked value per original row, but leave Trh1/Trh2
   # un-numbered (NaN) so they stay together on one output row
   .assign(cc=lambda x: x.groupby(level=[0]).cumcount()
                         .where(~x.index.isin(c, level=-1)))
   .set_index('cc', append=True)[0]
   .unstack(level=-2)       # back to wide format
   .droplevel([0, -1])
   .reset_index())

Output:

     Schema Column  Trh1  Trh2  Trh3  Trh4
0  schema_1  col_1   NaN  0.01   NaN   NaN
1  schema_2  col_2  0.02  0.03   NaN   NaN
2  schema_3  col_3  0.03  0.04   NaN   NaN
3  schema_3  col_3   NaN   NaN  0.05   NaN
4  schema_4  col_4   NaN   NaN  0.06   NaN
5  schema_4  col_4   NaN   NaN   NaN  0.07
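
Note that this snippet assumes Schema and Column are the only identifier columns. If Index is a regular column, as in the question, one possible adaptation (a sketch, not part of the original answer) is to include it in set_index so it survives the reshape:

c = ['Trh1', 'Trh2']

out = (df.set_index(['Index', 'Schema', 'Column'], append=True)
         .stack()
         .to_frame()
         .assign(cc=lambda x: x.groupby(level=[0]).cumcount()
                               .where(~x.index.isin(c, level=-1)))
         .set_index('cc', append=True)[0]
         .unstack(level=-2)
         .droplevel([0, -1])   # drop the helper RangeIndex level and cc
         .reset_index())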
vh0rcniy

vh0rcniy3#

import pandas as pd

d1 = pd.DataFrame({'schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
                   'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
                   'trh1': [None, 2, 3, None],
                   'trh2': [1, 3, 4, None],
                   'trh3': [None, None, 5, 6],
                   'trh4': [None, None, None, 7]})

# keep trh1/trh2 together, split trh3 and trh4 into their own frames,
# and stack everything back together
d2 = pd.concat([d1.drop(['trh3', 'trh4'], axis=1),
                d1[['Column', 'schema', 'trh3']],
                d1[['Column', 'schema', 'trh4']]])

# drop rows where every value column is NaN, then renumber
d2.dropna(subset=d2.columns.difference(['schema', 'Column']), how='all', inplace=True)

d2.reset_index(drop=True, inplace=True)

Split off each of the trh3 and trh4 columns separately (keeping trh1 and trh2 combined) and concatenate them. Then drop the all-NaN rows and reset the index.
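
Since the question mentions that the real data has many more such columns, the same idea can be generalized without hard-coding each column. A sketch along those lines (keep_together and split_cols are illustrative names, not from the answer above):

id_cols = ['schema', 'Column']
keep_together = ['trh1', 'trh2']                    # columns that may share a row
split_cols = [c for c in d1.columns
              if c not in id_cols + keep_together]  # trh3, trh4, ...

# one part keeps trh1/trh2 together, plus one part per remaining column
parts = [d1[id_cols + keep_together]] + [d1[id_cols + [c]] for c in split_cols]
d2 = pd.concat(parts)
d2 = d2.dropna(subset=d2.columns.difference(id_cols), how='all')
d2 = d2.reset_index(drop=True)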
