pandas 如何对 Dataframe 中的重复列求和,并在至少一个值为nan时返回nan

s8vozzvw  于 2022-12-02  发布在  其他
关注(0)|答案(2)|浏览(163)

I have a dataframe with duplicate columns (number not known a priori) like this example:
| | a | a | a | b | b |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | nan | 1 | 1 | 1 |
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()

The result i get is as follows, but it does not return NaN in the second row of column 'a'.
| | a | b |
| ------------ | ------------ | ------------ |
| 0 | 3 | 2 |
| 1 | 2 | 2 |
In the documentation of pandas.DataFrame.sum , there is the skipna parameter which might suit my case. But I am using the function pandas.core.groupby.GroupBy.sum which does not have this parameter, but the min_count which does what i want but the number is not known in advance and would be different for each duplicate column.

  • For example, a min_count=3 solves the problem for column 'a', but obviously returns NaN on the whole of column 'b'.*

The result I want to achieve is:
| | a | b |
| ------------ | ------------ | ------------ |
| 0 | 3 | 2 |
| 1 | nan | 2 |

juud5qan

juud5qan1#

一种解决方法是使用apply获取DataFrame.sum

df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))

输出量:

a    b
0  3.0  2.0
1  NaN  2.0
h79rfbju

h79rfbju2#

另一种可能的解决方案:

cols, ldf = df.columns.unique(), len(df)

pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (len(cols), ldf)), 
    columns=cols)

输出量:

a    b
0  3.0  2.0
1  NaN  2.0

相关问题