I have a dataframe with duplicate columns (number not known a priori) like this example:
| | a | a | a | b | b |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | NaN | 1 | 1 | 1 |
I need to aggregate the duplicated columns by summing their values row-wise, returning NaN whenever at least one of the duplicates is NaN in that row.
I have tried this code:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])
df = df.groupby(axis=1, level=0).sum()
```
The result I get is as follows, but it does not return NaN in row 1 of column 'a':
| | a | b |
| ------------ | ------------ | ------------ |
| 0 | 3 | 2 |
| 1 | 2 | 2 |
In the documentation of `pandas.DataFrame.sum` there is the `skipna` parameter, which would suit my case. But I am using `pandas.core.groupby.GroupBy.sum`, which does not have this parameter; it only has `min_count`, which almost does what I want, except that the required count is not known in advance and differs for each group of duplicated columns.

- For example, `min_count=3` solves the problem for column 'a', but obviously returns NaN for the whole of column 'b'.
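To illustrate the `min_count` dead end concretely (a sketch on the example frame above; the transpose-then-groupby idiom is used here instead of `groupby(axis=1)`, which is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])

# min_count=3 demands 3 non-NaN values per group: that is right for the
# three 'a' columns, but the two 'b' columns can never reach 3, so every
# 'b' cell becomes NaN.
out = df.T.groupby(level=0).sum(min_count=3).T
```

Here `out` has a 3 in row 0 of 'a', NaN in row 1 of 'a', and NaN everywhere in 'b', confirming that a single `min_count` cannot serve groups of different sizes.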
The result I want to achieve is:
| | a | b |
| ------------ | ------------ | ------------ |
| 0 | 3 | 2 |
| 1 | NaN | 2 |
---

**2 Answers**

**Answer 1**
One workaround is to use `apply`, so that each group of duplicated columns is summed with `DataFrame.sum`, which does accept the `skipna` parameter.
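A sketch of this apply-based workaround on the question's example frame (the transpose avoids the deprecated `groupby(axis=1)`; assumes a recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])

# Transpose so duplicated column names become index labels, group by
# name, and let DataFrame.sum with skipna=False propagate any NaN.
out = df.T.groupby(level=0).apply(lambda g: g.sum(skipna=False)).T
print(out)
#      a    b
# 0  3.0  2.0
# 1  NaN  2.0
```

Each group `g` is a small DataFrame of the duplicated columns (transposed), so `g.sum(skipna=False)` sums it row-group by row-group and yields NaN as soon as one value is missing, which is exactly the requested behaviour.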
**Answer 2**

Another possible solution: sum the groups first, then mask every cell whose group contained a NaN in that row.
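A sketch of this sum-then-mask approach on the question's example frame (again transposing instead of the deprecated `groupby(axis=1)`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])

# 1) Sum the duplicated columns, skipping NaN as groupby.sum does.
sums = df.T.groupby(level=0).sum().T
# 2) Flag cells whose group had at least one NaN in that row.
has_nan = df.isna().T.groupby(level=0).any().T
# 3) Blank out the flagged cells.
out = sums.mask(has_nan)
```

This keeps both passes vectorised (no Python-level `apply`), at the cost of grouping the frame twice; for wide frames that trade-off is usually favourable.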