如何从Pandas Dataframe 中只获取初始NaN值和前导非NaN值?

mnowg1ta  于 2022-12-09  发布在  其他
关注(0)|答案(2)|浏览(166)

I have a dataframe where the rows contain NaN values. The df contains original columns namely Heading 1 Heading 2 and Heading 3 and extra columns called Unnamed: 1 Unnamed: 2 and Unnamed: 3 as shown:
| Heading 1 | Heading 2 | Heading 3 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| NaN | 34 | 24 | 45 | NaN | NaN |
| NaN | NaN | 24 | 45 | 11 | NaN |
| NaN | NaN | NaN | 45 | 45 | 33 |
| 4 | NaN | 24 | NaN | NaN | NaN |
| NaN | NaN | 4 | NaN | NaN | NaN |
| NaN | 34 | 24 | NaN | NaN | NaN |
| 22 | 34 | 24 | NaN | NaN | NaN |
| NaN | 34 | NaN | 45 | NaN | NaN |
I want to iterate through each row and find out the amount of initial NaN values in original columns (Heading 1 Heading 2 and Heading 3) and the amount of non NaN values in the extra columns (Unnamed: 1 Unnamed: 2 and Unnamed: 3). For each and every row this should be calculated and returned in a dictionary where the key is the index of the row and the value for that key is a list containing the amount of initial NaN values in original columns (Heading 1 Heading 2 and Heading 3) and the second element of the list would the amount of non NaN values in the extra columns (Unnamed: 1 Unnamed: 2 and Unnamed: 3).
So the result for the above dataframe would be:

{0 : [1, 1], 
1 : [2, 2], 
2 : [3, 3], 
3 : [0, 0], 
4 : [2, 0], 
5 : [1, 0],
6 : [0, 0],
7 : [1, 1]}

Notice how in row 3 and row 7 the original columns contain 1 and 2 NaN respectively but only the initial NaN's are counted and not the in between ones!

UPDATE / RESULTS:

Both @mozaway and @Panda Kim gave the correct solution for the current dataframe but @mozway solution does not work at all for another test dataframe.
@Panda Kim gave 2 solutions but both the methods he gave (cumsum() and x.first_valid_index()) are giving slightly different results for the different dataframe.
| Heading 1 | Heading 2 | Heading 3 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| NaN | 34 | 24 | 45 | NaN | NaN | NaN |
| NaN | NaN | 24 | 45 | 11 | NaN | NaN |
| NaN | NaN | NaN | 45 | 45 | 33 | NaN |
| 4 | NaN | 24 | NaN | NaN | NaN | NaN |
| NaN | NaN | 4 | NaN | NaN | NaN | NaN |
| NaN | 34 | 24 | NaN | NaN | NaN | NaN |
| 22 | 34 | 24 | NaN | NaN | NaN | NaN |
| NaN | 34 | NaN | 45 | NaN | NaN | NaN |
| NaN | NaN | NaN | NaN | 12 | 22 | 45 |
| NaN | NaN | NaN | NaN | NaN | 11 | 69 |
| NaN | NaN | NaN | NaN | 12 | NaN | 45 |
| NaN | NaN | NaN | NaN | NaN | NaN | 45 |
| NaN | NaN | NaN | NaN | NaN | 44 | NaN |
For the above df here are the results:
@Panda KIM (first_valid_index())

{0: [1, 1], 1: [2, 2], 2: [3, 3], 3: [0, 0], 4: [2, 0], 5: [1, 0], 6: [0, 0], 7: [1, 1], 8: [3, 3], 9: [3, 2], 10: [3, 2], 11: [3, 1], 12: [3, 1]}

@Panda Kim (cumsum())

{0: [1, 1], 1: [2, 2], 2: [3, 3], 3: [0, 0], 4: [2, 0], 5: [1, 0], 6: [0, 0], 7: [1, 1], 8: [4, 3], 9: [5, 2], 10: [4, 2], 11: [6, 1], 12: [5, 1]}

@mozway solution

{0: [1, 1], 1: [2, 2], 2: [3, 3], 3: [0, 0], 4: [2, 0], 5: [1, 0], 6: [0, 0], 7: [1, 1], 8: [3, 0], 9: [3, 0], 10: [3, 0], 11: [3, 0], 12: [3, 0]}
ha5z0ras

ha5z0ras1#

您可以用途:

m = df.columns.str.startswith('Unnamed')

out = (df
   .groupby(m, axis=1)
   .apply(lambda g: (g.notna() if g.name else g.isna())
                     .cummin(axis=1).sum(axis=1)
          )
   .set_axis(['named', 'unnamed'], axis=1)
 )

输出量:

named  unnamed
0      1        1
1      2        2
2      3        3
3      0        0
4      2        0
5      1        0
6      0        0
7      1        1

作为字典

out.T.to_dict('list')

输出量:

{0: [1, 1],
 1: [2, 2],
 2: [3, 3],
 3: [0, 0],
 4: [2, 0],
 5: [1, 0],
 6: [0, 0],
 7: [1, 1]}
n6lpvg4x

n6lpvg4x2#

First

divide dataframe (iloc or filter or and so on)

df1 = df.iloc[:, :3]
df2 = df.iloc[:, 3:]

Second

count initial NaNs in df1 and count notnull in df2

s1 = df1.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1)
s2 = df2.notnull().sum(axis=1)

Last

concat and make dict

pd.concat([s1, s2], axis=1).T.to_dict('list')

result:

{0: [1, 1],
 1: [2, 2],
 2: [3, 3],
 3: [0, 0],
 4: [2, 0],
 5: [1, 0],
 6: [0, 0],
 7: [1, 1]}

Update

data = [[None, 34.0, 24.0, 45.0, None, None, None],
        [None, None, 24.0, 45.0, 11.0, None, None],
        [None, None, None, 45.0, 45.0, 33.0, None],
        [4.0, None, 24.0, None, None, None, None],
        [None, None, 4.0, None, None, None, None],
        [None, 34.0, 24.0, None, None, None, None],
        [22.0, 34.0, 24.0, None, None, None, None],
        [None, 34.0, None, 45.0, None, None, None],
        [None, None, None, None, 12.0, 22.0, 45.0],
        [None, None, None, None, None, 11.0, 69.0],
        [None, None, None, None, 12.0, None, 45.0],
        [None, None, None, None, None, None, 45.0],
        [None, None, None, None, None, 44.0, None]]
col = ['Heading 1', 'Heading 2', 'Heading 3', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
df = pd.DataFrame(data, columns=col)

df1 = df.iloc[:, :3]
df2 = df.iloc[:, 3:]
s1 = df1.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1)
s2 = df2.notnull().sum(axis=1)
pd.concat([s1, s2], axis=1).T.to_dict('list')

result:

{0: [1, 1],
 1: [2, 2],
 2: [3, 3],
 3: [0, 0],
 4: [2, 0],
 5: [1, 0],
 6: [0, 0],
 7: [1, 1],
 8: [3, 3],
 9: [3, 2],
 10: [3, 2],
 11: [3, 1],
 12: [3, 1]}

Anyone can know that this is different from questioner's result (@Panda Kim (cumsum())) .
Of course, if function is not applied to df1 , the result is different.
Let's apply cumsum code to df instead of df1 for wrong result:

df2 = df.iloc[:, 3:]
s1 = df.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1) # apply cumsum to df instead df1
s2 = df2.notnull().sum(axis=1)
pd.concat([s1, s2], axis=1).T.to_dict('list')

wrong result(same to questioner's result that he think result of my code)

{0: [1, 1],
 1: [2, 2],
 2: [3, 3],
 3: [0, 0],
 4: [2, 0],
 5: [1, 0],
 6: [0, 0],
 7: [1, 1],
 8: [4, 3],
 9: [5, 2],
 10: [4, 2],
 11: [6, 1],
 12: [5, 1]}

It is common for the person to apply and get different results, but that should be checked by the person himself before endless question.

相关问题