Pandas: replace duplicate values on a subset of a df with NaN, but keep the rows

nxagd54h  posted on 2022-12-02 in Other

I have seen this question in a few other posts, but I can't seem to apply the answers to my case.
I have a df that looks like this:

A       B       C            D      E
--------------------------------------
Mark    NY      Confirmed    Buy    10 
Mark    NY      Confirmed    Buy    22 
Mark    NY      Confirmed    Buy    40 
John    NY      N/A          Sell   55 
John    NY      N/A          Buy    30 
Karl    LA      Confirmed    Buy    22 
Karl    LA      Confirmed    Buy    66 
Karl    LA      Confirmed    Buy    25

and I would like to remove the duplicated values without losing the rows, to get something like this:

A       B       C            D      E
--------------------------------------
Mark    NY      Confirmed    Buy    10 
                                    22 
                                    40 
John    NY      N/A          Sell   55 
                             Buy    30 
Karl    LA      Confirmed    Buy    22 
                                    66 
                                    25

Any help?

ruoxqz4g


I reproduced your pandas.DataFrame object as follows:

import io

import numpy as np
import pandas as pd

# df format
my_df_str = """A       B       C            D      E
--------------------------------------
Mark    NY      Confirmed    Buy    10 
Mark    NY      Confirmed    Buy    22 
Mark    NY      Confirmed    Buy    40 
John    NY      N/A          Sell   55 
John    NY      N/A          Buy    30 
Karl    LA      Confirmed    Buy    22 
Karl    LA      Confirmed    Buy    66 
Karl    LA      Confirmed    Buy    25 
"""

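# Drop the dashed separator line, then parse the whitespace-delimited text.
# keep_default_na=False keeps the literal "N/A" as a string instead of NaN.
my_df_str = my_df_str.replace('-', '')
df = pd.read_csv(io.StringIO(my_df_str), sep=r'\s+', keep_default_na=False)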

This gives me:

      A   B          C     D   E
0  Mark  NY  Confirmed   Buy  10
1  Mark  NY  Confirmed   Buy  22
2  Mark  NY  Confirmed   Buy  40
3  John  NY        N/A  Sell  55
4  John  NY        N/A   Buy  30
5  Karl  LA  Confirmed   Buy  22
6  Karl  LA  Confirmed   Buy  66
7  Karl  LA  Confirmed   Buy  25

Then find the duplicated rows and replace the repeated values in the four columns with NaN:

# Rows where all four columns repeat an earlier row: blank A, B, C and D.
df.loc[df.duplicated(["A", "B", "C", "D"]), ["A", "B", "C", "D"]] = np.nan
# Rows where the first three columns repeat an earlier row: blank A, B and C, but keep D.
df.loc[df.duplicated(["A", "B", "C"]), ["A", "B", "C"]] = np.nan

This gives me:

      A    B          C     D   E
0  Mark   NY  Confirmed   Buy  10
1   NaN  NaN        NaN   NaN  22
2   NaN  NaN        NaN   NaN  40
3  John   NY        N/A  Sell  55
4   NaN  NaN        NaN   Buy  30
5  Karl   LA  Confirmed   Buy  22
6   NaN  NaN        NaN   NaN  66
7   NaN  NaN        NaN   NaN  25
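
The two `df.loc[...]` lines above can be generalised to any ordered list of columns. The following is only a sketch: `blank_repeated_prefixes` is a hypothetical helper name, not a pandas function, and it goes slightly further than the two lines above by checking every column prefix (`A`, then `A, B`, and so on). On the sample data the result is the same. The frame is built from a dict here so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def blank_repeated_prefixes(df, cols):
    """Return a copy of df in which a value in cols[i] is set to NaN when the
    combination cols[0..i] already appeared in an earlier row.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    for i, col in enumerate(cols):
        prefix = cols[: i + 1]
        # duplicated() is evaluated on the untouched input frame, so values
        # already replaced in `out` cannot affect later comparisons.
        out.loc[df.duplicated(prefix), col] = np.nan
    return out


df = pd.DataFrame(
    {
        "A": ["Mark", "Mark", "Mark", "John", "John", "Karl", "Karl", "Karl"],
        "B": ["NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA"],
        "C": ["Confirmed"] * 3 + ["N/A"] * 2 + ["Confirmed"] * 3,
        "D": ["Buy", "Buy", "Buy", "Sell", "Buy", "Buy", "Buy", "Buy"],
        "E": [10, 22, 40, 55, 30, 22, 66, 25],
    }
)

result = blank_repeated_prefixes(df, ["A", "B", "C", "D"])
print(result)
```

Because the helper works on a copy, the original `df` keeps all of its values and can still be used for further computation.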

To make it look exactly like the df you want, I replaced the NaN values with empty strings "":

df = df.fillna("")

This gives me:

      A   B          C     D   E
0  Mark  NY  Confirmed   Buy  10
1                             22
2                             40
3  John  NY        N/A  Sell  55
4                        Buy  30
5  Karl  LA  Confirmed   Buy  22
6                             66
7                             25
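
One thing to keep in mind: `df.duplicated` compares each row against all earlier rows, not only the one directly above it. In the sample data the repeated values are always adjacent, so this makes no difference. If the same combination could reappear further down and should then be shown again, a variant that only blanks adjacent repeats may be closer to what is wanted. The sketch below assumes that requirement; `blank_adjacent_repeats` and the small example frame are made up for illustration.

```python
import pandas as pd


def blank_adjacent_repeats(df, cols, filler=""):
    """Return a copy of df in which a value in cols[i] is replaced by `filler`
    when cols[0..i] all equal the values in the row directly above.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    # Running flag: True while every column seen so far matches the row above.
    same_as_above = pd.Series(True, index=df.index)
    for col in cols:
        # shift() moves values down one row; the first row compares against
        # NaN and is therefore never treated as a repeat.
        same_as_above &= df[col].eq(df[col].shift())
        out[col] = df[col].mask(same_as_above, filler)
    return out


# Made-up data: "Mark NY" comes back after a "John" row.
df = pd.DataFrame(
    {
        "A": ["Mark", "Mark", "John", "Mark"],
        "B": ["NY", "NY", "NY", "NY"],
        "E": [10, 22, 55, 40],
    }
)

out = blank_adjacent_repeats(df, ["A", "B"])
print(out)
```

Here the last row keeps `Mark` and `NY`, because the row directly above it belongs to `John`, whereas `df.duplicated(["A", "B"])` would flag that row as a duplicate of the first one.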
