pandas: How to get the unique values from each column of a DataFrame

jvlzgdj9  posted on 2023-08-01  in Other

I am working with a DataFrame that looks like this:

import pandas as pd
from pandas import DataFrame

sample = DataFrame([
    {'ID': 'no1', 'B': 'Eric', 'C': 'George', 'D': 'a'},
    {'ID': 'no1', 'B': 'Eric', 'C': 'George', 'D': 'b'},
    {'ID': 'no1', 'B': 'Eric', 'C': 'George', 'D': 'c'},
    {'ID': 'no1', 'B': 'Eric', 'C': 'Genna', 'D': 'a'},
    {'ID': 'no1', 'B': 'Eric', 'C': 'Genna', 'D': 'b'},
    {'ID': 'no1', 'B': 'Eric', 'C': 'Genna', 'D': 'c'},
    {'ID': 'no1', 'B': 'aa', 'C': 'George', 'D': 'a'},
    {'ID': 'no1', 'B': 'aa', 'C': 'George', 'D': 'b'},
    {'ID': 'no1', 'B': 'aa', 'C': 'George', 'D': 'c'},
    {'ID': 'no1', 'B': 'aa', 'C': 'Genna', 'D': 'a'},
    {'ID': 'no1', 'B': 'aa', 'C': 'Genna', 'D': 'b'},
    {'ID': 'no1', 'B': 'aa', 'C': 'Genna', 'D': 'c'},
    {'ID': 'no2', 'B': 'Cythina', 'C': 'Oliver', 'D': 'x'},
    {'ID': 'no2', 'B': 'Cythina', 'C': 'Oliver', 'D': 'y'},
    {'ID': 'no2', 'B': 'Cythina', 'C': 'Olivia', 'D': 'x'},
    {'ID': 'no2', 'B': 'Cythina', 'C': 'Olivia', 'D': 'y'},
    {'ID': 'no2', 'B': 'Ben', 'C': 'Oliver', 'D': 'x'},
    {'ID': 'no2', 'B': 'Ben', 'C': 'Oliver', 'D': 'y'},
    {'ID': 'no2', 'B': 'Ben', 'C': 'Olivia', 'D': 'x'},
    {'ID': 'no2', 'B': 'Ben', 'C': 'Olivia', 'D': 'y'},
])

It currently looks like this:

    ID  B       C       D
0   no1 Eric    George  a
1   no1 Eric    George  b
2   no1 Eric    George  c
3   no1 Eric    Genna   a
4   no1 Eric    Genna   b
5   no1 Eric    Genna   c
6   no1 aa      George  a
7   no1 aa      George  b
8   no1 aa      George  c
9   no1 aa      Genna   a
10  no1 aa      Genna   b
11  no1 aa      Genna   c
12  no2 Cythina Oliver  x
13  no2 Cythina Oliver  y
14  no2 Cythina Olivia  x
15  no2 Cythina Olivia  y
16  no2 Ben     Oliver  x
17  no2 Ben     Oliver  y
18  no2 Ben     Olivia  x
19  no2 Ben     Olivia  y


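Columns B, C, and D are unrelated to one another. For each ID, I want the unique values of each column listed independently: the unique values in column B, the unique values in column C, and the unique values in column D, like this: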

    ID  B       C       D
0   no1 Eric    George  a
1   no1 aa      Genna   b
2   no1 NULL    NULL    c
3   no2 Cythina Oliver  x
4   no2 Ben     Olivia  y


Some IDs may have 13 unique values under B, none under C, and 5 under D, so the number of unique values differs by ID and by column.

qzwqbdag  #1

IIUC, you can try itertools.zip_longest:

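from itertools import zip_longest

import pandas as pd

def fn(x):
    # Unique values of each column within one ID group
    b = x['B'].unique()
    c = x['C'].unique()
    d = x['D'].unique()
    # zip_longest pads the shorter columns with None
    return pd.DataFrame(zip_longest(b, c, d), columns=['B', 'C', 'D'])

# Drop the per-group row counter (index level 1), then restore ID as a column
out = sample.groupby('ID').apply(fn).droplevel(level=1).reset_index()
print(out)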

Prints:

    ID        B       C  D
0  no1     Eric  George  a
1  no1       aa   Genna  b
2  no1     None    None  c
3  no2  Cythina  Oliver  x
4  no2      Ben  Olivia  y
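
The same zip_longest idea can be written without hard-coding the column names. The following is only a sketch under stated assumptions: the helper name unique_per_group and the small demo frame are hypothetical and not part of the answer above, and missing values are dropped before collecting uniques (the answer's fn does not do that).

```python
from itertools import zip_longest

import pandas as pd


def unique_per_group(df, key, value_cols):
    """Hypothetical helper: list each column's unique values side by side, per group."""
    pieces = []
    for group_id, group in df.groupby(key, sort=False):
        # Unique non-missing values of every requested column in this group
        uniques = [group[col].dropna().unique() for col in value_cols]
        # zip_longest pads the shorter columns with None
        block = pd.DataFrame(list(zip_longest(*uniques)), columns=value_cols)
        block.insert(0, key, group_id)
        pieces.append(block)
    return pd.concat(pieces, ignore_index=True)


# Small illustrative frame (made up for this sketch)
demo = pd.DataFrame({
    "ID": ["no1", "no1", "no1", "no2", "no2"],
    "B": ["Eric", "aa", "Eric", "Ben", "Ben"],
    "C": ["George", "George", "Genna", "Oliver", "Olivia"],
    "D": ["a", "b", "c", "x", "x"],
})
result = unique_per_group(demo, "ID", ["B", "C", "D"])
print(result)
```

Looping over the groups and concatenating avoids the extra index level that groupby(...).apply produces, so no droplevel step is needed.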

ssm49v7z  #2

Here is one way:

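# `sample` is the DataFrame from the question
(sample.set_index('ID')
 # Keep first occurrences; duplicates are checked per column across the whole frame
 .where(lambda x: x.apply(lambda x: ~x.duplicated()))
 # Long format; this relies on stack() dropping the masked (NaN) entries
 .stack()
 .to_frame()
 # Position of each surviving value within its (ID, column) pair
 .assign(cc = lambda x: x.groupby(level=[0,1]).cumcount())
 .set_index('cc',append=True)[0]
 # Spread the column names (index level 1) back out into columns
 .unstack(level=1)
 .droplevel(1)
 .reset_index())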

Output:

    ID        B       C  D
0  no1     Eric  George  a
1  no1       aa   Genna  b
2  no1      NaN     NaN  c
3  no2  Cythina  Oliver  x
4  no2      Ben  Olivia  y
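
One thing to be aware of with this approach: duplicated() is evaluated on each whole column, not per ID, so a value that occurs under two different IDs would only be kept for the first one. That does not happen in the sample data. If it can happen in yours, a per-ID variant might look like the sketch below. It swaps in melt, drop_duplicates, and pivot instead of the where/stack chain; the demo frame and the helper column names col, val, and pos are made up for illustration, and passing a list as the pivot index assumes pandas 1.1 or later.

```python
import pandas as pd

# Illustrative frame: "Eric" deliberately appears under both IDs
demo = pd.DataFrame({
    "ID": ["no1", "no1", "no2", "no2"],
    "B": ["Eric", "aa", "Eric", "Ben"],
    "C": ["George", "George", "Oliver", "Olivia"],
})

# Long format: one row per (ID, column name, value)
long_df = (demo.melt(id_vars="ID", var_name="col", value_name="val")
               .dropna(subset=["val"])
               # De-duplicate within each ID and column, not across the whole frame
               .drop_duplicates(["ID", "col", "val"]))

# Position of each unique value within its (ID, column) pair
long_df["pos"] = long_df.groupby(["ID", "col"]).cumcount()

# Back to wide format; (ID, pos) pairs with no value become NaN
wide = (long_df.pivot(index=["ID", "pos"], columns="col", values="val")
               .reset_index()
               .drop(columns="pos")
               .rename_axis(columns=None))
print(wide)
```

Note that pivot sorts the resulting columns by name, so the original column order is only preserved here because B and C are already in alphabetical order.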
