pandas: find all DataFrame matches on one column to get combinations

osh3o9ms  posted on 2023-01-11  in Other

Suppose I have a DataFrame base_df with the following contents:

   0   1   2   3
0  2 'A' 'B' NaN
1  2 'A' 'C' NaN
2  2 'A' NaN 'D'
3  2 'A' NaN 'E'
4  2 'A' NaN 'F'
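
For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is only a minimal sketch: it assumes integer column labels 0 to 3 and real NaN values (not the string 'NaN') in the empty cells.

```python
import numpy as np
import pandas as pd

# Assumption: integer column labels 0..3, NaN for the empty cells.
base_df = pd.DataFrame(
    [
        [2, "A", "B", np.nan],
        [2, "A", "C", np.nan],
        [2, "A", np.nan, "D"],
        [2, "A", np.nan, "E"],
        [2, "A", np.nan, "F"],
    ],
    columns=range(4),
)

# Column 0 matches the number of non-null cells in columns 1..3 of each row.
filled = base_df[[1, 2, 3]].count(axis=1)
print(base_df)
```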

How can I expand the cells and columns (preferably without iterating) to produce:

   0   1   2   3
0  2 'A' 'B' NaN
1  2 'A' 'C' NaN
2  2 'A' NaN 'D'
3  2 'A' NaN 'E'
4  2 'A' NaN 'F'
5  3 'A' 'B' 'D'
6  3 'A' 'C' 'D'
7  3 'A' 'B' 'E'
8  3 'A' 'C' 'E'
9  3 'A' 'B' 'F'
10 3 'A' 'C' 'F'

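I can handle column 0 just fine with base_df.count(axis=1), but my solutions usually force me to iterate over the rows with .iterrows(). Is there a better way to do this in pandas?
Edit: I managed to solve it, although it is still not fast enough to be worthwhile: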

import pandas as pd

DF = pd.DataFrame

in_df = <A STRING-NAN DF>
colspan = len(d.PG_LANGS) + 1
cols = range(1, colspan)
for keep_len in range(3, len(d.PG_LANGS) + 1):
    out_df: DF = DF(columns=range(colspan))
    print('KEEP LEN:', keep_len)
    for dex_a in cols:
        for dex_b in cols:
            if dex_a == dex_b:
                continue
            a_df: DF = in_df[in_df[dex_a].notna()]
            sansb_df: DF = a_df[a_df[dex_b].isna()]
            withb_df: DF = a_df[a_df[dex_b].notna()]
            shared_as: set[str] = \
                set(sansb_df[dex_a]) & set(withb_df[dex_a])  # type: ignore
            for sha in shared_as:
                sansb: DF = \
                    sansb_df[sansb_df[dex_a] == sha]  # type: ignore
                withb: DF = \
                    withb_df[withb_df[dex_a] == sha]  # type: ignore
                # print('SANS', sansb.shape[0])
                # print('WITH', withb.shape[0])
                if sansb.shape[0] == 0:
                    continue
                if withb.shape[0] == 0:
                    continue
                sansb = \
                    pd.concat([sansb] * withb.shape[0],  # type: ignore
                              axis=0, ignore_index=True)
                withb = \
                    pd.concat([withb] * sansb.shape[0],  # type: ignore
                              axis=0, ignore_index=True)
                sansb[dex_b] = withb[dex_b]
                sansb.drop_duplicates(ignore_index=True, inplace=True)
                # print(sansb)
                out_df = \
                    pd.concat([out_df, sansb], axis=0,  # type: ignore
                              ignore_index=True, sort=False)
    out_df.reset_index()
    out_df[0] = out_df.count(axis=1)  # type: ignore
    out_df.drop_duplicates(ignore_index=True, inplace=True)
    print(out_df)
    in_df = out_df

j2datikz  #1

You can use itertools.product and concat:

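from itertools import product

import pandas as pd

# df is assumed to be the question's frame with its columns renamed to A, B, C, D
cols = ['B', 'C', 'D']

# Append every combination of the distinct non-null values in each column;
# the leading [3] hard-codes the count column for the new rows
out = pd.concat([df,
                 pd.DataFrame(product([3], *[df[col].dropna().unique()
                                             for col in cols]),
                              columns=['A']+cols)
                ])

print(out)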

Output:

   A  B    C    D
0  2  A    B  NaN
1  2  A    C  NaN
2  2  A  NaN    D
3  2  A  NaN    E
4  2  A  NaN    F
0  3  A    B    D
1  3  A    B    E
2  3  A    B    F
3  3  A    C    D
4  3  A    C    E
5  3  A    C    F
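
The snippet in this answer hard-codes the count 3 and keeps the repeated index labels from the two frames. A possible variation, sketched here under the assumption that the frame uses the column names A to D from this answer, derives the count from the non-null cells and renumbers the rows. The name `combos` is introduced only for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd

# Assumption: same data as the question, with columns renamed to A..D.
df = pd.DataFrame(
    {
        "A": [2, 2, 2, 2, 2],
        "B": ["A"] * 5,
        "C": ["B", "C", np.nan, np.nan, np.nan],
        "D": [np.nan, np.nan, "D", "E", "F"],
    }
)

cols = ["B", "C", "D"]

# Cartesian product of the distinct non-null values in each column.
combos = pd.DataFrame(
    list(product(*(df[col].dropna().unique() for col in cols))),
    columns=cols,
)
# Count the filled cells per row instead of hard-coding 3.
combos.insert(0, "A", combos.count(axis=1))

# ignore_index=True renumbers the rows instead of repeating index labels.
out = pd.concat([df, combos], ignore_index=True)
print(out)
```

Like the original snippet, this combines values across the whole frame, so it only fits the case where every row shares the same value in column B.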

23c0lvtd  #2

This should work:

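# Cross join (needs pandas >= 1.2): rows with a value in column 2 x values found in column 3
new_lines = df.loc[df[2].notna(), [0, 1, 2]].merge(df.loc[df[2].isna(), [3]], how='cross')
new_lines[0] += 1  # each combined row has one more filled cell
df = pd.concat([df, new_lines])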

Output:

   0  1    2    3
0  2  A    B  NaN
1  2  A    C  NaN
2  2  A  NaN    D
3  2  A  NaN    E
4  2  A  NaN    F
0  3  A    B    D
1  3  A    B    E
2  3  A    B    F
3  3  A    C    D
4  3  A    C    E
5  3  A    C    F
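
Note that a cross join pairs every row holding a column 2 value with every row holding a column 3 value, regardless of column 1. That is fine here because column 1 is always 'A'. If the frame held several keys, one option would be to join on column 1 instead, so that combinations are only formed within the same key. The sketch below assumes this, using a hypothetical second key 'Z' that is not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the question's rows plus an invented key "Z".
df = pd.DataFrame(
    [
        [2, "A", "B", np.nan],
        [2, "A", "C", np.nan],
        [2, "A", np.nan, "D"],
        [2, "A", np.nan, "E"],
        [2, "A", np.nan, "F"],
        [2, "Z", "X", np.nan],
        [2, "Z", np.nan, "Y"],
    ],
    columns=range(4),
)

left = df.loc[df[2].notna(), [0, 1, 2]]  # rows that carry a column 2 value
right = df.loc[df[2].isna(), [1, 3]]     # rows that carry a column 3 value

# Inner join on column 1: pairs are formed only between rows sharing a key.
new_lines = left.merge(right, on=1)
new_lines[0] += 1

out = pd.concat([df, new_lines], ignore_index=True)
print(out)
```

With this data the join adds six rows for 'A' and one for 'Z', whereas a cross join would also pair 'A' rows with 'Z' values.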
