pandas: transforming a DataFrame that has a tree structure

Posted by qxsslcnc on 2023-01-15 in Other

I have some tree-structured data, given as a DataFrame:

   level   id  parent_id  type text
0      1    1       <NA>  node    a
1      2   11          1  node    b
2      2   12          1  node    c
3      2   13          1  leaf    d
4      3  111         11  leaf    e
5      3  121         12  leaf    f
6      3  122         12  leaf    g
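
For reproducibility, the frame above can be rebuilt roughly as follows. This is a minimal sketch: the nullable `Int64` dtype for `parent_id` is an assumption inferred from the `<NA>` in the printed output, not something stated in the question.

```python
import pandas as pd

# Sample data copied from the printed frame. The Int64 dtype for parent_id
# is assumed from the <NA> shown for the root row.
df = pd.DataFrame(
    {
        "level": [1, 2, 2, 2, 3, 3, 3],
        "id": [1, 11, 12, 13, 111, 121, 122],
        "parent_id": pd.array([None, 1, 1, 1, 11, 12, 12], dtype="Int64"),
        "type": ["node", "node", "node", "leaf", "leaf", "leaf", "leaf"],
        "text": list("abcdefg"),
    }
)
print(df)
```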

I would like to get a DataFrame like the following, with one row per leaf:

level       1              2                3            leaf           
attributes id  type text  id  type text    id  type text   id  type text
0           1  node    a  11  node    b   111  leaf    e  111  leaf    e
1           1  node    a  12  node    c   121  leaf    f  121  leaf    f
2           1  node    a  12  node    c   122  leaf    g  122  leaf    g
3           1  node    a  13  leaf    d  <NA>   NaN  NaN   13  leaf    d

My current solution looks like this:

from functools import reduce

import pandas as pd


def join_fn(x, y):
    i, df1 = x
    j, df2 = y
    return (
        j,
        pd.merge(df1, df2, left_on=f"id_{i}", right_on=f"parent_id_{j}", how="outer"),
    )

dfs = list(df.groupby("level"))
dfs = [
    (i, df.rename(columns={col: col + f"_{i}" for col in df.columns})) for i, df in dfs
]

_, dfr = reduce(join_fn, dfs)
dfr = dfr.filter([col for col in dfr.columns if col.startswith(("id", "text", "type"))])
idx = dfr.columns.str.split("_", expand=True)
dfr.columns = idx.swaplevel()

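This produces the per-level columns of the target frame, but not the `leaf` columns.
How can I get the last three columns, i.e. the ones that collect the leaves?
I am also open to improvements to my current code.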

5q4ezhmt #1

Here is one possible solution:

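import pandas as pd


def merge(ldf, rdf, lsuffix, rsuffix=None):
    # Attach every parent row in rdf to its children in ldf. The right join
    # keeps parents without children, such as the leaf with id 13.
    return ldf.merge(
        rdf,
        how='right',
        left_on='parent_id',
        right_on='id',
        suffixes=(lsuffix, rsuffix),
    ).drop(
        columns=[f'parent_id{lsuffix}', f'level{lsuffix}'],
    )


# Reverse the column order so that, after the final reversal below, each
# level's columns come out as id, type, text.
df = df[df.columns[::-1]]

# Start from the deepest level and walk up towards the root.
res = df[df['level'] == df['level'].max()]
for lev in range(df['level'].max() - 1, 1, -1):
    res = merge(res, df[df['level'] == lev], f'_{lev + 1}')

res = merge(res, df[df['level'] == 1], '_2', '_1')
res = res.drop(columns=['parent_id_1', 'level_1'])
res = res[res.columns[::-1]]

# The leaf is the deepest non-missing node in each row: append an empty
# column and forward-fill along the row to pull that value into it.
for prefix in ('id', 'type', 'text'):
    # Work on a copy so the helper column is not added to a slice of res.
    sub_res = res[[c for c in res.columns if c.startswith(prefix)]].copy()
    sub_res[f'{prefix}_leaf'] = [pd.NA] * len(sub_res)
    res[f'{prefix}_leaf'] = sub_res.ffill(axis=1)[f'{prefix}_leaf']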
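
The code above leaves flat column names such as `id_1` or `text_leaf`, whereas the question asks for two header rows. One way to get there, sketched under assumptions: the small `res` frame below is a hypothetical stand-in holding two rows of the expected result, and the level names `level` and `attributes` are taken from the desired output in the question.

```python
import pandas as pd

# Hypothetical stand-in for the flat result: two rows of the expected output,
# using the column names the code above produces.
res = pd.DataFrame(
    {
        "id_1": [1, 1],
        "type_1": ["node", "node"],
        "text_1": ["a", "a"],
        "id_2": [11, 13],
        "type_2": ["node", "leaf"],
        "text_2": ["b", "d"],
        "id_3": pd.array([111, None], dtype="Int64"),
        "type_3": ["leaf", None],
        "text_3": ["e", None],
        "id_leaf": [111, 13],
        "type_leaf": ["leaf", "leaf"],
        "text_leaf": ["e", "d"],
    }
)

# "id_2" -> ("2", "id"): split on the last underscore and put the level first.
res.columns = pd.MultiIndex.from_tuples(
    [tuple(reversed(col.rsplit("_", 1))) for col in res.columns],
    names=["level", "attributes"],
)
print(res)
```

Note that the level labels end up as strings (`"1"`, `"2"`, `"3"`, `"leaf"`), so a group is selected with `res["leaf"]` rather than `res[1]`.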
