pandas: How do I create hierarchical paths from id columns in a Python DataFrame?

Posted by uz75evzq on 2023-01-24 in Python

I have a DataFrame with the columns parent_id, parent_name, id, name and last_category. df looks like this:

parent_id   parent_name id      name    last_category
NaN         NaN         1       b       0
1           b           11      b1      0
11          b1          111     b2      0
111         b2          1111    b3      0
1111        b3          11111   b4      1
NaN         NaN         2       a       0
2           a           22      a1      0
22          a1          222     a2      0
222         a2          2222    a3      1
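
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the missing parents are stored as NaN, which makes parent_id a float column.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above.
# Assumption: a root row has NaN in parent_id and parent_name.
df = pd.DataFrame({
    "parent_id": [np.nan, 1, 11, 111, 1111, np.nan, 2, 22, 222],
    "parent_name": [np.nan, "b", "b1", "b2", "b3", np.nan, "a", "a1", "a2"],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})
print(df)
```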

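For every row of df where last_category is 1, I want to build the hierarchical path from the root down to that last category, so the new DataFrame (df_last) should look like this: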

name_path                id_path
b / b1 / b2 / b3 / b4    1 / 11 / 111 / 1111 / 11111
a / a1 / a2 / a3         2 / 22 / 222 / 2222

How can I do this?

Answer #1 (x33g5p2x)

A solution that uses only numpy and pandas:

import numpy as np
import pandas as pd

# It's easier if we index the dataframe with the `id`
# I assume this ID is unique
df = df.set_index("id")

# `parents[i]` returns the parent ID of `i`
parents = df["parent_id"].to_dict()

paths = {}

# Find all nodes with last_category == 1
for id_ in df.query("last_category == 1").index:
    child_id = id_
    path = [child_id]
    
    # Iteratively travel up the hierarchy until the parent is nan
    while True:
        pid = parents[id_]
        if np.isnan(pid):
            break
        else:
            path.append(pid)
            id_ = pid

    # The path to the child node is the reverse of
    # the path we traveled
    paths[int(child_id)] = np.array(path[::-1], dtype="int")

Build the result DataFrame:

result = pd.DataFrame({
    id_: (
        " / ".join(df.loc[pids, "name"]),
        " / ".join(pids.astype("str"))
    )
    for id_, pids in paths.items()
}, index=["name_path", "id_path"]).T
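
The two steps above can also be folded into one helper. The following is a minimal sketch rather than a definitive implementation: the function name `build_paths`, the `sep` parameter and the cycle guard are illustrative additions, and the sample frame is rebuilt from the question on the assumption that roots carry NaN in parent_id.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumption: roots have NaN as parent_id)
df = pd.DataFrame({
    "parent_id": [np.nan, 1, 11, 111, 1111, np.nan, 2, 22, 222],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})


def build_paths(df, sep=" / "):
    """Hypothetical helper: one row of paths per node with last_category == 1."""
    indexed = df.set_index("id")  # assumes `id` is unique
    parents = indexed["parent_id"].to_dict()
    names = indexed["name"].to_dict()

    rows = []
    for leaf in indexed.index[indexed["last_category"] == 1]:
        path, seen, node = [], set(), leaf
        # Climb until the parent is missing (NaN/None) or a node repeats
        while pd.notna(node) and node not in seen:
            seen.add(node)
            path.append(int(node))
            node = parents.get(node)
        path.reverse()  # we walked leaf -> root, the result is root -> leaf
        rows.append({
            "name_path": sep.join(names[i] for i in path),
            "id_path": sep.join(str(i) for i in path),
        })
    return pd.DataFrame(rows, columns=["name_path", "id_path"])


result = build_paths(df)
print(result)
```

Unlike the snippet above, this version leaves the caller's df untouched (it does not reassign it to the indexed frame), and the `seen` set stops the loop if the data accidentally contains a cycle.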

Answer #2 (b4lqfgs4)

You can use networkx and resolve the paths between the root nodes and the leaf nodes with the all_simple_paths function.

# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)
nx.set_node_attributes(G, df.set_index('id')[['name']].to_dict('index'))

# Find roots of your graph (a root is a node with no input)
roots = [node for node, degree in G.in_degree() if degree == 0]

# Find leaves of your graph (a leaf is a node with no output)
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Find all paths
paths = []
for root in roots:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            # [1:] to remove NaN parent_id
            paths.append({'id_path': ' / '.join(str(n) for n in path[1:]),
                          'name_path': ' / '.join(G.nodes[n]['name'] for n in path[1:])})

out = pd.DataFrame(paths)

Output:

>>> out
                       id_path              name_path
0  1 / 11 / 111 / 1111 / 11111  b / b1 / b2 / b3 / b4
1          2 / 22 / 222 / 2222       a / a1 / a2 / a3
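
Note that the code above collects every leaf of the graph, which happens to coincide with the last_category == 1 rows in this sample. If the two can differ, one possible variation is sketched below. It assumes the hierarchy is a tree (each id has at most one parent) and keeps NaN out of the graph by using only rows that have a parent as edges; the sample frame is rebuilt from the question.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Sample data from the question (assumption: roots have NaN as parent_id)
df = pd.DataFrame({
    "parent_id": [np.nan, 1, 11, 111, 1111, np.nan, 2, 22, 222],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Only rows with a parent become edges, so no NaN node enters the graph
edges = df.dropna(subset=["parent_id"]).astype({"parent_id": "int64"})
G = nx.from_pandas_edgelist(edges, source="parent_id", target="id",
                            create_using=nx.DiGraph)
G.add_nodes_from(df["id"].tolist())  # keep roots that have no children
names = df.set_index("id")["name"].to_dict()

rows = []
for leaf in df.loc[df["last_category"] == 1, "id"].tolist():
    # In a tree, exactly one ancestor (or the leaf itself) has no incoming edge
    candidates = nx.ancestors(G, leaf) | {leaf}
    root = next(n for n in candidates if G.in_degree(n) == 0)
    path = nx.shortest_path(G, root, leaf)
    rows.append({"id_path": " / ".join(str(n) for n in path),
                 "name_path": " / ".join(names[n] for n in path)})

out = pd.DataFrame(rows)
print(out)
```

Because the targets are taken from the last_category column instead of the out-degree, a leaf that is not flagged as a last category is simply skipped.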
