Creating a hierarchy from 4 columns in a dataframe - pandas

Posted by yizd12fk on 2023-06-04 in Other

The dataframe is shown below:

    ID     ParentID  Filter  Text
0   98     97        NULL    AA
1   99     98        NULL    BB
2   100    99        NULL    CC
3   107    100       1       DD
4   9999   1231      NULL    EE
5   10000  1334      NULL    FF
6   10001  850       2       GG
7   850    230       NULL    HH
8   230    121       NULL    II
9   121    96        NULL    JJ
10  96     0         NULL    KK
11  97     0         NULL    LL
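
To reproduce the example, the table above could be built as a DataFrame along these lines. This is only a sketch: it assumes that NULL in the table means a missing value (NaN) and that the leftmost column is the default integer index.

```python
import pandas as pd

nan = float("nan")  # assumption: NULL in the table stands for a missing value

df = pd.DataFrame({
    "ID":       [98, 99, 100, 107, 9999, 10000, 10001, 850, 230, 121, 96, 97],
    "ParentID": [97, 98, 99, 100, 1231, 1334, 850, 230, 121, 96, 0, 0],
    "Filter":   [nan, nan, nan, 1, nan, nan, 2, nan, nan, nan, nan, nan],
    "Text":     ["AA", "BB", "CC", "DD", "EE", "FF",
                 "GG", "HH", "II", "JJ", "KK", "LL"],
})

# Only the rows with a Filter value need a hierarchy.
print(df[df["Filter"].notna()])
```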

I need to add an extra Hierarchy column, like this:

    ID     ParentID  Filter  Text  Hierarchy
0   98     97        NULL    AA
1   99     98        NULL    BB
2   100    99        NULL    CC
3   107    100       1       DD    DD/CC/BB/AA/LL
4   9999   1231      NULL    EE
5   10000  1334      NULL    FF
6   10001  850       2       GG    GG/HH/II/JJ/KK
7   850    230       NULL    HH
8   230    121       NULL    II
9   121    96        NULL    JJ
10  96     0         NULL    KK
11  97     0         NULL    LL

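The rules, as I see them, are:
1. Fill the Hierarchy column only for rows whose Filter value is populated; the other rows do not need a hierarchy.
2. When a row with a non-null Filter value is found, take its ParentID and look that value up in the ID column. When it is found, keep walking up until the ParentID is 0.
3. I tried doing this with itertools, but the original dataset is very large, so looping takes too long.
4. The record set is about 200k rows.
Result (running the code below gives the following error):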

Mapper warning: Could not load the module “fastcluster”.
The module “scipy.cluster.hierarchy“ is used instead, but it will be slower.
The 'cmappertools' module could not be imported.
The 'cmappertools' module could not be imported.
Intrinsic metric is not available.
The 'cmappertools' module could not be imported.
Traceback (most recent call last):
  File "C:\Users\xx\graphtext.py", line 18, in <module>
    df.loc[m, 'Hierarchy'] = [
  File "C:\Users\xx\graphtext.py", line 19, in <listcomp>
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
  File "C:\Users\xx\graphtext.py", line 19, in <listcomp>
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
  File "C:\Users\xx\graphtext.py", line 19, in <listcomp>
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
AttributeError: module 'mapper' has no attribute 'get'
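
The last line of the traceback suggests that, in the script that was run, the name mapper referred to an imported module rather than to a pandas Series (the "Mapper warning" lines point the same way). This is an inference from the log, not something the post states. A minimal sketch of that failure mode, using a stand-in module object instead of the real package:

```python
import types

# Stand-in for an imported module that happens to be called "mapper"
# (hypothetical; the real script is not shown in the post).
mapper = types.ModuleType("mapper")

try:
    mapper.get(107)
except AttributeError as exc:
    message = str(exc)
    print(message)  # module 'mapper' has no attribute 'get'

# Once the name is bound to a lookup object, .get works as intended.
mapper = {107: "DD"}
print(mapper.get(107))  # DD
```

If that is what happened, defining mapper after the imports, or using a different variable name, would avoid the clash.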
jmo0nnb3  #1

This is a graph problem, which you can solve easily with networkx.

import networkx as nx

# rows that have a Filter value, and the IDs to resolve
m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

# ID -> Text lookup, built from all rows because the ancestors
# of a filtered row are not filtered rows themselves
mapper = df.set_index('ID')['Text']

# create graph (edges point from parent to child)
G = nx.from_pandas_edgelist(df, source='ParentID', target='ID',
                            create_using=nx.DiGraph)

# find roots
roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}

# retrieve hierarchy: one string per filtered node
df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
              for r in roots
              for p in nx.all_simple_paths(G, r, n)])
    for n in nodes
]
  • Note that if the graph branches, a node can have more than one hierarchy. In that case, all of them are returned, separated by ;.
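
To make the branching case concrete, here is a small sketch on a made-up graph (the node numbers and labels are hypothetical and not taken from the question). Node 3 can be reached from root 0 through node 1 or through node 2, so it gets two hierarchies:

```python
import networkx as nx

# Hypothetical toy graph: edges point from parent to child.
G = nx.DiGraph([(0, 1), (0, 2), (1, 3), (2, 3)])
labels = {1: "B1", 2: "B2", 3: "C"}  # made-up Text values

# Each path runs root -> ... -> node; p[:0:-1] reverses it and drops the root.
paths = nx.all_simple_paths(G, 0, 3)
hierarchy = ";".join(sorted("/".join(labels[x] for x in p[:0:-1]) for p in paths))
print(hierarchy)  # C/B1;C/B2
```

The sorted call is only there to make the order of the two paths deterministic in this sketch.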

Output:

       ID  ParentID  Filter Text       Hierarchy
0      98        97     NaN   AA             NaN
1      99        98     NaN   BB             NaN
2     100        99     NaN   CC             NaN
3     107       100     1.0   DD  DD/CC/BB/AA/LL
4    9999      1231     NaN   EE             NaN
5   10000      1334     NaN   FF             NaN
6   10001       850     2.0   GG  GG/HH/II/JJ/KK
7     850       230     NaN   HH             NaN
8     230       121     NaN   II             NaN
9     121        96     NaN   JJ             NaN
10     96         0     NaN   KK             NaN
11     97         0     NaN   LL             NaN

Graph: (image not available)

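Potential optimization

If the dataset is large, a potential optimization is to iterate only over the roots that belong to the same connected component as the node. You will have to try this on your real dataset to see whether it actually improves performance.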

import networkx as nx

m = df['Filter'].notna()
nodes = df.loc[m, 'ID']

# ID -> Text lookup, built from all rows so that ancestors can be resolved
mapper = df.set_index('ID')['Text']

G = nx.from_pandas_edgelist(df, source='ParentID', target='ID', create_using=nx.DiGraph)

roots = {n for n, deg in G.in_degree() if deg==0}
# {1231, 1334, 0}

# for every node, the roots found in its weakly connected component
roots_dict = {n: s&roots for s in nx.weakly_connected_components(G) for n in s}

# retrieve hierarchy: one string per filtered node
df.loc[m, 'Hierarchy'] = [
    ';'.join(['/'.join([mapper.get(x) for x in p[:0:-1]])
              for r in roots_dict[n]
              for p in nx.all_simple_paths(G, r, n)])
    for n in nodes
]
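
If every ID has exactly one parent, as in the sample data, the rules from the question can also be followed directly by walking up a parent lookup, without building a graph. This is not the approach used in the answer above, only a minimal sketch of an alternative. It assumes that each ID occurs once, that there is no branching, and that the walk ends when a ParentID is not found in the ID column (which covers ParentID 0). The function name build_hierarchy is made up for illustration.

```python
# Sample rows from the question as (ID, ParentID, Filter, Text); None stands for NULL.
rows = [
    (98, 97, None, "AA"), (99, 98, None, "BB"), (100, 99, None, "CC"),
    (107, 100, 1, "DD"), (9999, 1231, None, "EE"), (10000, 1334, None, "FF"),
    (10001, 850, 2, "GG"), (850, 230, None, "HH"), (230, 121, None, "II"),
    (121, 96, None, "JJ"), (96, 0, None, "KK"), (97, 0, None, "LL"),
]

parent_of = {i: p for i, p, _, _ in rows}  # ID -> ParentID
text_of = {i: t for i, _, _, t in rows}    # ID -> Text


def build_hierarchy(start, parent_of, text_of):
    """Join the Text of `start` and of each ancestor, child first, with '/'."""
    parts, seen = [], set()
    node = start
    # Stop when the ID is unknown (e.g. parent 0) or when a cycle is detected.
    while node in parent_of and node not in seen:
        seen.add(node)
        parts.append(text_of[node])
        node = parent_of[node]
    return "/".join(parts)


# Only rows with a Filter value get a hierarchy.
hierarchy = {i: build_hierarchy(i, parent_of, text_of)
             for i, _, f, _ in rows if f is not None}
print(hierarchy)
```

With a dataframe, the two lookups could be built with dict(zip(df['ID'], df['ParentID'])) and dict(zip(df['ID'], df['Text'])). Each walk costs only as many steps as the node is deep, which may matter for a dataset of around 200k rows.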
