python-3.x: Is there a way in pandas to sort rows according to a custom rule?

Here is my sample table:

 id  nv  ov        date
  1   1   0  01/02/2023
  1   2   1  01/02/2023
  1   5   4  01/03/2023
  1   1   3  01/02/2023
  1   4   1  01/02/2023
  1   3   2  01/02/2023
  1   6   5  01/03/2023
  1   7   6  01/04/2023
  1   7   7  01/04/2023

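Here nv is the new value and ov is the old value.
I want to sort all rows so that they form a correct change log between old and new values, i.e. the new value of the previous row should match the old value of the current row.
Also, when more than one row has an old value matching the current new value, only the choice that lets the log continue correctly should be taken. For example (writing rows as (nv, ov)), after (1, 0) I could go to either (2, 1) or (4, 1), but (2, 1) is the correct entry because it allows the log to cover all transitions. If (4, 1) were chosen as the next row at that point, the log would not cover all changes. The row with an empty old value will always be the first row.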

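(PS: I do have a date column, but several transitions happen on the same day. In that case, how do I determine the correct order of those transitions for the use case above? For example, the transitions from 1 to 2 and from 1 to 4 happen on the same day.)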

Can we do this in pandas/Python?
Expected output:

 id  nv  ov       date  rn
  1   1   0 01/02/2023   1
  1   2   1 01/02/2023   2
  1   3   2 01/02/2023   3
  1   1   3 01/02/2023   4
  1   4   1 01/02/2023   5
  1   5   4 01/03/2023   6
  1   6   5 01/03/2023   7
  1   7   6 01/04/2023   8
  1   7   7 01/04/2023   9

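You can solve this with a graph approach, with the help of networkx.
Here is your data as a directed graph, with one edge ov → nv per row:

0 → 1, 1 → 2, 2 → 3, 3 → 1, 1 → 4, 4 → 5, 5 → 6, 6 → 7, 7 → 7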

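Here we identify the root and the leaf (I am assuming each is unique) and use all_simple_edge_paths to compute the simple path from root to leaf, which excludes the cycles. We then find all cycles with simple_cycles and iterate over their edges with dfs_edges to add a sub-sorting key (here, the decimal part of the sort key).
This gives the following "order" for each edge (ov → nv), which we then use for sorting:

0 → 1: 1.0, 1 → 2: 1.1, 2 → 3: 1.2, 3 → 1: 1.3, 1 → 4: 2.0, 4 → 5: 3.0, 5 → 6: 4.0, 6 → 7: 5.0, 7 → 7: 5.1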
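
To make the two graph queries concrete, here is a minimal sketch that builds the same directed graph from the sample (ov, nv) pairs and prints what all_simple_edge_paths and simple_cycles return. The edge list is typed in by hand here rather than read from the DataFrame, and the cycles are sorted only so the printout is stable, since the rotation of each cycle is not guaranteed.

```python
import networkx as nx

# (ov, nv) pairs from the sample table, one directed edge per row
edges = [(0, 1), (1, 2), (4, 5), (3, 1), (1, 4),
         (2, 3), (5, 6), (6, 7), (7, 7)]
G = nx.DiGraph(edges)

# first (here: only) cycle-free edge path from the root 0 to the leaf 7
main_path = next(nx.all_simple_edge_paths(G, 0, 7))
print(main_path)

# every cycle, including the self-loop on 7, as a sorted list of nodes
cycles = sorted(sorted(c) for c in nx.simple_cycles(G))
print(cycles)
```

The main path skips the 1 → 2 → 3 → 1 detour, which is why the cycle edges need the fractional sub-keys described above.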

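Code for the sorter function, which adds an "order" column (preconditions: no duplicate edges and exactly one root and one leaf, although the code could be updated to handle more cases):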

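import networkx as nx
import numpy as np
import pandas as pd

def sorter(df):
    # root: the old value that never appears as a new value
    root = df.loc[~df['ov'].isin(df['nv']), 'ov'].squeeze()
    # leaf: identified here as the value with a self-transition (ov == nv)
    leaf = df.loc[df['ov'].eq(df['nv']), 'ov'].squeeze()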

    # build graph
    G = nx.from_pandas_edgelist(df, source='ov', target='nv',
                                create_using=nx.DiGraph)

    # find order of the edges from root to leafs, excluding cycles
    order = {e: i for i, e in enumerate(next(nx.all_simple_edge_paths(G, root, leaf)), start=1)}
    # get nodes that are in this normal path
    nodes = {n for k in order for n in k}

    # compute a step that will be used to add decimals for subsorting
    # this depends on the number of initial rows
    step = 10**-np.ceil(np.log10(df.shape[0]+1))
    # for each cycle
    for c in nx.simple_cycles(G):
        # get subgraph
        common = nodes & set(c)
        G2 = G.subgraph(c)
        for n in common:
            parent_edge = (next(G.predecessors(n)), n)
            edge = None
            # enumerate edges of the cycle in order and
            # add a sorting index from the parent edge + step
            for i, edge in enumerate(nx.dfs_edges(G2, source=n), start=1):
                order[edge] = order[parent_edge] + step * i
            if edge:
                order[(edge[1], n)] = order[parent_edge] + step * (i+1)
            else:
                order[(n, n)] = order[parent_edge] + step

    # build a sorting Series
    s = pd.Series(order).rename_axis(['ov', 'nv']).reset_index(name='order')
    
    # merge the order to original data and return
    return df.merge(s, on=['nv', 'ov'], how='left')
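
The step formula in sorter is easy to misread, so here is a small sketch of the arithmetic on its own. The helper name substep is hypothetical and uses the math module instead of NumPy; the point it illustrates is that the step is at most 1 / (n + 1), so for a single cycle the fractional offsets stay below 1 and never reach the next integer key.

```python
import math

def substep(n_rows):
    # same formula as `step` in sorter, written with the math module
    return 10 ** -math.ceil(math.log10(n_rows + 1))

for n in (9, 10, 99, 100):
    step = substep(n)
    # even n sub-steps in a row stay strictly below 1
    print(n, step, step * n < 1)
```

With the 9 sample rows the step is 0.1, which matches the 1.1, 1.2, 1.3 keys in the output.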

Then run:

out = sorter(df).sort_values(by=['date', 'order'])

If you want to sort per group, use the function in groupby.apply:

out = (df.groupby('id', group_keys=False).apply(sorter)
         .sort_values(by=['id', 'date', 'order'])
       )

Output:

   id  nv  ov        date  order
0   1   1   0  01/02/2023    1.0
1   1   2   1  01/02/2023    1.1
5   1   3   2  01/02/2023    1.2
3   1   1   3  01/02/2023    1.3
4   1   4   1  01/02/2023    2.0
2   1   5   4  01/03/2023    3.0
6   1   6   5  01/03/2023    4.0
7   1   7   6  01/04/2023    5.0
8   1   7   7  01/04/2023    5.1
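
One way to sanity-check a result like this is to test the rule from the question directly: every row's ov should equal the previous row's nv. The sketch below does that with shift; the helper name is_valid_chain is hypothetical, and the frame is rebuilt by hand from the output above so the snippet runs on its own.

```python
import pandas as pd

# nv/ov columns in the sorted order shown above
out_check = pd.DataFrame({'nv': [1, 2, 3, 1, 4, 5, 6, 7, 7],
                          'ov': [0, 1, 2, 3, 1, 4, 5, 6, 7]})

def is_valid_chain(frame):
    # compare each row's ov with the previous row's nv, skipping the first row
    prev_nv = frame['nv'].shift()
    return bool((frame['ov'].iloc[1:] == prev_nv.iloc[1:]).all())

print(is_valid_chain(out_check))
```

This only checks adjacency within one group; for several ids it would need to be applied per group.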

Reproducible input:

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 1, 1],
                   'nv': [1, 2, 5, 1, 4, 3, 6, 7, 7],
                   'ov': [0, 1, 4, 3, 1, 2, 5, 6, 7],
                   'date': ['01/02/2023', '01/02/2023', '01/03/2023', '01/02/2023', '01/02/2023', '01/02/2023', '01/03/2023', '01/04/2023', '01/04/2023']})
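
As a side note, the requirement that the log uses every transition exactly once is the definition of an Eulerian path in the graph. The sketch below is not the method used in the answer above: it is a plain-Python Hierholzer walk over the (ov, nv) pairs, under the assumptions that the date and id columns are ignored and that exactly one start value exists. The function name order_transitions is hypothetical.

```python
from collections import defaultdict

def order_transitions(pairs):
    """Order (ov, nv) pairs into one chain that uses every pair once."""
    out_edges = defaultdict(list)
    in_deg = defaultdict(int)
    for ov, nv in pairs:
        out_edges[ov].append(nv)
        in_deg[nv] += 1

    # start value: one more outgoing than incoming transition
    starts = [n for n in list(out_edges)
              if len(out_edges[n]) - in_deg[n] == 1]
    if len(starts) != 1:
        raise ValueError("expected exactly one start value")

    # Hierholzer: follow unused edges, emit a node once it is exhausted
    stack, walk = [starts[0]], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()

    chain = list(zip(walk, walk[1:]))
    if len(chain) != len(pairs):
        raise ValueError("transitions do not form a single chain")
    return chain

# (ov, nv) pairs in the same row order as the reproducible input
pairs = [(0, 1), (1, 2), (4, 5), (3, 1), (1, 4),
         (2, 3), (5, 6), (6, 7), (7, 7)]
chain = order_transitions(pairs)
print(chain)
```

On the sample data this walk visits the 1 → 2 → 3 → 1 cycle before moving on to 4, the same order as the expected output. When several valid chains exist it returns only one of them, so the date column would still be needed to break ties.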
