如何在pandas dataframe中查找所有祖先并计算它们的平均值(包括原始用户)

798qvoo8  于 2023-08-01  发布在  其他
关注(0)|答案(1)|浏览(87)

我遇到了一个问题。我创造了一个拥有数百万用户的游戏。每个用户都是默认的“团队领导者”,但允许将其领导者更改为任何其他用户。多个用户可以在这里创建循环依赖。所以我有一个大的数据框,看起来像(这里是一个例子):

user_id | leader_id | power
---|---|----
1 | 1 | 10
2 | 3 | 20
3 | 2 | 30
4 | 5 | 40
5 | 5 | 50

字符串
我对整个团队不感兴趣,只对你团队中排名低于你的人感兴趣。因此,我感兴趣的“团队”应该是这样的(注意,用户4在自己的团队中,因为他们没有领导任何人,但用户5有一个团队[4,5],因为他们领导用户4)。

user_id | leader_id | power | team
---|---|----|----
1 | 1 | 10 | [1]
2 | 3 | 20 | [2, 3]
3 | 2 | 30 | [2, 3]
4 | 5 | 40 | [4]
5 | 5 | 50 | [4, 5]


现在,我要做的就是创建一个列,这是你的团队的平均功率水平。这听起来很简单,但我无法让它在数百万行数据上快速运行。我试过使用networkx,但它基本上是通过每个人的步骤,并重新计算整个树的每一次。我试着写自己的函数来缓存它,但我仍然在挣扎。我写了一个带有测试的示例脚本。如果有人能帮助我,我将不胜感激。我愿意接受任何其他计算方法。当我运行这个脚本时,networkx方法大约需要53秒,而另一种方法大约需要38秒。我希望能更快地得到它,因为这只是在50,000行样本数据上。

import time

import networkx as nx
import numpy as np
import pandas as pd

from tests.pandas_test_utils import assert_frame_equal

def get_team_power_using_networkx(df):
    G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())
    ancestors = {node: nx.ancestors(G, node) for node in G.nodes()}

    user_ids = df['user_id'].values

    team_ids_list = [list(ancestors.get(user_id, set()) | {user_id}) for user_id in user_ids]

    hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
    avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
    df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))

    return df

def get_team_power_using_custom(df):
    G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())

    def get_ancestors(node):
        if node in ancestors:
            return ancestors[node]
        ancestors[node] = set()
        for parent in G.predecessors(node):
            ancestors[node].add(parent)
            ancestors[node].update(get_ancestors(parent))
        return ancestors[node]

    ancestors = {}
    user_ids = df['user_id'].values

    team_ids_list = [list(get_ancestors(user_id) | {user_id}) for user_id in user_ids]

    hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
    avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
    df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))

    return df

## Test if it works
"""
user 1 is alone. team is [1] and average power is [10]
user 2 and 3 have each other as leaders. both have teams consisting of [2, 3] and average power is [25]
user 4 reports to user 5. team is [4] and average power is [40]
user 5 reports to themself. team is [4, 5, 6, 7] and average power is [53] -> (40 + 50 + 70) / 3
user 6 reports to user 5. team is [6, 7] and average power is [70]
user 7 reports to user 6 who reports to user 5. team is [7] and average power is [70]
user 8 reports to themself. team is [8] and average power is [np.nan]
"""
user_id = [1, 2, 3, 4, 5, 6, 7, 8]
leader_id = [1, 3, 2, 5, 5, 5, 6, 8]
power = [10, 20, 30, 40, 50, np.nan, 70, np.nan]
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
    print(f"Testing: {func.__name__}")
    results = func(df)[["user_id", "team_power"]]
    expected_results = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6, 7, 8], "team_power": [10, 25, 25, 40, 53, 70, 70, np.nan]})
    assert_frame_equal(results, expected_results)

## Test speed
num_rows = 50000
np.random.seed(42)
# Generate random values for each column
user_id = np.random.randint(1, 1000, size=num_rows)
leader_id = np.random.randint(1, 1000, size=num_rows)
power = np.random.uniform(0, 100, size=num_rows)
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
    start_time = time.time()  # Record the start time
    func(df)
    end_time = time.time()  # Record the end time
    print(f"Time to run {func.__name__}:", end_time - start_time)

lb3vh1jj

lb3vh1jj1#

IIUC,您可以:

groups = df.groupby('leader_id')['user_id'].agg(set).to_dict()

df['team'] = df['user_id'].apply(lambda x: {x} | groups.get(x, set()))
df['team_power'] = df['team'].apply(lambda x: df.loc[df['user_id'].isin(x), 'power'].mean())
print(df)

字符串
图纸:

user_id  leader_id  power       team  team_power
0        1          1   10.0        {1}        10.0
1        2          3   20.0     {2, 3}        25.0
2        3          2   30.0     {2, 3}        25.0
3        4          5   40.0        {4}        40.0
4        5          5   50.0  {4, 5, 6}        45.0
5        6          5    NaN        {6}         NaN


初始df

user_id  leader_id  power
0        1          1   10.0
1        2          3   20.0
2        3          2   30.0
3        4          5   40.0
4        5          5   50.0
5        6          5    NaN


UPDATE:要递归计算团队:

def get_team(team):
    for user in df.loc[df['leader_id'].isin(team), 'user_id']:
        if user not in team:
            team |= get_team(team | {user})
    return team

df['team'] = [get_team({u}) for u in df['user_id']]
df['team_power'] = df['team'].apply(lambda x: df.loc[df['user_id'].isin(x), 'power'].mean())
print(df)


图纸:

user_id  leader_id  power          team  team_power
0        1          1   10.0           {1}   10.000000
1        2          3   20.0        {2, 3}   25.000000
2        3          2   30.0        {2, 3}   25.000000
3        4          5   40.0           {4}   40.000000
4        5          5   50.0  {4, 5, 6, 7}   53.333333
5        6          5    NaN        {6, 7}   70.000000
6        7          6   70.0           {7}   70.000000
7        8          8    NaN           {8}         NaN

相关问题