我遇到了一个问题。我创造了一个拥有数百万用户的游戏。每个用户都是默认的“团队领导者”,但允许将其领导者更改为任何其他用户。多个用户可以在这里创建循环依赖。所以我有一个大的数据框,看起来像(这里是一个例子):
user_id | leader_id | power
---|---|----
1 | 1 | 10
2 | 3 | 20
3 | 2 | 30
4 | 5 | 40
5 | 5 | 50
字符串
我对整个团队不感兴趣,只对你团队中排名低于你的人感兴趣。因此,我感兴趣的“团队”应该是这样的(注意,用户4在自己的团队中,因为他们没有领导任何人,但用户5有一个团队[4,5],因为他们领导用户4)。
user_id | leader_id | power | team
---|---|----|----
1 | 1 | 10 | [1]
2 | 3 | 20 | [2, 3]
3 | 2 | 30 | [2, 3]
4 | 5 | 40 | [4]
5 | 5 | 50 | [4, 5]
型
现在,我要做的就是创建一个列,这是你的团队的平均功率水平。这听起来很简单,但我无法让它在数百万行数据上快速运行。我试过使用networkx,但它基本上是通过每个人的步骤,并重新计算整个树的每一次。我试着写自己的函数来缓存它,但我仍然在挣扎。我写了一个带有测试的示例脚本。如果有人能帮助我,我将不胜感激。我愿意接受任何其他计算方法。当我运行这个脚本时,networkx方法大约需要53秒,而另一种方法大约需要38秒。我希望能更快地得到它,因为这只是在50,000行样本数据上。
import time
import networkx as nx
import numpy as np
import pandas as pd
from tests.pandas_test_utils import assert_frame_equal
def get_team_power_using_networkx(df):
G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())
ancestors = {node: nx.ancestors(G, node) for node in G.nodes()}
user_ids = df['user_id'].values
team_ids_list = [list(ancestors.get(user_id, set()) | {user_id}) for user_id in user_ids]
hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))
return df
def get_team_power_using_custom(df):
G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())
def get_ancestors(node):
if node in ancestors:
return ancestors[node]
ancestors[node] = set()
for parent in G.predecessors(node):
ancestors[node].add(parent)
ancestors[node].update(get_ancestors(parent))
return ancestors[node]
ancestors = {}
user_ids = df['user_id'].values
team_ids_list = [list(get_ancestors(user_id) | {user_id}) for user_id in user_ids]
hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))
return df
## Test if it works
"""
user 1 is alone. team is [1] and average power is [10]
user 2 and 3 have each other as leaders. both have teams consisting of [2, 3] and average power is [25]
user 4 reports to user 5. team is [4] and average power is [40]
user 5 reports to themself. team is [4, 5, 6, 7] and average power is [53] -> (40 + 50 + 70) / 3
user 6 reports to user 5. team is [6, 7] and average power is [70]
user 7 reports to user 6 who reports to user 5. team is [7] and average power is [70]
user 8 reports to themself. team is [8] and average power is [np.nan]
"""
user_id = [1, 2, 3, 4, 5, 6, 7, 8]
leader_id = [1, 3, 2, 5, 5, 5, 6, 8]
power = [10, 20, 30, 40, 50, np.nan, 70, np.nan]
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
print(f"Testing: {func.__name__}")
results = func(df)[["user_id", "team_power"]]
expected_results = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6, 7, 8], "team_power": [10, 25, 25, 40, 53, 70, 70, np.nan]})
assert_frame_equal(results, expected_results)
## Test speed
num_rows = 50000
np.random.seed(42)
# Generate random values for each column
user_id = np.random.randint(1, 1000, size=num_rows)
leader_id = np.random.randint(1, 1000, size=num_rows)
power = np.random.uniform(0, 100, size=num_rows)
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
start_time = time.time() # Record the start time
func(df)
end_time = time.time() # Record the end time
print(f"Time to run {func.__name__}:", end_time - start_time)
型
1条答案
按热度按时间lb3vh1jj1#
IIUC,您可以:
字符串
图纸:
型
初始
df
:型
UPDATE:要递归计算团队:
型
图纸:
型