R: Calculating the share of resources used by node i relative to its neighbors on a large network

pgx2nnw8 posted on 2023-03-10 in Other

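Goal: The main goal is to compute the share of resources used by node i relative to its neighbors: r_i / sum_j^i{r_j}

where r_i is the resources of node i and sum_j^i{r_j} is the sum of the resources of i's neighbors.
I am open to any R, Python, or Stata solution, as long as it gets done a task I have nearly given up on. Please see the snippets of my previous attempts below.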
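To get there, I tried to perform the following kind of search:
| Node | Col1 | Col2 | Col3 |
| --- | --- | --- | --- |
| i | [A] | [list] | [list] |
| j | [A, B, i] | | |
Search for i in Col1 across all rows; if it is found in another node's list (here j's), update Col1 of i with that node:
| Node | Col1 | Col2 | Col3 |
| --- | --- | --- | --- |
| i | [A, j] | [list] | [list] |
| j | [A, B, i] | | |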
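
The update step above amounts to making the neighbor relation symmetric. A minimal sketch in base R, assuming the lists have already been exploded into a long edge list; the data frame `edges` and its toy contents are hypothetical and only mirror the table:

```r
# Toy long-format edge list mirroring the table: i -> A, j -> A, j -> B, j -> i
edges <- data.frame(
  from = c("i", "j", "j", "j"),
  to   = c("A", "A", "B", "i"),
  stringsAsFactors = FALSE
)

# Add every edge in the reverse direction, then drop duplicate pairs
reversed <- data.frame(from = edges$to, to = edges$from, stringsAsFactors = FALSE)
sym <- unique(rbind(edges, reversed))

# Neighbors of i after the update: A (already listed) and j (found in j's list)
neighbors_of_i <- sort(sym$to[sym$from == "i"])
```

Doing this once on the edge list avoids searching every row's list for every node.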

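Data: The data frame has about 700k rows, and each list can contain up to 20 elements. The lists in Col1 to Col3 can be empty. Entries are stored as strings and look like "1579301860".

First 10 entries of df: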

df[["ID","s22_12","s22_09","s22_04"]].head(10)
,ID,s22_12,s22_09,s22_04
0,547232925,[],[],[]
1,1195452119,[],[],[]
2,543827523,[],[],[]
3,1195453927,[],[],[]
4,1195456863,[],[],[]
5,403735824,[],[],[]
6,403985344,[],[],[]
7,1522725190,"['547232925', '1561895862', '1195453927', '1473969746', '1576299336', '1614620375', '1526127302', '1523072827', '398988727', '1393784634', '1628271142', '1562369345', '1615273511', '1465706815', '1546795725']","['1550103038', '547232925', '1614620375', '1500554025', '1526127302', '1523072827', '1554793443', '1393784634', '1603417699', '1560658585', '1533511207', '1439071476', '1527861165', '1539382728', '1545880720']","['1529732185', '1241865116', '1524579382', '1523072827', '1526127302', '1560851415', '1535455909', '1457280850', '1577015775', '1600877852', '1549989930', '1528007558', '1533511207', '1527861165', '1591602766']"
8,789656124,[],[],[]
9,662539468,[1195453927],[],[]
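
For reference, one way to turn such list strings into a long edge list in base R. This is only a sketch: the three-row `df` below is a hypothetical miniature of the real data, it handles a single column (`s22_12`), and it assumes every ID consists of digits only:

```r
# Hypothetical miniature of the data frame shown above
df <- data.frame(
  ID     = c("547232925", "1522725190", "662539468"),
  s22_12 = c("[]", "['547232925', '1195453927']", "[1195453927]"),
  stringsAsFactors = FALSE
)

# Pull every run of digits out of each list string; "[]" yields character(0)
ids <- regmatches(df$s22_12, gregexpr("[0-9]+", df$s22_12))

# One row per (node, neighbor) pair; rows with empty lists contribute nothing
edges <- data.frame(
  from = rep(df$ID, lengths(ids)),
  to   = unlist(ids),
  stringsAsFactors = FALSE
)
```

The same extraction can be repeated for `s22_09` and `s22_04` if each period is needed separately.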

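What I tried: In R, I exploded the lists and put the data into long format. I then tried two main approaches in R:

1. Load the long data into igraph, apply neighbors() to each node of the graph, save the results to a list, and use plyr to build neighbor_df (works, but 2 nodes take 67 seconds)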

library(igraph)

# Initialize the result data frame, one row per node
result <- data.frame(Node = nodes)

# Collect the names of each node's neighbors (NA when a node has none)
neighbor_lists <- lapply(nodes, function(x) {
  nb <- names(neighbors(graph, x))
  if (length(nb) == 0) {
    nb <- NA
  }
  nb
})

# Bind the ragged lists into one wide data frame, padding with NA
neighbor_df <- plyr::ldply(neighbor_lists, rbind)
names(neighbor_df) <- paste0("Neighbor", seq_len(ncol(neighbor_df)))
result <- cbind(result, neighbor_df)

2. Use data.table: read the long format, split it, and apply dcast to each split (<- memory overload)

result_long <- edges[, .(to = to, Node = from)][, rn := .I][,   .(Node, Neighbor = to, Number = rn)][order(Number),]
result_long[,cast_cat:=findInterval(Number,seq(100000,6000000,100000))]
# reshape to wide
result_wide <- dcast(result_long, Node ~ Number, value.var = "Neighbor", fill = "")
# Only tested on sample data; the target data has 19 million rows, so dcast would have to run on splits, but that then consumes 200 GB of RAM
result_wide[, (2:ncol(result_wide)) := lapply(.SD, function(x) ifelse(x == "", NA, x)), .SDcols = 2:ncol(result_wide)]
result_wide = na_move(result_wide, cols = names(result_wide[,!1]) )
result_wide<- Filter(function(x)!all(is.na(x)), result_wide)
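
For the share itself, the wide table may not be needed at all: the long edge list can be aggregated directly. A minimal base R sketch, assuming the edge list is already symmetric and that resources are available as a named vector; `r`, `edges`, and their toy values are hypothetical:

```r
# Hypothetical resources per node and a symmetric long edge list (a-b, a-c)
r <- c(a = 10, b = 30, c = 60)
edges <- data.frame(
  from = c("a", "a", "b", "c"),
  to   = c("b", "c", "a", "a"),
  stringsAsFactors = FALSE
)

# Look up each neighbor's resources and sum them per node
neigh_sum <- rowsum(r[edges$to], group = edges$from)

# r_i divided by the sum of the neighbors' resources
share <- r[rownames(neigh_sum)] / neigh_sum[, 1]
```

Nodes without any neighbor never appear in `from`, so they are simply absent from `share`. With data.table the same grouping could be written as a `sum(...)` by `from`, which keeps memory use proportional to the number of edges.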

I posted this as Andy requested, but I think it makes the question cluttered.

osh3o9ms


Thanks to @Stefano Barbi's comment:

library(igraph)
library(data.table)

# extract the resource attribute of every vertex
r <- vertex_attr(g, "rcount", index = V(g))

# create a sparse (dgCMatrix) adjacency matrix from the graph
# (as_adjacency_matrix() is the newer name for get.adjacency())
m <- as_adjacency_matrix(g)

# premultiply the adjacency matrix by r to get the sum of the neighbors' resources
sum_of_rj <- r %*% m

# add the node's own resources
sum_of_r <- sum_of_rj + r

# compute the vector of shares
share <- r / sum_of_r@x

sh_tab <- data.table(i = sum_of_r@Dimnames[[2]], sh = share)
sh_tab
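
To see what the matrix product does, here is a small self-contained sketch that builds the adjacency matrix directly with the Matrix package instead of going through igraph. The three-node graph and the resource values are hypothetical. Note that the code above puts the node's own resources in the denominator, whereas the formula in the question divides by the neighbors' sum only, so both variants are shown:

```r
library(Matrix)

# Toy undirected graph with edges a-b and a-c, and hypothetical resource counts
r <- c(a = 10, b = 30, c = 60)
m <- sparseMatrix(
  i = c(1, 1, 2, 3), j = c(2, 3, 1, 1), x = 1,
  dims = c(3, 3), dimnames = list(names(r), names(r))
)

# Row vector times adjacency matrix: entry i is the sum of r over i's neighbors
sum_of_rj <- as.numeric(as.matrix(r %*% m))

# Variant from the question: r_i / sum of neighbors' resources
share_excl <- r / sum_of_rj

# Variant from the code above: own resources included in the denominator
share_incl <- r / (sum_of_rj + r)
```

A node with no neighbors gets a neighbor sum of 0, so the first variant returns Inf (or NaN when its own resources are also 0) for it and may need special handling.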
