The main goal is to be able to compute the share of resources used by node i relative to its neighbours: r_i / sum_{j in N(i)} r_j, where r_i is the resource of node i and the denominator is the sum of the resources of i's neighbours.
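To make the target concrete, here is a minimal sketch in R with igraph on made-up toy data (the graph g, the resource vector r, and the node names are all hypothetical):

library(igraph)

# Toy graph and resource vector keyed by node name (hypothetical values)
g <- graph_from_data_frame(
  data.frame(from = c("i", "j", "j"), to = c("A", "A", "B")),
  directed = FALSE
)
r <- c(i = 10, j = 5, A = 2, B = 3)

# Share of a node relative to its neighbours: r_i / sum over N(i) of r_j
share <- function(node) {
  nb <- names(neighbors(g, node))
  r[node] / sum(r[nb])
}
share("i")   # i's only neighbour here is A, so 10 / 2 = 5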
I am open to any R, Python or Stata solution, as long as it can get done the task I am almost ready to give up on... see the snippets of my previous attempts below.
To get there, I tried to perform the following kind of lookup:
| Node | Column1 | Column2 | Column3 |
| ------ | ------ | ------ | ------ |
| i | [A] | [list] | list |
| j | [A, B, i] | | |
Search Column1 for i, and if it is found, update Column1 (a code sketch of this step follows the second table):
| Node | Column1 | Column2 | Column3 |
| ------ | ------ | ------ | ------ |
| i | [A, j] | [list] | list |
| j | [A, B, i] | | |
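A minimal sketch of that update step, assuming the list columns have already been exploded into a long edge table called edges with hypothetical columns from/to: searching every Column1 for i and appending the matching node is the same as symmetrising the edge list and then re-collecting the neighbours per node.

library(data.table)

# Hypothetical long edge table: one row per list entry
edges <- data.table(from = c("i", "j", "j", "j"),
                    to   = c("A", "A", "B", "i"))

# Add the reversed edges, deduplicate, then collect neighbours per node
edges_sym <- unique(rbind(edges, edges[, .(from = to, to = from)]))
neighbour_lists <- edges_sym[, .(neighbours = list(to)), by = from]
neighbour_lists[from == "i"]   # i's list is now A, j, as in the second table above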
The dataframe has about 700k rows, and the lists can contain up to 20 elements. The lists in Column1-Column3 can be empty. The entries are stored as strings and look like "1579301860".
First 10 entries of df:
df[["ID","s22_12","s22_09","s22_04"]].head(10)
,ID,s22_12,s22_09,s22_04
0,547232925,[],[],[]
1,1195452119,[],[],[]
2,543827523,[],[],[]
3,1195453927,[],[],[]
4,1195456863,[],[],[]
5,403735824,[],[],[]
6,403985344,[],[],[]
7,1522725190,"['547232925', '1561895862', '1195453927', '1473969746', '1576299336', '1614620375', '1526127302', '1523072827', '398988727', '1393784634', '1628271142', '1562369345', '1615273511', '1465706815', '1546795725']","['1550103038', '547232925', '1614620375', '1500554025', '1526127302', '1523072827', '1554793443', '1393784634', '1603417699', '1560658585', '1533511207', '1439071476', '1527861165', '1539382728', '1545880720']","['1529732185', '1241865116', '1524579382', '1523072827', '1526127302', '1560851415', '1535455909', '1457280850', '1577015775', '1600877852', '1549989930', '1528007558', '1533511207', '1527861165', '1591602766']"
8,789656124,[],[],[]
9,662539468,[1195453927],[],[]
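Because the list columns are stored as strings like "['547232925', ...]", one way to reach the long format in R is to strip the brackets and quotes and split on commas. A rough sketch with data.table (the file name nodes.csv and the helper parse_ids() are made up):

library(data.table)

# Read everything as character so the IDs match the quoted list entries
df <- fread("nodes.csv", colClasses = "character")

# Turn "['547232925', '1561895862']" or "[]" into a character vector of IDs
parse_ids <- function(x) {
  x   <- gsub("\\[|\\]|'|\"|\\s", "", x)   # drop brackets, quotes, spaces
  ids <- strsplit(x, ",", fixed = TRUE)
  lapply(ids, function(v) v[nzchar(v)])    # "[]" becomes character(0)
}

# Explode one list column into a long edge table; empty lists contribute no rows
nb    <- parse_ids(df$s22_12)
edges <- data.table(from = rep(df$ID, lengths(nb)), to = unlist(nb))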
What I tried: in R, I exploded the lists and put them in long format. Then I tried two main approaches in R:
1. Load the long data into igraph, apply neighbors() to each node of the graph, save the results into a list, and use plyr to get neighbor_df (works, but 2 nodes took 67 seconds)
# Initialize the result data frame: one row per node
result <- data.frame(Node = nodes)
#result <- as.data.frame(matrix(NA, nrow = n_nodes, ncol = 0))

# For every node, look up the names of its neighbours in the igraph object
neighbor_lists <- lapply(nodes, function(x) {
  nb <- names(neighbors(graph, x))
  if (length(nb) == 0) {
    nb <- NA   # keep isolated nodes as a single NA
  }
  nb
})

# Bind the ragged lists row-wise into a wide Neighbor1..NeighborK frame
neighbor_df <- plyr::ldply(neighbor_lists, rbind)
names(neighbor_df) <- paste0("Neighbor", 1:ncol(neighbor_df))
result <- cbind(result, neighbor_df)
2. Using data.table: read the long format, split it, and apply dcast to each split (<- memory overload)
# Build the long (Node, Neighbor) table and number the rows
result_long <- edges[, .(to = to, Node = from)][, rn := .I][, .(Node, Neighbor = to, Number = rn)][order(Number),]
# Bucket the row numbers so dcast can be applied chunk by chunk
result_long[,cast_cat:=findInterval(Number,seq(100000,6000000,100000))]
# reshape to wide
result_wide <- dcast(result_long, Node ~ Number, value.var = "Neighbor", fill = "")
# Only tested on sample data; the target data is 19 mln rows, so dcast has to be split,
# but even then it consumes 200 GB of RAM
result_wide[, (2:ncol(result_wide)) := lapply(.SD, function(x) ifelse(x == "", NA, x)), .SDcols = 2:ncol(result_wide)]
# Move NAs within each row (na_move), then drop columns that are entirely NA
result_wide <- na_move(result_wide, cols = names(result_wide[, !1]))
result_wide <- Filter(function(x) !all(is.na(x)), result_wide)
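For what it is worth, the target ratio r_i / sum_{j in N(i)} r_j can also be computed entirely in long format with two joins, which avoids dcast and the wide neighbour matrix altogether. A rough sketch on toy data (the tables edges and res and their column names are hypothetical):

library(data.table)

# Hypothetical resource table and long edge table
res   <- data.table(ID = c("i", "j", "A", "B"), r = c(10, 5, 2, 3))
edges <- data.table(from = c("i", "i", "j", "j", "j"),
                    to   = c("A", "j", "A", "B", "i"))

# Attach each neighbour's resource, sum it per node, then divide
nb_r   <- res[edges, on = .(ID = to)]                  # one row per edge, with r of the neighbour
nb_sum <- nb_r[, .(r_nb = sum(r)), by = .(ID = from)]  # sum over each node's neighbours
shares <- res[nb_sum, on = "ID"][, share := r / r_nb][]
shares   # e.g. i: 10 / (2 + 5), j: 5 / (2 + 3 + 10)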
I am posting it as Andy asked, but I think it clutters the question.
1 Answer
Thanks to @Stefano Barbi's comments: