R: Concatenating hexbin objects (or another way to iteratively build a hexbin plot from a very large dataset)

3htmauhk, posted 2023-03-27 in Other

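I have a dataset with 12,000 participants, each with two matrices: one holds the distance between pairs of brain regions, the other holds the correlation of connectivity between pairs of brain regions over time. I want to plot distance x correlation for all regions, pooled across all participants, in a single figure. That is 1,555,200,000 data points (129,600 per participant), and R will not allocate a vector of that size for an axis. Ideally, I would also like to draw a line of best fit on top.
The strategy I am trying is to use the hexbin package to count the points in a set of hex bins for each participant, and then sum iteratively so that I end up with a count per bin across all participants. However, I cannot work out how to do the summation, because the hexbin package has no concatenate method (or any equivalent that I have found).
So basically I want to do something like this: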

library(mc2d)
library(hexbin)
N=10000
x1<-rpert(N,0,2,4,shape=5)
y1<-rpert(N,2,8,10,shape=5)
x2<-rpert(N,6,8,10,shape=5)
y2<-rpert(N,0,2,8,shape=5)
xc<-c(x1,x2)
yc<-c(y1,y2)

h1<-hexbin(x1,y1,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)
h2<-hexbin(x2,y2,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)
hc<-hexbin(xc,yc,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)

plot(hc)

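Except that I want to generate hc from h1 and h2 rather than from the component vectors, because in my real application the component vectors are too large to hold in memory. I am open to using Python or another language to get this done.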

Answer #1 (vqlkdk9b)

I think this works:

Unpack a hexbin object

Extract either the values (as columns of a data frame) or the metadata.

unpack_hexbin <- function(x, element = c("cols", "metadata")) {
    element <- match.arg(element)
    get_slots <- function(nm) Map(\(c) getElement(x, c), nm)
    cols <- c("cell", "count", "xcm", "ycm")
    if (element == "cols") return(get_slots(cols) |>
                                  do.call(what = "data.frame"))
    other_slots <- setdiff(slotNames(x), c(cols, "call", "n", "ncells"))
    get_slots(other_slots)
}

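Combine

Takes a list of hexbin objects and:

  • extracts and merges the data columns
  • sums the values associated with each cell, skipping NAs
  • extracts the metadata from the first hexbin object (assuming they are all consistent!)
  • puts the pieces back together
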
combine_hexbin <- function(L) {
    h <- Map(unpack_hexbin, L)
    h2 <- Reduce(\(x,y) 
        merge(x, y, by = c("cell", "xcm", "ycm"), all = TRUE),
           h)
    comb <- apply(h2[,-(1:3)], 1, sum, na.rm = TRUE)
    do.call(new,
            c(list("hexbin"),
              as.list(h2[,1:3]),
              list(count = comb,
                   n = sum(comb),
                   ncells = length(comb)),
              unpack_hexbin(L[[1]], "metadata"),
              call = quote(call("junk", 1))
            ))
}

Try it out

Using the values from the example above:

L <- list(h1, h2)
cc <- combine_hexbin(L)
plot(cc)

Answer #2 (lyr7nygr)

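Ben's answer is very close, but xcm and ycm in a hexbin object are centers of mass, and they are not unique to a cell. As a result, merging on them wrongly produces duplicate cells as soon as the data overlap at all. The key piece of information is that a cell ID is unique to a particular hexagon in the grid defined by the bounds information. You can discover this by comparing the ranges of cell IDs in overlapping and non-overlapping hexbins, or by looking at the x and y coordinates returned by the hcell2xy function. So, as long as two hexbins share the same grid (the same bounds, and the same xbins and shape), you can simply merge on the cell ID.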
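The claim about cell IDs can be checked directly. The following is a minimal sketch, not part of the original answer: the data, the seed and the names ha, hb and shared are made up for illustration. It bins two overlapping samples on the same grid and compares hexagon centers and centers of mass for the cell IDs they have in common.

```r
library(hexbin)

set.seed(1)
# Illustrative data: two samples that overlap in the square [2, 4] x [2, 4]
xa <- runif(5000, 0, 4); ya <- runif(5000, 0, 4)
xb <- runif(5000, 2, 6); yb <- runif(5000, 2, 6)

# Same grid for both: identical xbnds, ybnds, xbins and shape
ha <- hexbin(xa, ya, xbnds = c(0, 10), ybnds = c(0, 10), xbins = 20, shape = 0.75)
hb <- hexbin(xb, yb, xbnds = c(0, 10), ybnds = c(0, 10), xbins = 20, shape = 0.75)

# Cell IDs present in both objects, and their positions in each
shared <- intersect(ha@cell, hb@cell)
ia <- match(shared, ha@cell)
ib <- match(shared, hb@cell)

# Hexagon centers depend only on the grid, so they agree for shared IDs
ca <- hcell2xy(ha)
cb <- hcell2xy(hb)
centers_match <- isTRUE(all.equal(ca$x[ia], cb$x[ib])) &&
  isTRUE(all.equal(ca$y[ia], cb$y[ib]))

# Centers of mass depend on the points that fell in each cell, so they differ
cm_match <- isTRUE(all.equal(ha@xcm[ia], hb@xcm[ib]))
```

Here centers_match should be TRUE while cm_match should be FALSE, which is why merging on xcm and ycm splits a shared cell into two rows.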
Restating the problem with overlapping data:

N=10000
x1<-rpert(N,0,2,4,shape=5)
y1<-rpert(N,2,8,10,shape=5)
x2<-rpert(N,0,5,10,shape=5)
y2<-rpert(N,0,5,10,shape=5)

h1<-hexbin(x1,y1,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)
h2<-hexbin(x2,y2,xbnds=c(0,10),ybnds=c(0,10),xbins=100,shape=.75)

Solution (adapted from Ben's):

library(dplyr)  # provides %>%, rowwise() and mutate(), used in combine_hexbin()

# Get elements from an S4 object by name
get_slots <- function(x,nm) Map(\(c) getElement(x, c), nm)

# Unpack hexbin data to be merged in to a dataframe
# Strictly speaking we don't need the xy coordinates, but it is a good error
# check if we have the computation time available.
unpack_hexbin <- function(x) {
  cols <- c("cell", "count", "xcm", "ycm")
  return(cbind(data.frame(get_slots(x,cols)),
                 hcell2xy(x)))
}

# Get the slots of a hexbin object that should not vary between the hexbins
# being merged (the grid metadata).
getmeta_hexbin <- function(x) {
  varying=c("cell", "count", "xcm", "ycm", "call", "n", "ncells")
  other_slots <- setdiff(slotNames(x), varying)
  get_slots(x,other_slots)
}

# Center of mass calculation for two points, robust to missing data. 
cm<-function(x1,x2,x1w,x2w) {
  i<-x1*x1w
  j<-x2*x2w
  w<-sum(x1w,x2w,na.rm=TRUE)
  return(sum(i,j,na.rm=TRUE)/w)
}

combine_hexbin <- function(a,b) {
    hm <- merge(unpack_hexbin(a), 
                unpack_hexbin(b), 
                by = c("cell","x","y"), 
                all = TRUE)
    if(any(duplicated(hm$cell))) stop("Duplicate cell Id's detected: Do the hexbin objects have the same grid?")
    hm2 <- hm %>% rowwise() %>% mutate(
      count=sum(count.x,count.y,na.rm=TRUE),
      xcm=cm(xcm.x,xcm.y,count.x,count.y),
      ycm=cm(ycm.x,ycm.y,count.x,count.y)
    )
    do.call(new,
            c(list("hexbin"),
              as.list(hm2[,c("cell",
                             "count",
                             "xcm",
                             "ycm")]),
              list(n = sum(hm2$count),
                   ncells = nrow(hm2)),  # one row per cell; length() would count columns
              getmeta_hexbin(a),
              call = quote(call("merged hexbin", 1))
            ))
}

plot(combine_hexbin(h1,h2))
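
For the original use case (thousands of participants that cannot be held in memory together), the same idea can be applied one chunk at a time. The following is a rough sketch under stated assumptions, not a tested recipe for the real data: get_chunk(), bin_chunk() and add_hexbin() are hypothetical helpers invented here, the data are simulated, and the grid settings are arbitrary. It keeps a running table keyed by cell ID and stores count-weighted coordinate sums, so the centers of mass can be recovered at the end.

```r
library(hexbin)

# Hypothetical stand-in for loading one participant's (distance, correlation)
# pairs; here it just simulates 2,000 points inside the unit square.
get_chunk <- function(i) {
  set.seed(i)
  data.frame(x = runif(2000), y = rbeta(2000, 2, 5))
}

# One fixed grid shared by every chunk (the bounds must cover all the data)
bin_chunk <- function(d) {
  hexbin(d$x, d$y, xbnds = c(0, 1), ybnds = c(0, 1), xbins = 30, shape = 0.75)
}

# Fold one hexbin into the running table, keyed by cell ID.
# sx and sy are count-weighted coordinate sums.
add_hexbin <- function(acc, h) {
  part <- data.frame(cell = h@cell, count = h@count,
                     sx = h@xcm * h@count, sy = h@ycm * h@count)
  both <- rbind(acc, part)
  sums <- rowsum(both[, c("count", "sx", "sy")], group = both$cell)
  data.frame(cell = as.integer(rownames(sums)), sums, row.names = NULL)
}

acc <- data.frame(cell = integer(), count = numeric(),
                  sx = numeric(), sy = numeric())
for (i in 1:5) {
  h <- bin_chunk(get_chunk(i))   # only one chunk is in memory at a time
  acc <- add_hexbin(acc, h)
}

# Reuse the last hexbin as a template for the grid metadata
total <- h
total@cell   <- acc$cell
total@count  <- as.integer(acc$count)
total@xcm    <- acc$sx / acc$count
total@ycm    <- acc$sy / acc$count
total@n      <- as.integer(sum(acc$count))
total@ncells <- nrow(acc)

plot(total)
```

Because each point's cell depends only on the grid, the accumulated counts should match what hexbin would return for all the data at once. The total count is coerced to integer here, so this sketch would need adjusting beyond roughly 2.1 billion points; the 1,555,200,000 points in the question fit under that limit.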
