R: parameter selection function for dismo::gbm.step in parallel

Asked by 2uluyalo on 2023-09-27

I have a working function that is coded to take advantage of parallel processing (hopefully). I'm still not the most fluent in R, especially when it comes to functions and iteration.
I'm hoping someone can help me optimize the function I've written, along with the surrounding code, to cut computation time and make full use of the parallel-processing options.
In particular: using %do% versus %dopar%, and moving the extra code and the parallel-processing setup inside the function. I can't seem to get %dopar% to work, and I'm not sure whether the problem is my code, my R version, or a library conflict.
I would greatly appreciate any suggestions on possible ways to get the same result more efficiently.

Background:

I'm building gbm models with dismo::gbm.step. gbm.step selects the optimal number of trees via k-fold cross-validation, but the tree.complexity and learning.rate parameters still need to be set. I know caret::train was built specifically for this task, and I've had a lot of fun learning caret, especially its adaptive resampling; however, my response is binomially distributed, and caret has no option to return AUC for a binomial distribution. I want to use AUC to replicate similar published studies in my field (ecology).
Later in the analysis I also use dismo::gbm.simplify to identify possible simplified models. gbm.simplify relies on data created when the model is built in dismo, and I couldn't get it to work on models built in caret.
Finally, most of the gbm literature in ecology follows the methods described in Elith et al. 2008, "A working guide to boosted regression trees", which is the basis of the BRT functionality in dismo. For the purposes of this study I'd like to keep using dismo to build the gbm models.
The function I've written tests several combinations of tree.complexity and learning.rate and returns a list of several performance metrics for each model. I then combine all of the lists into a single data.frame for easier sorting.

Goals of the function

1. Create a gbm model for each combination of tree.complexity and learning.rate.
2. For each gbm model created, store self.statistics$discrimination, cv.statistics$discrimination.mean, self.statistics$mean.resid, and cv.statistics$deviance.mean in a list.
3. Remove each gbm model to save space.
4. Combine the lists into a format that is easy to sort, then remove the individual lists.
5. Do all of the above in a way that optimizes parallel processing and minimizes computation time and memory use.

Reproducible example using the Anguilla_train dataset from the dismo package:

#Load libraries
require(pacman)
p_load(gbm, dismo, TeachingDemos, foreach, doParallel, data.table) 

data(Anguilla_train)

#Identify physical cores on the current system
cores <- detectCores(all.tests = FALSE, logical = FALSE)
cores

#Create a training function for gbm.step
step.train.fx <- function(tree.com, learn){
  #set seed for reproducibility
  char2seed("StackOverflow", set = TRUE)
  k1<-gbm.step(data=Anguilla_train, 
               gbm.x = 3:13, 
               gbm.y = 2,
               family = "bernoulli", 
               tree.complexity = tree.com,
               learning.rate = learn,
               bag.fraction = 0.7,
               prev.stratify=TRUE,
               n.folds=10,
               n.trees=700,
               step.size=25,
               silent=TRUE,
               plot.main = FALSE,
               n.cores=cores)

  k.out <- list(interaction.depth = k1$interaction.depth,
                shrinkage = k1$shrinkage,
                n.trees = k1$n.trees,
                AUC = k1$self.statistics$discrimination,
                cv.AUC = k1$cv.statistics$discrimination.mean,
                deviance = k1$self.statistics$mean.resid,
                cv.deviance = k1$cv.statistics$deviance.mean)
  return(k.out)
}

#Define the tree complexities and learning rates to test
tree.complexity <- 1:5
learning.rate <- c(0.01, 0.025, 0.005, 0.0025, 0.001)

#Set up a parallel backend to use n processors
cl <- makeCluster(cores)
registerDoParallel(cl)

#Run the actual function
foreach(i = tree.complexity) %do% {
  foreach(j = learning.rate) %do% {
    nam <- paste0("gbm_tc", i, "lr", j)
    assign(nam, step.train.fx(tree.com = i, learn = j))
  }
}

#Stop parallel
stopCluster(cl)
registerDoSEQ()

#disable scientific notation
options(scipen=999)

#Find all items in the workspace whose names contain "gbm_tc"
train.all <- ls(pattern = "gbm_tc")

#cbind each list whose name contains "gbm_tc"
train.results <- list(do.call(cbind, mget(train.all)))

#Place in a data frame
train.results <- do.call(rbind, lapply(train.results, rbind))
train.results <- data.frame(matrix(unlist(train.results), ncol = 7, byrow = TRUE))

#Change column names
colnames(train.results)<-c("TC","LR","n.trees", "AUC", "cv.AUC", "dev", "cv.dev")

#Round columns 4:7
train.results[, 4:7] <- round(train.results[, 4:7], digits = 3)

#Sort by cv.dev, then cv.AUC, then AUC
train.results <- train.results[order(train.results$cv.dev, -train.results$cv.AUC, -train.results$AUC), ]

train.results
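
As an aside, each gbm_tc* object is just a named list, so the combining block above could also be collapsed into one step; a minimal sketch using data.table's rbindlist (data.table is already loaded via pacman), with train.results.dt as a purely illustrative name:

#Alternative one-step combine: rbindlist() stacks the named lists directly
#into a data.table, one row per model
train.results.dt <- rbindlist(mget(train.all))
setnames(train.results.dt, c("interaction.depth", "shrinkage"), c("TC", "LR"))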

cpjpxq1n #1

I was wondering how to do this myself, and you've gotten further than I have! One thing that occurred to me: could the problem be the nested %do%? As a test, why not try %dopar% on just j, or see whether you can collapse the j and k matrix into a single vector, perhaps a list of two-item permutations, to pass to gbm.step? For example:

tree.complexity = i[1],
learning.rate = i[2],
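
A minimal sketch of that collapsed-vector idea, assuming the question's step.train.fx is defined and a doParallel backend is registered; params and results are illustrative names:

#Hypothetical sketch: one flat grid of (tree.complexity, learning.rate) pairs,
#iterated with a single, non-nested foreach
params <- expand.grid(tree.com = 1:5,
                      learn = c(0.01, 0.025, 0.005, 0.0025, 0.001))

results <- foreach(i = seq_len(nrow(params)),
                   .packages = c("dismo", "gbm", "TeachingDemos")) %dopar% {
  #consider n.cores = 1 inside step.train.fx here, so each worker
  #doesn't try to spawn its own cluster on top of this one
  step.train.fx(tree.com = params$tree.com[i], learn = params$learn[i])
}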

Please let me know if you have any success!
Edit: also, another possible route is %:% from here:

foreach(tree.com = 1:5) %:% foreach(learn = c(0.01,0.025,0.005,0.0025,0.001)) %dopar% {
gbm.step ... return(list(...))}

If you add tree.com and learn to the returned list, then it may spit out a nice matrix of those values. Another option:

foreach(tree.com = 1:5, learn = c(0.01, 0.025, 0.005, 0.0025, 0.001)) %dopar% {
    gbm.step ... return(list(...))}

(Note, though, that passing two iterators to a single foreach like this steps through them pairwise, so it would only run five combinations rather than the full 5 × 5 grid; the %:% version above crosses them.)
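
For what it's worth, a fuller sketch of the %:% route, again assuming the question's step.train.fx and a registered doParallel backend; %:% crosses the two iterators, so all 25 combinations run:

results <- foreach(tree.com = 1:5, .combine = rbind) %:%
  foreach(learn = c(0.01, 0.025, 0.005, 0.0025, 0.001),
          .combine = rbind,
          .packages = c("dismo", "gbm", "TeachingDemos")) %dopar% {
    #tree.com and learn come back in the list as interaction.depth and
    #shrinkage, so each iteration contributes one row to the result
    do.call(cbind, step.train.fx(tree.com = tree.com, learn = learn))
  }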

798qvoo8 #2

I accomplished this using some of your code from here and from this post:

p_load(gbm, dismo, TeachingDemos, foreach, doParallel, data.table, dplyr)
          
#Create grid of hyperparameter and variable combinations  
hyper_grid <- expand.grid(learning.rate = c(0.00005,
                                            0.00001,
                                            0.0005,
                                            0.0001,
                                            0.005,
                                            0.001,
                                            0.05,
                                            0.01,
                                            0.5,
                                            0.1),
                          tree.complexity = seq(1, 3, 1),
                          bag.fraction = seq(0.2, 0.8, 0.05)
            )
            
#Set up cluster to run in parallel
ncores <- detectCores()
cl <- makeCluster(ncores, outfile="FullGBMGridSearchListening.txt")
registerDoParallel(cl)
            
#Run the grid search. Be warned: depending on how granular your grid is, this can
#take a very long time to run despite the efficiency gains from parallelisation.
system.time(
  hyper_grid_res <- foreach(i = 1:nrow(hyper_grid),
                            .packages = c('dismo', 'gbm', 'TeachingDemos', 'data.table'),
                            .combine = rbind) %dopar% {
                
  char2seed("reproducibility", set = TRUE) #set the seed for reproducibility
                
  #Train a model on the ith hyperparameter combination
  gbm.tune <- gbm.step(data = mod.data, #your modelling data, e.g. Anguilla_train from the question
                       gbm.x = 3:13,
                       gbm.y = 2,
                       family = "bernoulli",
                       tree.complexity = hyper_grid$tree.complexity[i],
                       learning.rate = hyper_grid$learning.rate[i],
                       bag.fraction = hyper_grid$bag.fraction[i],
                       max.trees = 20000,
                       verbose = FALSE,
                       silent = TRUE,
                       plot.main = FALSE)
                
  #Extract the info we need for model ranking
  tree.complexity <- hyper_grid$tree.complexity[i]
  learning.rate <- hyper_grid$learning.rate[i]
  bag.fraction <- hyper_grid$bag.fraction[i]
  n.trees <- gbm.tune$n.trees
  CV_correlation <- gbm.tune$cv.statistics$correlation.mean
  CV_deviance_explained <- ((gbm.tune$self.statistics$mean.null - gbm.tune$cv.statistics$deviance.mean) / gbm.tune$self.statistics$mean.null) * 100

  print(i) #Keep track of progress in the listener file

  #Combine the desired outputs in a data.table, one row per combination
  data.table(tree.complexity = tree.complexity,
             learning.rate = learning.rate,
             bag.fraction = bag.fraction,
             n.trees = n.trees,
             CV_correlation = CV_correlation,
             CV_deviance_explained = CV_deviance_explained)
  }
)

stopCluster(cl)
            
#Order by deviance explained, then n.trees, then tree.complexity, and place into
#a new object (I recommend saving this as a .csv for subsequent use)

Full_BRT_parallel_grid.search.results <- hyper_grid_res %>%
  dplyr::arrange(desc(CV_deviance_explained), desc(n.trees), tree.complexity)
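
And, per the comment above, the results can be written straight out for later use; a one-line sketch using data.table's fwrite (the filename is just a placeholder):

fwrite(Full_BRT_parallel_grid.search.results, "BRT_grid_search_results.csv")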
