R: Distinguishing structural from non-structural candidate regressors in N LASSOs run sequentially on N synthetic datasets

rdrgkggo, posted 2023-03-05 in Other

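In this collaborative research project, we are working on the second draft of a 2008 working paper that proposes a promisingly straightforward but novel optimal variable selection algorithm for supervised statistical learning. The algorithm explored and evaluated in this research was named "Estimated Exhaustive Regression" by its inventor, my collaborator, the distinguished econometrician Dr. Antony Davies.
The key feature of these 260k datasets, each 503 x 31 and stored as a csv file, is that the true underlying regressors among the 30 initial candidates (called "structural variables" in many parts of modern economics and econometrics), that is, the candidates that best describe, explain, and predict the behavior of the response, are known for every dataset before any analysis or manipulation is performed on it. This is deliberate and by construction: it follows from the way Dr. Davies wrote the script he used to create the datasets via Monte Carlo simulation. How each dataset records which of its 30 candidate variables are truly underlying/structural is very simple; the information is included in the first 2 rows of the file: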

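The first row is a binary indicator row 30 cells long, in which 1 means the candidate variable is structural/explanatory/predictive for that dataset and 0 means it is not.
I already have the following code, which works well for loading the N datasets into R and preparing them before the LASSO regressions are run: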

# these 2 lines together create a simple character list of 
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/Data/top 50"
paths_list <- list.files(path = folderpath, full.names = TRUE, recursive = TRUE)

# shorten the names of each of the datasets corresponding to 
# each file path in paths_list
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)

# sort both lists (names and paths) so that they are in proper numeric order
my_order = DS_names_list |> 
  # split apart the listed numbers, convert them to numeric 
  strsplit(split = "-", fixed = TRUE) |>  unlist() |> as.numeric() |>
  # get them in a data frame
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)
DS_names_list = DS_names_list[my_order]
paths_list = paths_list[my_order]

# read the data in each of the csv files in parallel,
# using the file paths stored in the list we just created
CL <- parallel::makeCluster(max(1L, parallel::detectCores() - 4L))
parallel::clusterExport(CL, c('paths_list'))
system.time(datasets <- parallel::parLapply(cl = CL, X = paths_list, 
                                  fun = data.table::fread))
parallel::stopCluster(CL)  # release the worker processes when done

# rename the columns of every data.table in the list 'datasets'
datasets <- lapply(datasets, function(dataset_i) { 
  colnames(dataset_i) <- c("Y","X1","X2","X3","X4","X5","X6","X7","X8",
                           "X9","X10","X11","X12","X13","X14","X15",
                           "X16","X17","X18","X19","X20","X21","X22", 
                           "X23","X24","X25","X26","X27","X28","X29","X30")
  dataset_i })

dfs <- lapply(datasets, function(i) {i[-1:-3, ]})
dfs <- lapply(dfs, \(X) { lapply(X, as.numeric) })
dfs <- lapply(dfs, function(i) { as.data.table(i) })
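
The numeric ordering step above can be checked on a small example. The sketch below is only an illustration: the file names are hypothetical, and it assumes every name consists of the same number of hyphen-separated integers. It avoids the `_` pipe placeholder, so it does not depend on R 4.2 or later.

```r
# Hypothetical file names; the real naming scheme is assumed to be
# hyphen-separated integers with the same number of parts in every name.
toy_names <- c("0-11-1-2", "0-3-1-1", "0-3-1-10", "0-3-1-2")

# Split each name on "-" and stack the numeric pieces into a matrix,
# one row per file name
parts <- do.call(rbind,
                 lapply(strsplit(toy_names, "-", fixed = TRUE), as.numeric))

# order() with several keys sorts by the first column, then breaks ties
# with the second column, and so on
numeric_order <- do.call(order, unname(as.data.frame(parts)))

sorted_names <- toy_names[numeric_order]
print(sorted_names)

# A plain character sort, for comparison; it compares the names as text
# rather than as numbers, so its result can differ
print(sort(toy_names))
```

Applying the same index vector to both the names and the paths, as the code above does, keeps the two lists aligned.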

Now, finally, here is how I run the N LASSOs on the N datasets, using the LASSO option of the glmnet package (in its function of the same name):

set.seed(188)     # to ensure reproducibility
# glmnet is not in the list of loaded packages, so call it with its namespace
LASSO.fits <- lapply(X = dfs, function(I) 
               glmnet::glmnet(x = as.matrix(select(I, starts_with("X"))), 
                  y = I$Y, alpha = 1))

# This stores and prints out all of the regression 
# equation specifications selected by LASSO when called
LASSO.coefs = LASSO.fits |> 
  Map(f = \(model) coef(model, s = .1))   
Variables.glmnet.LASSO.Selected <- LASSO.coefs |>
  Map(f = \(matr) matr |> as.matrix() |> 
       as.data.frame() |> filter(s1 != 0) |> rownames())   
# drop the intercept term by name so only candidate regressor names remain
Variables.glmnet.LASSO.Selected <- lapply(Variables.glmnet.LASSO.Selected, 
                            \(vars) setdiff(vars, "(Intercept)"))

The last executable line there creates an object whose contents look like this when printed:

> head(Variables.glmnet.LASSO.Selected, n = 4)
[[1]]
 [1] "X1"  "X2"  "X8"  "X9"  "X10" "X12" "X16" "X17" "X18" "X19" "X20" "X22" "X23" "X26"
[[2]]
 [1] "X1"  "X4"  "X5"  "X6"  "X8"  "X9"  "X13" "X15" "X18" "X19" "X22" "X24" "X25" "X29"
[[3]]
 [1] "X4"  "X5"  "X6"  "X8"  "X10" "X12" "X13" "X14" "X16" "X17" "X18" "X21" "X22" "X25" "X30"

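So what I need now is a parallel list, in the same format as the list of variables glmnet's LASSO regression selected, that stores for each dataset the names of only those candidate regressors that are structural, e.g.: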

> head(Variables.glmnet.LASSO.Selected, n = 4)
[[1]]
 [1] "X1"  "X2"  "X8"  "X9"  "X10" "X12" "X16" "X17" "X18" "X19" "X20" "X22" "X23" "X26"
[[2]]
 [1] "X1"  "X4"  "X5"  "X6"  "X8"  "X9"  "X13" "X15" "X18" "X19" "X22" "X24" "X25" "X29"
[[3]]
 [1] "X4"  "X5"  "X6"  "X8"  "X10" "X12" "X13" "X14" "X16" "X17" "X18" "X21" "X22" "X25" "X30"

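That is what it would look like if all of the first 4 specifications (i.e. equations) selected by LASSO were correct; if they are not all correct, one or more of these lines could of course differ in several ways!
Note: here are all the packages I load at the top of my script: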

# load all necessary packages
library(plyr)
library(dplyr)
library(stringi)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(data.table)
library(parallel)

Answer #1 (qpgpyjmq)

Try the following to recreate the first row of each source dataset:

Structural_or_Non <- lapply(datasets, function(j) {j[1, -1]})

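Then simply use lapply to apply the names function to each element of the list you just created, like so:

Structural_Variables <- lapply(Structural_or_Non, function(i) {
  names(i)[i == 1] })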

Nonstructural_Variables <- lapply(Structural_or_Non, function(i) {
  names(i)[i == 0] })

That should do it for you.
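
As a quick check of the approach, here is a minimal sketch on a toy dataset. Everything in it is made up for illustration: `toy_raw` has only 5 candidates instead of 30, `Selected` is a hypothetical stand-in for one element of `Variables.glmnet.LASSO.Selected`, and the final comparison step is one possible way to use the result rather than part of the answer itself.

```r
# Toy stand-in for one raw dataset: the first row holds the 0/1 indicators
# (Y has no indicator), the remaining rows hold made-up observations.
toy_raw <- data.frame(
  Y  = c(NA, 1.2, 0.7),
  X1 = c(1, 0.5, 0.1),
  X2 = c(0, 0.3, 0.9),
  X3 = c(1, 0.8, 0.4),
  X4 = c(0, 0.2, 0.6),
  X5 = c(1, 0.7, 0.3)
)
toy_datasets <- list(toy_raw)

# First row without the Y column, as in the answer
Structural_or_Non <- lapply(toy_datasets, function(j) j[1, -1])

# Names of the candidates flagged 1 (structural) and 0 (non-structural)
Structural_Variables <- lapply(Structural_or_Non, function(i) {
  names(i)[i == 1] })
Nonstructural_Variables <- lapply(Structural_or_Non, function(i) {
  names(i)[i == 0] })

# Hypothetical LASSO selection for the toy dataset
Selected <- list(c("X1", "X2", "X4"))

# Element-wise comparison of the selected set with the true structural set
comparison <- Map(function(sel, truth) {
  list(
    true_positives  = intersect(sel, truth),  # selected and structural
    false_positives = setdiff(sel, truth),    # selected but not structural
    false_negatives = setdiff(truth, sel)     # structural but not selected
  )
}, Selected, Structural_Variables)

str(comparison[[1]])
```

Because both lists are in dataset order, `Map` pairs each selection with the matching set of structural variables.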
