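In this collaborative research project, we are working on the second draft of a 2008 working paper that proposes a promising, straightforward, yet novel algorithm for optimal variable selection in supervised statistical learning. The novel variable selection algorithm explored and evaluated in this research has been dubbed "Estimated Exhaustive Regression" by its innovator, my collaborator, the noted econometrician Dr. Anthony Davies.
The key feature of these 260k datasets, each 503 x 31 and stored in csv format, is that the true underlying regressors (called "structural variables" in much of modern economics and econometrics research), i.e. those of the 30 initial candidates that best describe, explain, and predict the behaviour of the response, are known for every dataset before any analysis or manipulation is carried out on these synthetic samples. This is deliberate and by construction: it follows from the way Dr. Davies wrote the script he used to create them via Monte Carlo simulation. Which of the 30 candidate variables are the true underlying/structural variables for each dataset is recorded very simply in the first 2 rows of each file:
The first row is a binary indicator row 30 cells long, where 1 means the candidate variable is structural/explanatory/predictive for that dataset and 0 means it is not.
Now, I have the code below, which works fine for loading N datasets into R and preprocessing them before running the LASSO regressions: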
# these 2 lines together create a simple character list of
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/Data/top 50"
paths_list <- list.files(path = folderpath, full.names = TRUE, recursive = TRUE)
# shorten the names of each of the datasets corresponding to
# each file path in paths_list
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
# sort both lists of file names so that they are in the proper numeric order
my_order <- DS_names_list |>
  # split apart the listed numbers, convert them to numeric
  strsplit(split = "-", fixed = TRUE) |> unlist() |> as.numeric() |>
  # get them in a data frame
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)
DS_names_list <- DS_names_list[my_order]
paths_list <- paths_list[my_order]
# these lines read all of the data in each of the csv files,
# using the file paths stored in the list we just created
CL <- parallel::makeCluster(max(1L, parallel::detectCores() - 4L))
parallel::clusterExport(CL, c('paths_list'))
system.time(datasets <- parallel::parLapply(cl = CL, X = paths_list,
                                            fun = data.table::fread))
parallel::stopCluster(CL)  # release the worker processes once loading is done
# change the column names in every element of the list 'datasets'
datasets <- lapply(datasets, function(dataset_i) {
  colnames(dataset_i) <- c("Y", paste0("X", 1:30))
  dataset_i
})
# drop the first 3 rows (non-data rows, including the indicator row),
# then coerce every column to numeric and convert back to data.table
dfs <- lapply(datasets, function(i) { i[-1:-3, ] })
dfs <- lapply(dfs, \(X) { lapply(X, as.numeric) })
dfs <- lapply(dfs, function(i) { as.data.table(i) })
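The ordering step in the code above can be checked in isolation on a toy vector. This is only a sketch: the file-name pattern used here (four numbers separated by hyphens) is an assumption made for illustration, not something taken from the real folder.

```r
# Hypothetical file names; the real ones may have a different number of fields.
toy_names <- c("0-3-1-10", "0-3-1-2", "0-11-1-1", "0-3-1-1")

# Split each name on "-" and turn the pieces into one numeric row per name.
parts  <- strsplit(toy_names, split = "-", fixed = TRUE)
key_df <- as.data.frame(do.call(rbind, lapply(parts, as.numeric)))

# order() with several keys sorts by the first field, then the second, and so on.
numeric_order <- do.call(order, key_df)
print(toy_names[numeric_order])
```

Sorting on the numeric fields avoids the usual problem with plain character sorting, where "10" sorts before "2".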
Now, finally, here is how I run N LASSOs on the N datasets, using the LASSO setting (alpha = 1) of the glmnet package, in the function of the same name:
set.seed(188) # to ensure reproducibility
LASSO.fits <- lapply(X = dfs, function(I)
  glmnet(x = as.matrix(select(I, starts_with("X"))),
         y = I$Y, alpha = 1))
# This stores all of the regression equation specifications
# selected by LASSO, and prints them out when called
LASSO.coefs <- LASSO.fits |>
  Map(f = \(model) coef(model, s = .1))
Variables.glmnet.LASSO.Selected <- LASSO.coefs |>
  Map(f = \(matr) matr |> as.matrix() |>
        as.data.frame() |> filter(s1 != 0) |> rownames())
# drop the intercept by name, so a selected regressor is never removed by mistake
Variables.glmnet.LASSO.Selected <- lapply(Variables.glmnet.LASSO.Selected,
                                          \(vars) setdiff(vars, "(Intercept)"))
where the last line of executable code creates an object whose contents look like this when printed:
> head(Variables.glmnet.LASSO.Selected, n = 4)
[[1]]
[1] "X1" "X2" "X8" "X9" "X10" "X12" "X16" "X17" "X18" "X19" "X20" "X22" "X23" "X26"
[[2]]
[1] "X1" "X4" "X5" "X6" "X8" "X9" "X13" "X15" "X18" "X19" "X22" "X24" "X25" "X29"
[[3]]
[1] "X4" "X5" "X6" "X8" "X10" "X12" "X13" "X14" "X16" "X17" "X18" "X21" "X22" "X25" "X30"
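The selection step relies on the coefficient column being called s1, which may not hold in every glmnet version. A minimal base-R sketch that does not depend on the column name is shown here on simulated data; the toy x and y, and the choice of X1 and X4 as the true regressors, are made up for illustration.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 10
# Toy design matrix with columns named X1..X10; only X1 and X4 drive y.
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("X", 1:p)))
y <- 2 * x[, "X1"] - 3 * x[, "X4"] + rnorm(n)

fit  <- glmnet(x = x, y = y, alpha = 1)   # alpha = 1 is the LASSO penalty
beta <- as.matrix(coef(fit, s = 0.1))     # one column of coefficients at lambda = 0.1

# Keep the names of the non-zero coefficients, dropping the intercept by name.
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
print(selected)
```

Indexing the first column by position rather than by name keeps the extraction independent of how that column happens to be labelled.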
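So what I need now is to create a parallel list that stores an equivalent list of variable-name strings, capturing only the candidate regressors that are actually structural for each dataset, so that it can be compared with what glmnet's LASSO regression selected on that dataset, for example: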
> head(Variables.glmnet.LASSO.Selected, n = 4)
[[1]]
[1] "X1" "X2" "X8" "X9" "X10" "X12" "X16" "X17" "X18" "X19" "X20" "X22" "X23" "X26"
[[2]]
[1] "X1" "X4" "X5" "X6" "X8" "X9" "X13" "X15" "X18" "X19" "X22" "X24" "X25" "X29"
[[3]]
[1] "X4" "X5" "X6" "X8" "X10" "X12" "X13" "X14" "X16" "X17" "X18" "X21" "X22" "X25" "X30"
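That is what it would look like if all of the first 4 specifications (i.e. equations) selected by LASSO were correct; if the first 4 selected specifications are not all assumed to be correct, then one or more of these 4 lines could of course differ in several ways!
Note: here are all the packages I load at the top of the script: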
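To make "differ" concrete, here is one way the two parallel lists might be compared once both exist. It is a sketch on hand-made toy lists; the names selected, structural, and comparison are hypothetical and are not objects from the script.

```r
# Toy stand-ins: what LASSO picked versus what is truly structural.
selected   <- list(c("X1", "X2", "X8"), c("X4", "X5"))
structural <- list(c("X1", "X2", "X8"), c("X4", "X6"))

# Compare the two lists element by element.
comparison <- Map(function(sel, truth) {
  list(correct         = setequal(sel, truth),   # exactly the right variables
       false_positives = setdiff(sel, truth),    # selected but not structural
       false_negatives = setdiff(truth, sel))    # structural but not selected
}, selected, structural)

str(comparison)
```

A selection can therefore be wrong by including extra variables, by omitting structural ones, or both.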
# load all necessary packages
library(plyr)
library(dplyr)
library(stringi)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(data.table)
library(parallel)
library(glmnet)  # provides glmnet() and its coef() method, used for the LASSO fits
1 Answer
Try the following to recreate the first row of the source datasets:
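A minimal sketch of what this step might look like, assuming each raw table still has the 0/1 indicator row as its first row and its columns are already named Y, X1, X2, and so on. The toy list datasets_toy and the name Structural_Variables_or_Not are illustrative assumptions, and only three candidate columns are used to keep the example short.

```r
# Toy stand-ins for two raw datasets: row 1 holds the indicators (none for Y),
# the remaining rows hold observations.
datasets_toy <- list(
  data.frame(Y = c(NA, 0.5, 1.1), X1 = c(1, 0.2, 0.9),
             X2 = c(0, 1.4, 0.3), X3 = c(1, 0.8, 0.6)),
  data.frame(Y = c(NA, 2.0, 0.4), X1 = c(0, 1.1, 0.7),
             X2 = c(0, 0.5, 1.9), X3 = c(1, 0.1, 0.2))
)

# For each dataset, pull row 1 without the Y column as a named 0/1 vector.
Structural_Variables_or_Not <- lapply(datasets_toy, function(d) {
  row1 <- unlist(as.data.frame(d)[1, -1])
  storage.mode(row1) <- "numeric"   # keeps the names, unlike as.numeric()
  row1
})

print(Structural_Variables_or_Not)
```

Converting with as.data.frame() first means the same indexing behaves identically whether the tables are data.frames or data.tables.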
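Then simply use lapply to apply the names function to each element of the list you just created, like this: Structural_Variables <- lapply(Structural_Variables_or_Not, function(i) { names(i)[i == 1] })
That should do it.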