Creating an R version of Excel's SUMIF function that can be used to measure the performance of a model run consecutively on multiple datasets

5jvtdoz2 posted on 2023-01-03 in Other

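First, this is a follow-up to a question I recently asked on Stack Overflow. That question received a satisfactory answer, but this one concerns a more complex setting/application.
This time, I am repeating the same computations on the selections made for many datasets and on the corresponding true models of those datasets, which means the same function cannot be applied directly.
A bigger problem is that when I load a whole folder of datasets into a single R object, rather than one dataset into one object, R automatically names the 31 columns V1 through V31, and I cannot easily rename the columns of each dataset after loading.
So, instead of having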

df<- read.csv("0-11-3-462.csv", header = FALSE)
# change column names of all the columns in the dataframe 'df'
colnames(df) <- c("Y", "X1","X2", "X3", "X4","X5", "X6", "X7","X8", "X9",
                  "X10","X11", "X12", "X13","X14", "X15", "X16","X17", 
                  "X18", "X19","X20", "X21", "X22","X23", "X24", "X25",
                  "X26", "X27", "X28","X29", "X30")
True_IVs <- df[1, -1]

which results in:

> str(True_IVs)
'data.frame':   1 obs. of  30 variables:
 $ X1 : chr "0"
 $ X2 : chr "0"
 $ X3 : chr "0"
 $ X4 : chr "1"
 $ X5 : chr "0"
 $ X6 : chr "0"
 $ X7 : chr "0"
 $ X8 : chr "0"

...
I now have:

filepaths_list <- list.files(path = filepath, full.names = TRUE, recursive = TRUE)
datasets <- lapply(filepaths_list, read.csv, header = FALSE)

True_IVs <- lapply(datasets, function(j) {j[1, -1]})

datasets <- lapply(datasets, function(i) {i[-1:-3, ]})
datasets <- lapply(datasets, \(X) { lapply(X, as.numeric) })

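The datasets look like this (the V columns actually run through V31, and this is of course only the head of the first dataset in the datasets object):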

> head(datasets[[1]], n = 5)
                 V1           V2          V3          V4           V5
1 Regressor present            0           0           0            1
2                              1           2           3            4
3                 Y           X1          X2          X3           X4
4       4.119024459 -1.350655759 1.901787258 0.205749783  0.242920532
5       1.737430635   0.26677565 0.054290757 1.510124319 -0.618655652
            V6           V7          V8           V9         V10
1            0            0           0            0           0
2            5            6           7            8           9
3           X5           X6          X7           X8          X9
4 -0.405946237 -0.667673545 0.745735562  0.143317951 1.376182976
5  0.289294477 -0.220927214 0.251479422 -0.094245944 0.792214818

Now, jumping to the same point where the previous question started, but in this higher-dimensional case, IVs_Selected_by_BE yields:

> IVs_Selected_by_BE
[[1]]
 [1] "V3"  "V4"  "V5"  "V6"  "V9"  "V11" "V14" "V16" "V18" "V20" "V21"
[12] "V23" "V26" "V27" "V28" "V29" "V31"

[[2]]
 [1] "V3"  "V6"  "V7"  "V8"  "V9"  "V12" "V13" "V14" "V15" "V17" "V18"
[12] "V21" "V22" "V23" "V24" "V25" "V26" "V30"

This is annoying and unsettling, and the same thing now happens (more or less) with True_Regressors as well:
[[1]]
 [1] "V5"  "V11" "V14" "V20" "V21" "V23" "V26" "V27" "V28" "V29" "V31"

[[2]]
 [1] "V7"  "V8"  "V14" "V15" "V17" "V18" "V21" "V22" "V24" "V26" "V30"

Note: True_Regressors is obtained by running:

True_Regressors <- lapply(True_IVs, function(i) { names(i)[i == 1] })
# versus only having to use this for the single-dataset case previously
True_Regressors <- names(True_IVs)[True_IVs == 1]

My problem here is that the number to the right of each V is wrong: every one of them is exactly 1 too large.
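To make the offset concrete: the first column holds Y, so the default name V(k+1) refers to regressor Xk. The following is a minimal base-R sketch of that mapping. The helper `v_to_x` and the example vector are hypothetical and typed by hand for illustration; they are not taken from the real output.

```r
# Hypothetical helper: map default read.csv names (V2..V31) to X1..X30.
# Assumes column V1 holds Y, so V(k + 1) corresponds to regressor Xk.
v_to_x <- function(v_names) {
  idx <- as.integer(sub("^V", "", v_names))  # strip the "V" prefix, keep the number
  paste0("X", idx - 1L)                      # shift the index down by one
}

example_names <- c("V5", "V11", "V14")  # illustrative values only
shifted <- v_to_x(example_names)
print(shifted)
```

Applied with lapply over IVs_Selected_by_BE or True_Regressors, a helper like this would relabel the results without touching the loaded data.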

yeotifhr #1

With the vroom package, this is straightforward:

  • Read all .csv files in getwd() into one data frame in a single call, storing the file source in a "source_file" column:
library(dplyr)
library(vroom)
all_data <- vroom(list.files(pattern = '\\.csv$'), id = 'source_file') # anchored regex: only names ending in .csv

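(You are not actually reading the files in at this point. vroom builds a lookup index and reads values only when they are needed, which is what makes it so fast.)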

  • Rename the columns (except "source_file"):
names(all_data)[-1] <- paste0('X', 1:(ncol(all_data)-1))
  • Add the row number (per source file) as the second column:
all_data <- all_data |>
    group_by(source_file) |>
    mutate(row_number = row_number(), .before = 2)

Output so far:

> all_data |> head(3)
# A tibble: 3 x 38
# Groups:   source_file [1]
  source_f~1 row_n~2    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
  <chr>        <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 file1.csv        1   418   676   712   243   319    82   699   851   501   207
2 file1.csv        2   688   402   762   964   895   513   424   335   993   119
3 file1.csv        3   135   201    37    13   104   378   661   874   586   302
# ... with 26 more variables: X11 <dbl>, X12 <dbl>, X13 <dbl>, X14 <dbl>,
#   X15 <dbl>, X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>,
#   X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>,
#   X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>,
#   X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, and abbreviated variable names
#   1: source_file, 2: row_number
# i Use `colnames()` to see all variable names
  • filter/select/mutate... as needed:
True_IVs <- all_data |>
    filter(row_number == 1) |>
    select(X1)

datasets <- all_data |> filter(row_number > 3)

Edit: if you need to skip leading rows, you can do so by setting the skip argument, e.g. vroom(..., skip = 2) to skip rows 1-2.
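For comparison, base R's read.csv also takes skip and col.names arguments. The sketch below is not the vroom call above. It writes a tiny made-up file to a temporary path, assuming the three-row preamble described in the question, and reads it back with the preamble skipped and the columns named on the way in.

```r
# Write a tiny headerless CSV with three leading metadata rows to a temp file
# (assumption: mirrors the layout described in the question, with 2 regressors).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Regressor present,0,1",
  ",1,2",
  "Y,X1,X2",
  "4.1,-1.3,1.9",
  "1.7,0.2,0.05"
), tmp)

# Skip the three leading rows and name the columns while reading.
df <- read.csv(tmp, header = FALSE, skip = 3, col.names = c("Y", "X1", "X2"))
str(df)
unlink(tmp)  # remove the temporary file
```

Because the preamble rows are skipped, the remaining columns are parsed as numeric directly, so no later as.numeric conversion is needed. The 0/1 flag row would then have to be read separately.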

If you need all the leading rows but want to reorder them later, you can set the column names like this:

all_data <- vroom(list.files(pattern = '\\.csv$'),
                  id = 'source_file',
                  col_names = paste0('X', 1:n) ## n = column count in source files; paste0 avoids names like "X 1"
                  )

and then filter/reorder by row number as described above.
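If staying with the list-of-data-frames approach from the question is preferred, the renaming can also be done inside lapply. This is a rough base-R sketch rather than part of the vroom solution. The toy data, the helpers `make_toy` and `new_names`, and the `selected` list (standing in for IVs_Selected_by_BE) are all hypothetical. It assumes the layout shown in the question: row 1 holds the 0/1 flags and column 1 holds Y.

```r
# Toy stand-ins for two headerless CSV files (assumption: row 1 = 0/1 flags,
# row 2 = indices, row 3 = names, column 1 = Y). Only 3 regressors, not 30.
make_toy <- function(flags) {
  data.frame(
    V1 = c("Regressor present", "", "Y", "4.1", "1.7"),
    V2 = c(flags[1], "1", "X1", "-1.3", "0.2"),
    V3 = c(flags[2], "2", "X2", "1.9", "0.05"),
    V4 = c(flags[3], "3", "X3", "0.2", "1.5"),
    stringsAsFactors = FALSE
  )
}
datasets <- list(make_toy(c("0", "1", "1")), make_toy(c("1", "0", "0")))

# Rename every data frame in the list: first column is Y, the rest X1..Xp.
new_names <- function(df) c("Y", paste0("X", seq_len(ncol(df) - 1)))
datasets <- lapply(datasets, function(df) setNames(df, new_names(df)))

# Row 1 (minus the Y column) flags the regressors of the true model.
True_IVs <- lapply(datasets, function(df) df[1, -1])
True_Regressors <- lapply(True_IVs, function(i) names(i)[i == 1])

# Hypothetical selections, standing in for IVs_Selected_by_BE.
selected <- list(c("X1", "X2"), c("X1", "X3"))

# SUMIF-style tally: how many selected regressors are in the true model.
true_positives <- mapply(function(sel, truth) sum(sel %in% truth),
                         selected, True_Regressors)
print(True_Regressors)
print(true_positives)
```

Renaming before extracting True_IVs means the names in True_Regressors already line up with X1..Xp, so the off-by-one never appears. The same mapply pattern could count false positives with `!(sel %in% truth)`.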
