R: How can I efficiently create new variables and assign column names with data.table?

Posted by lxkprmvk on 2023-02-06 in Other

I need to compute new columns from data like the following:

structure(list(english_score = c(3L, 4L, 3L, 3L, 4L, 3L, 4L, 
2L, 4L, 2L, 3L, 3L, 2L, 2L, 3L, 4L, 3L, 3L, 4L, 3L, 4L, 3L, 2L
), math_score = c(4L, 4L, 3L, 4L, 4L, 4L, 3L, 2L, 3L, 3L, 4L, 
2L, 4L, 2L, 4L, 2L, 3L, 3L, 2L, 2L, 2L, 4L, 2L), science_score = c(3L, 
4L, 4L, 4L, 3L, 4L, 4L, 3L, 3L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
2L, 3L, 2L, 3L, 3L, 4L)), row.names = c(NA, -23L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000002478ee34d50>)

I want to produce something like this:

structure(list(english_score = c(3L, 4L, 3L, 3L, 4L, 3L, 4L, 
2L, 4L, 2L, 3L, 3L, 2L, 2L, 3L, 4L, 3L, 3L, 4L, 3L, 4L, 3L, 2L
), math_score = c(4L, 4L, 3L, 4L, 4L, 4L, 3L, 2L, 3L, 3L, 4L, 
2L, 4L, 2L, 4L, 2L, 3L, 3L, 2L, 2L, 2L, 4L, 2L), science_score = c(3L, 
4L, 4L, 4L, 3L, 4L, 4L, 3L, 3L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
2L, 3L, 2L, 3L, 3L, 4L), english_level = c("Level C", "Level D", 
"Level C", "Level C", "Level D", "Level C", "Level D", "Level B", 
"Level D", "Level B", "Level C", "Level C", "Level B", "Level B", 
"Level C", "Level D", "Level C", "Level C", "Level D", "Level C", 
"Level D", "Level C", "Level B"), math_level = c("Level D", "Level D", 
"Level C", "Level D", "Level D", "Level D", "Level C", "Level B", 
"Level C", "Level C", "Level D", "Level B", "Level D", "Level B", 
"Level D", "Level B", "Level C", "Level C", "Level B", "Level B", 
"Level B", "Level D", "Level B"), science_level = c("Level C", 
"Level D", "Level D", "Level D", "Level C", "Level D", "Level D", 
"Level C", "Level C", "Level B", "Level C", "Level D", "Level D", 
"Level D", "Level D", "Level D", "Level D", "Level B", "Level C", 
"Level B", "Level C", "Level C", "Level D")), row.names = c(NA, 
-23L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 
0x000002478ee34d50>)

So far, my approach has been to use a function (built on dplyr's case_when()) that computes the level for a new variable...

myfunction<-function(x){case_when(x<2~"Level A",
                              x>1 & x<3~"Level B",
                              x>2 & x<4~"Level C",
                              x>3~"Level D")}

...and then to create the new variables and assign their values one at a time.

DT[, english_level:=lapply(.SD, myfunction), .SDcols='english_score']

DT[, math_level:=lapply(.SD, myfunction), .SDcols='math_score']

DT[, science_level:=lapply(.SD, myfunction), .SDcols='science_score']

How can I simplify this process, preferably using data.table?

Answer 1 (qhhrdooz)

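Here is an option that avoids writing your own function: create a mapping table, then map each score to its level. Two variants are shown; both produce the same columns, so run either one.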

grade_map = data.table(
  score = c(1:5),
  lvl = paste(rep("Level", 5), c("A","B","C","D","F"))
)

lvl_cols = gsub("score","level", names(orig))
score_cols = names(orig)

# Group by Row
orig[, 
  (lvl_cols) := lapply(.SD, function(x) {grade_map[score %in% x]$lvl}), 
  by = 1:nrow(orig), 
  .SDcols = score_cols
]

# Using merge
orig[, 
  (lvl_cols) := lapply(.SD, function(x) {
    merge(data.table(score = x), grade_map, by = "score", sort = FALSE)$lvl
  }), 
  .SDcols = score_cols
]

Result:

    english_score math_score science_score english_level math_level science_level
 1:             3          4             3       Level C    Level D       Level C
 2:             4          4             4       Level D    Level D       Level D
 3:             3          3             4       Level C    Level C       Level D
 4:             3          4             4       Level C    Level D       Level D
 5:             4          4             3       Level D    Level D       Level C
 6:             3          4             4       Level C    Level D       Level D
 7:             4          3             4       Level D    Level C       Level D
 8:             2          2             3       Level B    Level B       Level C
 9:             4          3             3       Level D    Level C       Level C
10:             2          3             2       Level B    Level C       Level B
11:             3          4             3       Level C    Level D       Level C
12:             3          2             4       Level C    Level B       Level D
13:             2          4             4       Level B    Level D       Level D
14:             2          2             4       Level B    Level B       Level D
15:             3          4             4       Level C    Level D       Level D
16:             4          2             4       Level D    Level B       Level D
17:             3          3             4       Level C    Level C       Level D
18:             3          3             2       Level C    Level C       Level B
19:             4          2             3       Level D    Level B       Level C
20:             3          2             2       Level C    Level B       Level B
21:             4          2             3       Level D    Level B       Level C
22:             3          4             3       Level C    Level D       Level C
23:             2          2             4       Level B    Level B       Level D
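
As a further sketch of the same mapping-table idea, the lookup can also be written as a data.table join instead of a per-row group or merge(). The three-row `orig` below is made-up sample data (not the asker's full table), and the other names simply mirror this answer. One difference worth noting: with a join, a score missing from `grade_map` yields NA rather than dropping a row.

```r
library(data.table)

# Hypothetical three-row sample in the shape of the question's data
orig <- data.table(
  english_score = c(3L, 4L, 2L),
  math_score    = c(4L, 4L, 2L)
)

# Mapping table from numeric score to level label (as in the answer above)
grade_map <- data.table(
  score = 1:5,
  lvl   = paste("Level", c("A", "B", "C", "D", "F"))
)

score_cols <- names(orig)
lvl_cols   <- sub("score", "level", score_cols)

# For each score column, join its values against grade_map on `score`.
# The result has one element per input value, in input order,
# with NA where a score has no entry in grade_map.
orig[, (lvl_cols) := lapply(.SD, function(x) grade_map[.(x), on = "score", lvl]),
     .SDcols = score_cols]

print(orig)
```

Because the join is vectorised over the whole column, this avoids the by-row grouping of the first variant, which may matter on larger tables.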
Answer 2 (c7rzv4ha)

I would do it like this (I have called your data DT, because data() is already a base R function, utils::data()):

score_cols  <- grep("_score$", names(DT), value = TRUE)
level_cols  <- sub("_score", "_level", score_cols)

DT[, 
    (level_cols) := lapply(.SD, myfunction),
    .SDcols = score_cols
]

Also, your myfunction() uses dplyr::case_when(). That will work, but some dplyr functions conflict with data.table ones (between(), first() and last() in my current versions). You can replace it with data.table::fcase().

myfunction <- function(x) {
    fcase(
        x == 1, "Level A",
        x == 2, "Level B",
        x == 3, "Level C",
        x == 4, "Level D"
    )
}

This should also be faster than the dplyr version.
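
Note that the equality tests above match the question's case_when() only for whole-number scores from 1 to 4. If scores might fall outside that set, a range-based variant keeps the original cut-offs. This is only a sketch, and the name `level_from_score` is made up for illustration:

```r
library(data.table)

# Range-based variant: fcase() returns the value for the first TRUE condition,
# so each test only needs an upper bound.
level_from_score <- function(x) {
  fcase(
    x < 2,  "Level A",
    x < 3,  "Level B",
    x < 4,  "Level C",
    x >= 4, "Level D"
  )
}

# Missing scores fall through every condition and come back as NA
level_from_score(c(1L, 2L, 3L, 4L, NA))
```

It can be passed to lapply(.SD, ...) exactly like myfunction() above.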
Furthermore, with this particular function you can replace the case-when style logic entirely by assigning the nth letter of the alphabet as the level:

assign_letter_grade  <- function(n) {
    paste("Level", LETTERS[n])
}
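
A small sketch of how this letter-based function might be applied across all score columns, using a made-up three-row DT. The guarded variant `assign_letter_grade_safe` is a hypothetical addition, not part of the answer. It is included because LETTERS[n] returns NA for a missing score (which paste() turns into the string "Level NA") and drops zeros entirely, which would change the length of the result.

```r
library(data.table)

assign_letter_grade <- function(n) {
  paste("Level", LETTERS[n])
}

# Hypothetical guarded variant: returns NA for missing or out-of-range scores
assign_letter_grade_safe <- function(n) {
  ok  <- !is.na(n) & n >= 1 & n <= length(LETTERS)
  out <- rep(NA_character_, length(n))
  out[ok] <- paste("Level", LETTERS[n[ok]])
  out
}

# Made-up sample in the shape of the question's data
DT <- data.table(
  english_score = c(3L, 4L, 2L),
  math_score    = c(4L, 3L, 2L)
)

score_cols <- grep("_score$", names(DT), value = TRUE)
level_cols <- sub("_score$", "_level", score_cols)

# One assignment creates every *_level column by reference
DT[, (level_cols) := lapply(.SD, assign_letter_grade), .SDcols = score_cols]

print(DT)
```

For clean integer scores from 1 to 4, as in the question, the plain version is enough; the guarded one only matters if the data may contain NA or unexpected values.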
