A better way to generate groups based on multiple rules in R

zmeyuzjn, posted on 2023-02-10 in Other

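I have a dataset with many columns, where each combination of row values determines, through a set of rules, the new value in another column. The combinations are diverse, and not every column takes part in every rule. In addition, some columns hold microorganism names, which tend to be quite long. As a result, the approach I currently use (case_when) gets rather messy, and reviewing the rules becomes tedious.
Is there a better way to do this that is cleaner and easier to review? The real dataset has more than 70,000 observations, so here is a dummy dataset that can be used instead.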

col1   col2   col3   col4     col5  col6
1      A      43     string1  AA    verylongnamehere
2      B      22     string2  BB    anotherlongname
3      C      15     string3  CC    yetanotherlongname
4      D      100    string4  DD    hereisanotherlongname
5      E      60     string5  EE    thisisthelastlongname

test <- data.frame(
  col1 = c(1,2,3,4,5),
  col2 = c("A","B","C","D","E"),
  col3 = c(43,22,15,100,60),
  col4 = c("string1","string2","string3","string4","string5"),
  col5 = c("AA","BB","CC","DD","EE"),
  col6 = c("verylongnamehere", "anotherlongname","yetanotherlongname","hereisanotherlongname","thisisthelastlongname")
)

The code below is an example of the rules and the code I use:

library(dplyr)

test2 <- test %>%
  mutate(new_col = case_when(
    col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
    col3 >= 60 & col5 == "DD" ~ "result2",
    col1 %in% c(2,3,4) & 
     col2 %in% c("B","D") & 
     col5 %in% c("BB","CC","DD") & 
     col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
    TRUE ~ "result4"
  ))

Answer #1 (8yparm6h)

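The conditions would probably be easier to review if they were kept in a spreadsheet. Here is how to read the conditions from a spreadsheet and build the case_when call.
Spreadsheet representation (conditions.xlsx), one row per rule, with empty cells where a column takes no part in the rule:

Result   col1        col2        col3   col4  col5               col6
result1  1           "A"                                         "verylongnamehere"
result2                          >= 60        "DD"
result3  c(2, 3, 4)  c("B","D")               c("BB","CC","DD")  c("anotherlongname","yetanotherlongname")
result4

Note that == and %in% are treated as the default operators and are not written out in the spreadsheet.
Load the conditions: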

library(readxl)
cond <- read_excel('conditions.xlsx')

dput(cond)

structure(list(Result = c("result1", "result2", "result3", "result4"
), col1 = c("1", NA, "c(2, 3, 4)", NA), col2 = c("\"A\"", NA, 
"c(\"B\",\"D\")", NA), col3 = c(NA, ">= 60", NA, NA), col4 = c(NA, 
NA, NA, NA), col5 = c(NA, "\"DD\"", "c(\"BB\",\"CC\",\"DD\")", 
NA), col6 = c("\"verylongnamehere\"", NA, "c(\"anotherlongname\",\"yetanotherlongname\")", 
NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))
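
For anyone without the spreadsheet at hand, the same table can be typed directly in R. This is only a sketch: it assumes the tibble package is installed and uses tribble() as a stand-in for conditions.xlsx, which is not part of the original answer.

```r
library(tibble)

# Hypothetical in-code replacement for conditions.xlsx:
# one row per rule, NA where a column takes no part in the rule.
cond <- tribble(
  ~Result,   ~col1,        ~col2,          ~col3,   ~col4, ~col5,                 ~col6,
  "result1", "1",          '"A"',          NA,      NA,    NA,                    '"verylongnamehere"',
  "result2", NA,           NA,             ">= 60", NA,    '"DD"',                NA,
  "result3", "c(2, 3, 4)", 'c("B","D")',   NA,      NA,    'c("BB","CC","DD")',   'c("anotherlongname","yetanotherlongname")',
  "result4", NA,           NA,             NA,      NA,    NA,                    NA
)
```

The values are kept as strings on purpose, because the following steps paste them into R code as text.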

Process the conditions into a case_when command:

# separate conditions and results
results <- cond$Result
cond <- trimws(as.matrix(cond[, -1]))

# add default %in% operator for vectors
add.in <- grepl('^c\\(', cond)
cond[add.in] <- paste('%in%', cond[add.in])
# add default ==
add.equals <- grepl('^[^<>%!]', cond)
cond[add.equals] <- paste('==', cond[add.equals])

# add column names to conditions and join them together with ' & '
col.cond <- apply(cond, 1, \(x) {
  col.cond <- paste(colnames(cond), x)[!is.na(x)]
  paste(col.cond, collapse=' & ')
})
# put TRUE where no condition was given (default value)
col.cond[col.cond==''] <- 'TRUE'

# add results and join all together
case.when <- paste0(col.cond, ' ~ "', results, '"', collapse=',\n ')
# complete the case_when()
case.when <- paste('case_when(\n',
               case.when,
               '\n)')
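
The step that adds the default operators can be tried in isolation on a small vector. The sketch below uses the same two regular expressions as the code above; the vector `spec` and its contents are made up for illustration.

```r
# Made-up cell contents: a scalar, a quoted string, an explicit comparison,
# a vector, an explicit inequality, and an empty cell
spec <- c('1', '"A"', '>= 60', 'c(2, 3, 4)', '!= "B"', NA)

# Cells that start with c( get %in% prepended
is.vec <- grepl('^c\\(', spec)
spec[is.vec] <- paste('%in%', spec[is.vec])

# Cells that do not start with <, >, % or ! get == prepended;
# grepl() returns FALSE for NA, so empty cells stay NA
is.eq <- grepl('^[^<>%!]', spec)
spec[is.eq] <- paste('==', spec[is.eq])

print(spec)
```

The order matters: %in% is added first, so those cells already start with % and are skipped when == is added.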

case.when now holds the case_when command as a string:

cat(case.when)
# case_when(
#  col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
#  col3 >= 60 & col5 == "DD" ~ "result2",
#  col1 %in% c(2, 3, 4) & col2 %in% c("B","D") & col5 %in% c("BB","CC","DD") & col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
#  TRUE ~ "result4" 
# )

Now we only need to parse it, evaluate it, and use it inside mutate:

test2 <- test %>% 
  mutate(new_col = eval(parse(text=case.when)))

#   col1 col2 col3    col4 col5                  col6 new_col
# 1    1    A   43 string1   AA      verylongnamehere result1
# 2    2    B   22 string2   BB       anotherlongname result3
# 3    3    C   15 string3   CC    yetanotherlongname result4
# 4    4    D  100 string4   DD hereisanotherlongname result2
# 5    5    E   60 string5   EE thisisthelastlongname result4
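
A possible variation, not part of the original answer: instead of eval(parse(text = ...)), the string can be parsed with rlang::parse_expr() and injected into mutate() with !!. The sketch assumes the rlang package is installed (it comes with dplyr) and writes the case.when string out by hand so that the example runs on its own.

```r
library(dplyr)

test <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("A", "B", "C", "D", "E"),
  col3 = c(43, 22, 15, 100, 60),
  col5 = c("AA", "BB", "CC", "DD", "EE"),
  col6 = c("verylongnamehere", "anotherlongname", "yetanotherlongname",
           "hereisanotherlongname", "thisisthelastlongname")
)

# Same kind of string as the generated one, written out by hand here
case.when <- paste(
  'case_when(',
  '  col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",',
  '  col3 >= 60 & col5 == "DD" ~ "result2",',
  '  col1 %in% c(2, 3, 4) & col2 %in% c("B","D") & col5 %in% c("BB","CC","DD") &',
  '    col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",',
  '  TRUE ~ "result4"',
  ')',
  sep = '\n'
)

# parse_expr() turns the string into a single call; !! injects it into mutate()
test2 <- test %>%
  mutate(new_col = !!rlang::parse_expr(case.when))
```

Either way the string is executed as code, so the spreadsheet should come from a trusted source.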

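Following your example, I only considered conditions combined with & as the logical operator. If | is used as well, the spreadsheet needs one more column per data column to specify the logical operator (& or |). If the conditions are more complex and involve parentheses, this approach may not work.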
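
For rules that mix &, | and parentheses, a different table layout might be easier: one column holding the full condition as text and one holding the result. The sketch below is an assumption and not the layout used in the answer; the `rules` table, its column names and its example rules are hypothetical, and the rlang package is assumed to be installed.

```r
library(dplyr)

test <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("A", "B", "C", "D", "E"),
  col3 = c(43, 22, 15, 100, 60)
)

# Hypothetical rule table: a free-form condition and its result per row
rules <- data.frame(
  condition = c('col1 == 1 | col3 >= 100',
                '(col2 %in% c("B", "C")) & col3 < 30',
                'TRUE'),
  result = c("low_or_high", "mid", "other")
)

# Build one `condition ~ "result"` expression per rule
rule.exprs <- lapply(
  paste0(rules$condition, ' ~ "', rules$result, '"'),
  rlang::parse_expr
)

# Splice the expressions into case_when() inside mutate()
test3 <- test %>%
  mutate(new_col = case_when(!!!rule.exprs))
```

The trade-off is that the per-column overview is lost, since each condition is a single free-form cell again.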
