A better way to generate groups based on many rules in R

Posted by zmeyuzjn on 2023-02-10 in Other

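I have a dataset with many columns, where each combination of row values determines, through a set of rules, the new value in another column. The combinations vary widely, and not every column is involved in every rule. In addition, some columns contain microorganism names, which tend to be long. As a result, my current approach (case_when) has become quite messy, and reviewing the rules is tedious.
Is there a better way to do this that is cleaner and easier to review? The real dataset has more than 70,000 observations, so here is a dummy dataset to work with.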

col1   col2   col3   col4     col5  col6
1      A      43     string1  AA    verylongnamehere
2      B      22     string2  BB    anotherlongname
3      C      15     string3  CC    yetanotherlongname
4      D      100    string4  DD    hereisanotherlongname
5      E      60     string5  EE    thisisthelastlongname

test <- data.frame(
  col1 = c(1,2,3,4,5),
  col2 = c("A","B","C","D","E"),
  col3 = c(43,22,15,100,60),
  col4 = c("string1","string2","string3","string4","string5"),
  col5 = c("AA","BB","CC","DD","EE"),
  col6 = c("verylongnamehere", "anotherlongname","yetanotherlongname","hereisanotherlongname","thisisthelastlongname")
)

The code below is an example of the rules and code I use:

library(dplyr)

test2 <- test %>%
  mutate(new_col = case_when(
    col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
    col3 >= 60 & col5 == "DD" ~ "result2",
    col1 %in% c(2,3,4) & 
     col2 %in% c("B","D") & 
     col5 %in% c("BB","CC","DD") & 
     col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
    TRUE ~ "result4"
  ))

Source: https://stackoverflow.com/questions/75370573/better-way-for-generating-groups-based-on-many-rules-in-r

1 Answer

8yparm6h1#

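If the conditions live in a spreadsheet, they may be easier to review. Here is how to read the conditions from a spreadsheet and build the case_when from them.
Spreadsheet representation (conditions.xlsx):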

Result   col1        col2        col3   col4  col5               col6
result1  1           "A"                                         "verylongnamehere"
result2                          >= 60        "DD"
result3  c(2, 3, 4)  c("B","D")               c("BB","CC","DD")  c("anotherlongname","yetanotherlongname")
result4

(Table reconstructed from the dput(cond) output further down; blank cells mean no condition on that column.)

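Note that == and %in% are treated as the default operators and are not written out explicitly in the spreadsheet.
Load the conditions: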

library(readxl)
cond <- read_excel('conditions.xlsx')

dput(cond):

structure(list(Result = c("result1", "result2", "result3", "result4"
), col1 = c("1", NA, "c(2, 3, 4)", NA), col2 = c("\"A\"", NA, 
"c(\"B\",\"D\")", NA), col3 = c(NA, ">= 60", NA, NA), col4 = c(NA, 
NA, NA, NA), col5 = c(NA, "\"DD\"", "c(\"BB\",\"CC\",\"DD\")", 
NA), col6 = c("\"verylongnamehere\"", NA, "c(\"anotherlongname\",\"yetanotherlongname\")", 
NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))
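
If a plain-text file is easier to keep under version control than an .xlsx, the same rule table could probably be stored as CSV. The sketch below is an assumption on my part, not part of the original approach: it rebuilds the table shown by dput(cond) above as a base data.frame, writes it to a temporary CSV (a stand-in for a hypothetical conditions.csv), and reads it back with every cell as text and empty cells as NA.

```r
# Rule table equivalent to the dput(cond) output above (all cells as text)
cond <- data.frame(
  Result = c("result1", "result2", "result3", "result4"),
  col1 = c("1", NA, "c(2, 3, 4)", NA),
  col2 = c('"A"', NA, 'c("B","D")', NA),
  col3 = c(NA, ">= 60", NA, NA),
  col4 = NA_character_,
  col5 = c(NA, '"DD"', 'c("BB","CC","DD")', NA),
  col6 = c('"verylongnamehere"', NA,
           'c("anotherlongname","yetanotherlongname")', NA)
)

# A temp file stands in for a hypothetical conditions.csv
path <- tempfile(fileext = ".csv")
write.csv(cond, path, row.names = FALSE, na = "")

# Read every column as character and turn empty cells back into NA
cond_csv <- read.csv(path, colClasses = "character", na.strings = "")
```

cond_csv is a plain data.frame rather than a tibble, but the later as.matrix(cond[, -1]) step should treat it the same way.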

Turn the conditions into a case_when command:

# separate conditions and results
results <- cond$Result
cond <- trimws(as.matrix(cond[, -1]))

# add default %in% operator for vectors
add.in <- grepl('^c\\(', cond)
cond[add.in] <- paste('%in%', cond[add.in])
# add default ==
add.equals <- grepl('^[^<>%!]', cond)
cond[add.equals] <- paste('==', cond[add.equals])

# add column names to conditions and join them together with ' & '
col.cond <- apply(cond, 1, \(x) {
  col.cond <- paste(colnames(cond), x)[!is.na(x)]
  paste(col.cond, collapse=' & ')
})
# put TRUE where no condition was given (default value)
col.cond[col.cond==''] <- 'TRUE'

# add results and join all together
case.when <- paste0(col.cond, ' ~ "', results, '"', collapse=',\n ')
# complete the case_when()
case.when <- paste('case_when(\n',
               case.when,
               '\n)')
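
Because the conditions are typed by hand into spreadsheet cells, a typo can produce a string that is not valid R. One possible safeguard, sketched here with made-up example conditions, is to try parsing each condition on its own before building the full command. This only checks syntax; it does not check that the column names exist in the data.

```r
# Hypothetical conditions as they might come out of the step above;
# the second one has an unbalanced parenthesis on purpose.
col.cond <- c('col1 == 1 & col2 == "A"',
              'col1 %in% c(2, 3, 4',
              'TRUE')

# TRUE if the string parses as exactly one R expression, FALSE otherwise
parses_ok <- vapply(col.cond, function(x) {
  tryCatch(length(parse(text = x)) == 1, error = function(e) FALSE)
}, logical(1), USE.NAMES = FALSE)

# Conditions that need fixing in the spreadsheet
bad_cond <- col.cond[!parses_ok]
print(bad_cond)
```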

case.when now holds the case_when command as a string:

cat(case.when)
# case_when(
#  col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
#  col3 >= 60 & col5 == "DD" ~ "result2",
#  col1 %in% c(2, 3, 4) & col2 %in% c("B","D") & col5 %in% c("BB","CC","DD") & col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
#  TRUE ~ "result4" 
# )

Now we only need to parse it, evaluate it, and use it inside mutate:

test2 <- test %>% 
  mutate(new_col = eval(parse(text=case.when)))

#   col1 col2 col3    col4 col5                  col6 new_col
# 1    1    A   43 string1   AA      verylongnamehere result1
# 2    2    B   22 string2   BB       anotherlongname result3
# 3    3    C   15 string3   CC    yetanotherlongname result4
# 4    4    D  100 string4   DD hereisanotherlongname result2
# 5    5    E   60 string5   EE thisisthelastlongname result4
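
As a possible alternative to eval(parse(text = ...)) on one big string, each rule can be parsed into its own expression and spliced into case_when with !!!. This is a sketch that assumes the rlang package is available (it is installed alongside dplyr); the rules vector is written out by hand here, with one "condition ~ result" string per rule.

```r
library(dplyr)

test <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("A", "B", "C", "D", "E"),
  col3 = c(43, 22, 15, 100, 60),
  col5 = c("AA", "BB", "CC", "DD", "EE"),
  col6 = c("verylongnamehere", "anotherlongname", "yetanotherlongname",
           "hereisanotherlongname", "thisisthelastlongname")
)

# One string per rule, in the form "condition ~ result"
rules <- c(
  'col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1"',
  'col3 >= 60 & col5 == "DD" ~ "result2"',
  paste('col1 %in% c(2, 3, 4) & col2 %in% c("B","D") &',
        'col5 %in% c("BB","CC","DD") &',
        'col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3"'),
  'TRUE ~ "result4"'
)

# Parse each rule separately, then splice the expressions into case_when
rule_exprs <- lapply(rules, rlang::parse_expr)
test2 <- test %>%
  mutate(new_col = case_when(!!!rule_exprs))
```

With the objects from the answer, the rules vector could be built as paste0(col.cond, ' ~ "', results, '"') without the collapse argument. A malformed rule then fails at its own parse_expr call, which may make the offending row easier to locate.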

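Following your example, I only considered conditions joined with & as the logical operator. If | is also used, the spreadsheet would need an additional column for each data column specifying the logical operator (& or |). If the conditions are more complex and involve parentheses, this approach may not work.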
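
To make the idea of per-column operator columns a little more concrete, here is a minimal sketch for a single rule row. The names conds, ops, and join_conditions are hypothetical, and the layout (one operator per data column, joining that condition to the previous one, with & as the default) is only one way the extra columns could be interpreted.

```r
# Hypothetical rule row: one condition per data column (NA = no condition),
# plus the operator that joins each condition to the previous one.
conds <- c(col1 = "== 1", col2 = '== "A"', col3 = ">= 60", col4 = NA)
ops   <- c(col1 = NA,     col2 = "|",      col3 = "&",     col4 = NA)

join_conditions <- function(conds, ops) {
  keep  <- !is.na(conds)
  terms <- paste(names(conds), conds)[keep]
  ops   <- ops[keep]
  if (length(terms) == 0) return("TRUE")  # no condition: default rule
  out <- terms[1]
  for (i in seq_along(terms)[-1]) {
    op  <- if (is.na(ops[i])) "&" else ops[i]  # & is the assumed default
    out <- paste(out, op, terms[i])
  }
  out
}

joined <- join_conditions(conds, ops)
cat(joined, "\n")
```

Note that R evaluates & before |, so the string produced here means col1 == 1 | (col2 == "A" & col3 >= 60). Any other grouping would need explicit parentheses, which is where this spreadsheet layout reaches its limit.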

Answered 2023-02-10
