How to use ifelse and str_detect across multiple columns

yb3bgrhw  posted on 2023-06-27 in Other

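I have a data frame of ICD-10 codes for people who have died (decedents). Each row corresponds to one decedent, and each decedent can have up to 20 conditions listed as contributing factors to their death. I want to create a new column indicating whether the decedent had any diabetes ICD-10 code (1 for yes, 0 for no). Diabetes codes fall within E10-E14, i.e., a diabetes code must start with one of the strings in the following vector, while the fourth position can take different values: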

diabetes <- c("E10","E11","E12","E13","E14")
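The "must start with" rule can be written as a regular expression anchored at the start of the string. A minimal base R sketch, assuming the codes are plain character strings (`prefix_pattern` and `example_codes` are names made up for illustration):

```r
diabetes <- c("E10", "E11", "E12", "E13", "E14")

# "^" anchors the match at the first character, so only codes that
# begin with one of the prefixes count; the fourth character is free.
prefix_pattern <- paste0("^(", paste(diabetes, collapse = "|"), ")")

example_codes <- c("E112", "I250", "E149", "R092")
is_diabetes <- grepl(prefix_pattern, example_codes)
is_diabetes
```

Without the anchor the pattern would match the prefix anywhere in the string, which happens to be harmless for ICD-10 codes (the letter only appears first) but is worth being explicit about.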

Here is a small, made-up example of what the data looks like:

original <- structure(list(acond1 = c("E112", "I250", "A419", "E149"), acond2 = c("I255", 
"B341", "F179", "F101"), acond3 = c("I258", "B348", "I10", "I10"
), acond4 = c("I500", "E669", "I694", "R092")), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

| acond1 | acond2 | acond3 | acond4 |
| ------ | ------ | ------ | ------ |
| E112 | I255 | I258 | I500 |
| I250 | B341 | B348 | E669 |
| A419 | F179 | I10 | I694 |
| E149 | F101 | I10 | R092 |

This is the result I want:

| acond1 | acond2 | acond3 | acond4 | diabetes |
| ------ | ------ | ------ | ------ | -------- |
| E112 | I255 | I258 | I500 | 1 |
| I250 | B341 | B348 | E669 | 0 |
| A419 | F179 | I10 | I694 | 0 |
| E149 | F101 | I10 | R092 | 1 |

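There are several other posts on this type of problem (e.g., "Using if else on a dataframe across multiple columns" and "Str_detect multiple columns using across"), but I can't seem to put them together. Here is what I have tried so far, without success: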

library(tidyverse)
library(stringr)

#attempt 1
original %>%
  mutate_at(vars(contains("acond")), ifelse(str_detect(.,paste0("^(", 
  paste(diabetes, collapse = "|"), ")")), 1, 0))

#attempt 2
original %>%
  unite(col = "all_conditions", starts_with("acond"), sep = ", ", remove = FALSE) %>%
  mutate(diabetes = if_else(str_detect(.,paste0("^(", paste(diabetes, collapse = "|"), ")")), 1, 0))

Any help would be greatly appreciated.

r1wp621o  #1

library(tidyverse)

diabetes_pattern <- c("E10","E11","E12","E13","E14") %>% 
  str_c(collapse = "|")

original <-
  structure(
    list(
      acond1 = c("E112", "I250", "A419", "E149"),
      acond2 = c("I255", "B341", "F179", "F101"),
      acond3 = c("I258", "B348", "I10", "I10"),
      acond4 = c("I500", "E669", "I694", "R092")
    ),
    row.names = c(NA,-4L),
    class = c("tbl_df", "tbl", "data.frame")
  )

original %>% 
  rowwise() %>% 
  mutate(diabetes = +any(str_detect(string = c_across(everything()), pattern = diabetes_pattern)))
#> # A tibble: 4 x 5
#> # Rowwise: 
#>   acond1 acond2 acond3 acond4 diabetes
#>   <chr>  <chr>  <chr>  <chr>     <int>
#> 1 E112   I255   I258   I500          1
#> 2 I250   B341   B348   E669          0
#> 3 A419   F179   I10    I694          0
#> 4 E149   F101   I10    R092          1

original %>% 
  mutate(diabetes = rowSums(across(.cols = everything(), ~str_detect(.x, diabetes_pattern))))
#> # A tibble: 4 x 5
#>   acond1 acond2 acond3 acond4 diabetes
#>   <chr>  <chr>  <chr>  <chr>     <dbl>
#> 1 E112   I255   I258   I500          1
#> 2 I250   B341   B348   E669          0
#> 3 A419   F179   I10    I694          0
#> 4 E149   F101   I10    R092          1

Created on 2022-01-23 by the reprex package (v2.0.1)
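
One thing to watch in the second version: rowSums() counts the matching columns, so a decedent with two diabetes codes would get 2 rather than 1. A minimal sketch of a capped 0/1 variant, using a made-up two-row tibble (`two_codes` is illustrative and not from the question) and an anchored pattern:

```r
library(dplyr)
library(stringr)

# Hypothetical data: the first decedent has two diabetes codes.
two_codes <- tibble(
  acond1 = c("E112", "I250"),
  acond2 = c("E149", "B341")
)
diabetes_pattern <- "^(E10|E11|E12|E13|E14)"

# rowSums() returns the number of matching columns per row.
counted <- two_codes %>%
  mutate(diabetes = rowSums(across(everything(), ~ str_detect(.x, diabetes_pattern))))

# if_any() returns TRUE if at least one column matches; unary + turns it into 0/1.
flagged <- two_codes %>%
  mutate(diabetes = +if_any(everything(), ~ str_detect(.x, diabetes_pattern)))

counted$diabetes
flagged$diabetes
```

On the question's sample data both versions agree, because no row there has more than one diabetes code.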

vi4fp9gy  #2

Here is a base R approach using apply:

dia <- paste(c("E10","E11","E12","E13","E14"), collapse="|")

df$diabetes <- apply(df, 1, function(x) any(grepl(dia,x)))*1

df
  acond1 acond2 acond3 acond4 diabetes
1   E112   I255   I258   I500        1
2   I250   B341   B348   E669        0
3   A419   F179    I10   I694        0
4   E149   F101    I10   R092        1
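
apply() converts the data frame to a character matrix and then walks it row by row. A column-wise alternative, sketched here under the assumption that every column holds a code, runs grepl() once per column and combines the results with `|` (the names `hits` and `flag` are illustrative):

```r
df <- data.frame(
  acond1 = c("E112", "I250", "A419", "E149"),
  acond2 = c("I255", "B341", "F179", "F101"),
  acond3 = c("I258", "B348", "I10", "I10"),
  acond4 = c("I500", "E669", "I694", "R092")
)
dia <- paste(c("E10", "E11", "E12", "E13", "E14"), collapse = "|")

# One logical vector per column, then an element-wise OR across columns.
hits <- lapply(df, function(col) grepl(dia, col))
flag <- +Reduce(`|`, hits)
flag
```

The result can be assigned back with `df$diabetes <- flag`; with many rows this does far fewer function calls than one per row.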

With dplyr:

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(diabetes=any(grepl(dia,c_across(starts_with("ac"))))*1) %>% 
  ungroup
# A tibble: 4 × 5
  acond1 acond2 acond3 acond4 diabetes
  <chr>  <chr>  <chr>  <chr>     <dbl>
1 E112   I255   I258   I500          1
2 I250   B341   B348   E669          0
3 A419   F179   I10    I694          0
4 E149   F101   I10    R092          1

Data:

df <- structure(list(acond1 = c("E112", "I250", "A419", "E149"), acond2 = c("I255", 
"B341", "F179", "F101"), acond3 = c("I258", "B348", "I10", "I10"
), acond4 = c("I500", "E669", "I694", "R092")), class = "data.frame", row.names = c(NA, 
-4L))

up9lanfz  #3

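If we want to use across, ifelse, and str_detect, we can:

  1. Create the pattern for str_detect with paste and collapse.
  2. mutate across all columns, using an anonymous ~ifelse with the condition, and use .names to control the names of the new columns.
  3. unite the new columns.
  4. Apply the parse_number trick from the readr package.
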
diabetes <- c("E10","E11","E12","E13","E14")

pattern <- paste(diabetes, collapse = "|")

library(tidyverse)

original %>% 
  mutate(across(everything(), ~ifelse(str_detect(., pattern), 1, 0), .names = "new_{col}")) %>% 
  unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  mutate(diabetes = parse_number(New_Col), .keep="unused")
acond1 acond2 acond3 acond4 diabetes
  <chr>  <chr>  <chr>  <chr>     <dbl>
1 E112   I255   I258   I500          1
2 I250   B341   B348   E669          0
3 A419   F179   I10    I694          0
4 E149   F101   I10    R092          1
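
readr documents parse_number() as parsing the first number it finds, so the united string seems to give the right answer only when the diabetes code sits in the first column (as it does in the sample data). A sketch that keeps across(), ifelse() and str_detect() but sums the indicators instead, shown on a made-up tibble (`shifted`, not from the question) where the code appears in the second column:

```r
library(dplyr)
library(stringr)

# Hypothetical data: the first row's diabetes code is in acond2, not acond1.
shifted <- tibble(
  acond1 = c("I250", "E112", "A419"),
  acond2 = c("E149", "I255", "F179")
)
pattern <- paste(c("E10", "E11", "E12", "E13", "E14"), collapse = "|")

result <- shifted %>%
  mutate(
    # 1/0 indicator per column, summed per row, then capped at 1.
    diabetes = as.integer(
      rowSums(across(everything(), ~ ifelse(str_detect(.x, pattern), 1, 0))) > 0
    )
  )
result$diabetes
```

This avoids the intermediate new_* columns and the unite step, and the result does not depend on which column holds the matching code.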

63lcw9qa  #4

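I wanted to add an update to this question, because I found that the accepted dplyr answer takes a long time to execute.
Instead, you can vectorize over both the codes you are looking for and the columns you are searching.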

library(tidyverse)
original <-
  structure(
    list(
      acond1 = c("E112", "I250", "A419", "E149"),
      acond2 = c("I255", "B341", "F179", "F101"),
      acond3 = c("I258", "B348", "I10", "I10"),
      acond4 = c("I500", "E669", "I694", "R092")
    ),
    row.names = c(NA,-4L),
    class = c("tbl_df", "tbl", "data.frame")
  )

# vector for your columns & pattern you are looking for,
# this allows you to add or subtract 
# to a vector for the next portion of code.
dia <- c("acond1", "acond2", "acond3", "acond4")
diabetes_pattern <- c("E10","E11","E12","E13","E14")

identified_diabetes <- original |> 
  mutate(diabetes = +(if_any(any_of(dia), \(x) substr(x, 1,3) %in% c(diabetes_pattern))))

This should return the desired output, but it benchmarks much faster.

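# rowwise version from the earlier answer; the pattern is collapsed to a regex
# here, because `dia` above holds column names rather than the pattern
original %>% 
  rowwise() %>% 
  mutate(diabetes = any(grepl(paste(diabetes_pattern, collapse = "|"), c_across(starts_with("ac")))) * 1) %>% 
  ungroup()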

replications elapsed
100    0.45

original |> 
  mutate(diabetes = +(if_any(any_of(dia), \(x) substr(x, 1,3) %in% c(diabetes_pattern))))

replications elapsed
100    0.14

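Both are fast on this small data set, but it is worth noting that as the data grows (as with the df of >250k rows and ~100 columns I was working with), the latter is the faster way to run this check.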
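
To try the comparison on something larger than four rows, a rough sketch along these lines could be used. The simulated data, the row count, and the names `sim`, `prefixes` and `start_pattern` are all assumptions for illustration; timings will vary by machine, so none are shown.

```r
library(dplyr)

set.seed(1)  # simulated codes, purely illustrative
codes <- c("E112", "E149", "I250", "I10", "A419", "F179", "B341", "R092")
n_rows <- 2000
sim <- as_tibble(
  setNames(
    replicate(10, sample(codes, n_rows, replace = TRUE), simplify = FALSE),
    paste0("acond", 1:10)
  )
)
prefixes <- c("E10", "E11", "E12", "E13", "E14")
start_pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")

# Row-by-row version: one grepl() call per row.
time_rowwise <- system.time(
  res_rowwise <- sim %>%
    rowwise() %>%
    mutate(diabetes = +any(grepl(start_pattern, c_across(starts_with("acond"))))) %>%
    ungroup()
)

# Column-wise version: one vectorized comparison per column.
time_if_any <- system.time(
  res_if_any <- sim %>%
    mutate(diabetes = +if_any(starts_with("acond"),
                              function(x) substr(x, 1, 3) %in% prefixes))
)

# Both approaches should flag the same rows; only the timings differ.
stopifnot(identical(res_rowwise$diabetes, res_if_any$diabetes))
time_rowwise["elapsed"]
time_if_any["elapsed"]
```

Raising `n_rows` and the number of columns should make the gap between the two approaches easier to see.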
