R: Subset rows that contain a specific string in any of multiple columns

x8diyxa7  posted on 2023-03-10  in  Other

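I have a very large dataset and need to subset it so that I keep only the IDs that contain the word "paracetamol" in any of the medication columns (e.g. Medication1, Medication2, Medication3, and so on up to Medication50).
Please help <3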

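# id and other are cut to 4 values so that every column has the same length;
# with 1:10 and LETTERS[1:10], data.frame() stops with a differing-rows error
df <- data.frame(id = paste0("ID", 1:4),
                 Medication1 = c("paracetamol", "ibuprofen", "opiate", "sertraline"),
                 Medication2 = c("Lipitor", "ketamine", "zoloft", "xanax"),
                 Medication3 = c("ibuprofen", "paracetamol", "Zocor", "Zestril"),
                 other = LETTERS[1:4])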

dgenwo3n  #1

One potential solution using dplyr is:

library(dplyr)

df <- data.frame(id = paste0("ID",1:4),
                 Medication1= c("paracetamol", "ibuprofen", "opiate", "sertraline"),
                 Medication2= c("Lipitor", "ketamine", "zoloft", "xanax"),
                 Medication3= c("ibuprofen", "paracetamol", "Zocor", "Zestril"),
                 other= c(LETTERS[1:3], "paracetamol"))
df
#>    id Medication1 Medication2 Medication3       other
#> 1 ID1 paracetamol     Lipitor   ibuprofen           A
#> 2 ID2   ibuprofen    ketamine paracetamol           B
#> 3 ID3      opiate      zoloft       Zocor           C
#> 4 ID4  sertraline       xanax     Zestril paracetamol

# only detect "paracetamol" in "Medication" columns
df %>%
  filter(if_any(.cols = starts_with("Medication"),
                .fns = ~grepl("paracetamol", .x)))
#>    id Medication1 Medication2 Medication3 other
#> 1 ID1 paracetamol     Lipitor   ibuprofen     A
#> 2 ID2   ibuprofen    ketamine paracetamol     B

Created on 2023-03-10 with reprex v2.0.2

To match rows containing either "Paracetamol" or "paracetamol", regardless of case, use ignore.case = TRUE:

df %>%
  filter(if_any(.cols = starts_with("Medication"),
                .fns = ~grepl("paracetamol", .x, ignore.case = TRUE)))
#>    id Medication1 Medication2 Medication3 other
#> 1 ID1 paracetamol     Lipitor   ibuprofen     A
#> 2 ID2   ibuprofen    ketamine paracetamol     B

If you also want rows that list the same active ingredient under a different name:

df %>%
  filter(if_any(.cols = starts_with("Medication"),
                .fns = ~grepl("paracetamol|Tylenol", .x, ignore.case = TRUE)))
#>    id Medication1 Medication2 Medication3 other
#> 1 ID1 paracetamol     Lipitor   ibuprofen     A
#> 2 ID2   ibuprofen    ketamine paracetamol     B
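
With many alternative names, writing the pattern by hand gets error-prone. A minimal sketch, assuming a hypothetical vector drug_names (the names and the modified example data are illustrative, not from the question), builds the alternation with paste() and applies it in base R. The same pattern string could be passed to grepl() inside the filter(if_any(...)) call shown above.

```r
# Illustrative data: "Tylenol" and "PARACETAMOL" are added to show the matching
df <- data.frame(
  id = paste0("ID", 1:4),
  Medication1 = c("paracetamol", "ibuprofen", "Tylenol", "sertraline"),
  Medication2 = c("Lipitor", "ketamine", "zoloft", "xanax"),
  Medication3 = c("ibuprofen", "PARACETAMOL", "Zocor", "Zestril"),
  other = LETTERS[1:4],
  stringsAsFactors = FALSE
)

# Hypothetical list of names for the same active ingredient
drug_names <- c("paracetamol", "Tylenol", "acetaminophen")

# Join the names into one regular expression: "paracetamol|Tylenol|acetaminophen"
pattern <- paste(drug_names, collapse = "|")

# Select only the Medication columns by name
med_cols <- grep("^Medication", names(df), value = TRUE)

# One logical vector per column, then combine them with a row-wise OR
per_col <- lapply(df[med_cols], grepl, pattern = pattern, ignore.case = TRUE)
keep <- Reduce(`|`, per_col)

result <- df[keep, ]
result
```

Note that grepl() matches substrings, so a name in drug_names that also occurs inside another drug's name would match that drug as well.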

If an id spans multiple rows (e.g. ID1 below), it gets more complicated, but one option is:

library(dplyr)
library(tidyr)
df2 <- data.frame(id = paste0("ID",c(1,2,1,3)),
                 Medication1= c("paracetamol", "ibuprofen", "opiate", "sertraline"),
                 Medication2= c("Lipitor", "ketamine", "zoloft", "xanax"),
                 Medication3= c("ibuprofen", "paracetamol", "Zocor", "Zestril"),
                 other= c(LETTERS[1:3], "paracetamol"))
df2
#>    id Medication1 Medication2 Medication3       other
#> 1 ID1 paracetamol     Lipitor   ibuprofen           A
#> 2 ID2   ibuprofen    ketamine paracetamol           B
#> 3 ID1      opiate      zoloft       Zocor           C
#> 4 ID3  sertraline       xanax     Zestril paracetamol

df2 %>%
  pivot_longer(starts_with("Medication"),
               names_to = "medications") %>%
  group_by(id) %>%
  filter(any(value == "paracetamol")) %>%
  pivot_wider(names_from = medications)
#> # A tibble: 3 × 5
#> # Groups:   id [2]
#>   id    other Medication1 Medication2 Medication3
#>   <chr> <chr> <chr>       <chr>       <chr>      
#> 1 ID1   A     paracetamol Lipitor     ibuprofen  
#> 2 ID2   B     ibuprofen   ketamine    paracetamol
#> 3 ID1   C     opiate      zoloft      Zocor
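
As a possible alternative that avoids reshaping, the same result can be sketched in base R by first finding the rows that match and then keeping every row whose id appears among them. This is only an illustration under the same assumptions as df2 above (exact match on "paracetamol", Medication columns only); the helper names med_cols, row_hit and ids_with_hit are made up for this example.

```r
df2 <- data.frame(
  id = paste0("ID", c(1, 2, 1, 3)),
  Medication1 = c("paracetamol", "ibuprofen", "opiate", "sertraline"),
  Medication2 = c("Lipitor", "ketamine", "zoloft", "xanax"),
  Medication3 = c("ibuprofen", "paracetamol", "Zocor", "Zestril"),
  other = c(LETTERS[1:3], "paracetamol"),
  stringsAsFactors = FALSE
)

# Names of the Medication columns only, so that `other` is ignored
med_cols <- grep("^Medication", names(df2), value = TRUE)

# TRUE for each row with an exact "paracetamol" entry in any Medication column
row_hit <- rowSums(df2[med_cols] == "paracetamol", na.rm = TRUE) > 0

# ids that have at least one matching row
ids_with_hit <- unique(df2$id[row_hit])

# Keep all rows belonging to those ids, including their non-matching rows
result <- df2[df2$id %in% ids_with_hit, ]
result
```

Here the second ID1 row is kept even though it contains no paracetamol itself, because another row with the same id does. ID3 is dropped because its only match is in the other column.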

iovurdzv  #2

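In R, you can compare the whole data frame against the word "paracetamol", which gives you a boolean matrix. Since TRUE == 1 and FALSE == 0, you can compute rowSums and then keep the rows whose sum is greater than zero. Note that == tests for an exact match of the whole cell, and that every column is checked, not only the Medication columns.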

df[rowSums(df == 'paracetamol') > 0, ]
#    id Medication1 Medication2 Medication3 other
# 1 ID1 paracetamol     Lipitor   ibuprofen     A
# 2 ID2   ibuprofen    ketamine paracetamol     B

If the data contain NA, use rowSums(., na.rm = TRUE).
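
To make the NA handling concrete, and to restrict the comparison to the Medication columns, here is a small sketch. The data are a modified copy of the example (an NA in Medication1 and "paracetamol" in other are added for illustration), and med_cols and hits are hypothetical helper names.

```r
df <- data.frame(
  id = paste0("ID", 1:4),
  Medication1 = c("paracetamol", NA, "opiate", "sertraline"),
  Medication2 = c("Lipitor", "ketamine", "zoloft", "xanax"),
  Medication3 = c("ibuprofen", "paracetamol", "Zocor", "Zestril"),
  other = c("A", "B", "C", "paracetamol"),
  stringsAsFactors = FALSE
)

# Compare only the Medication columns, so a match in `other` does not count
med_cols <- grep("^Medication", names(df), value = TRUE)

# Comparing with NA yields NA; without na.rm = TRUE the sum for that row is NA too
hits <- rowSums(df[med_cols] == "paracetamol", na.rm = TRUE) > 0

result <- df[hits, ]
result
```

ID2 is kept despite the NA in Medication1 because Medication3 matches, and ID4 is dropped because its only match is outside the Medication columns.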

Data:
df <- structure(list(id = c("ID1", "ID2", "ID3", "ID4"), Medication1 = c("paracetamol", 
"ibuprofen", "opiate", "sertraline"), Medication2 = c("Lipitor", 
"ketamine", "zoloft", "xanax"), Medication3 = c("ibuprofen", 
"paracetamol", "Zocor", "Zestril"), other = c("A", "B", "C", 
"D")), class = "data.frame", row.names = c(NA, -4L))
