Regex: stop a match from following a particular character

ghhaqwfi  posted on 2023-08-08 in Other

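I am trying to extract some antimicrobial data from a laboratory database.
The organism name code and the sensitivity/resistance pattern are both held in a single string under result_text.
The organism code is always 6 uppercase letters long and is always preceded by an asterisk.
Antimicrobial names are usually 3 uppercase letters, followed by R or r (resistant), S or s (sensitive), and so on for the other codes.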

library(tidyverse)

df <- tibble(
  sample_date_time = as_datetime("2021-01-01 03:00:00"),
  sample_no = "BC11001",
  result_text = "[<*ESCCOL/2/I   /AMPR/>=32/AUGR/16/>]"
)

df %>%
  separate_longer_delim(result_text,  delim = "/") %>% 
  mutate(
    org = str_extract(result_text, "\\*([A-Z]{6})", group = 1),
    antimicrobial = str_extract(result_text, "(?<!\\*)[A-Z]{3}"),
    sens = str_extract(result_text, "([A-Z]{3})(R|r|S|s|I|i|P|p|N|n)", group = 2)
  ) 
#> # A tibble: 8 × 6
#>   sample_date_time    sample_no result_text org    antimicrobial sens 
#>   <dttm>              <chr>     <chr>       <chr>  <chr>         <chr>
#> 1 2021-01-01 03:00:00 BC11001   "[<*ESCCOL" ESCCOL SCC           <NA> 
#> 2 2021-01-01 03:00:00 BC11001   "2"         <NA>   <NA>          <NA> 
#> 3 2021-01-01 03:00:00 BC11001   "I   "      <NA>   <NA>          <NA> 
#> 4 2021-01-01 03:00:00 BC11001   "AMPR"      <NA>   AMP           R    
#> 5 2021-01-01 03:00:00 BC11001   ">=32"      <NA>   <NA>          <NA> 
#> 6 2021-01-01 03:00:00 BC11001   "AUGR"      <NA>   AUG           R    
#> 7 2021-01-01 03:00:00 BC11001   "16"        <NA>   <NA>          <NA> 
#> 8 2021-01-01 03:00:00 BC11001   ">]"        <NA>   <NA>          <NA>

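For example, I am trying to extract the AMP from AMPR in row 4 into a separate column (antimicrobial).
However, if the string follows the *[A-Z]{6} pattern, such as *ESCCOL in row 1, I want the regex to skip the match.
I thought a negative lookbehind on * would solve this, but as you can see in the reprex, SCC is still extracted into the antimicrobial column, where I actually need NA.
Can anyone tell me where I am going wrong? I have read a lot of posts and websites and am still a bit confused.
Created on 2023-07-31 with reprex v2.0.2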

k7fdbhmy1#

You can use ifelse() with test = !str_detect(result_text, "\\*") so that the three letters are extracted only when result_text contains no asterisk.

df %>%
  separate_longer_delim(result_text,  delim = "/") %>%
  mutate(
    org = str_extract(result_text, "\\*([A-Z]{6})", group = 1),
    antimicrobial = ifelse(
      !str_detect(result_text, "\\*"),
      str_extract(result_text, "[A-Z]{3}"),
      NA
    ),
    sens = str_extract(result_text, "([A-Z]{3})(R|r|S|s|I|i|P|p|N|n)", group = 2)
  )
#> # A tibble: 8 × 6
#>   sample_date_time    sample_no result_text org    antimicrobial sens 
#>   <dttm>              <chr>     <chr>       <chr>  <chr>         <chr>
#> 1 2021-01-01 03:00:00 BC11001   "[<*ESCCOL" ESCCOL <NA>          <NA> 
#> 2 2021-01-01 03:00:00 BC11001   "2"         <NA>   <NA>          <NA> 
#> 3 2021-01-01 03:00:00 BC11001   "I   "      <NA>   <NA>          <NA> 
#> 4 2021-01-01 03:00:00 BC11001   "AMPR"      <NA>   AMP           R    
#> 5 2021-01-01 03:00:00 BC11001   ">=32"      <NA>   <NA>          <NA> 
#> 6 2021-01-01 03:00:00 BC11001   "AUGR"      <NA>   AUG           R    
#> 7 2021-01-01 03:00:00 BC11001   "16"        <NA>   <NA>          <NA> 
#> 8 2021-01-01 03:00:00 BC11001   ">]"        <NA>   <NA>          <NA>
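
As for why the negative lookbehind in the question still returns SCC: a lookbehind only checks the character immediately before the position where a match starts. The attempt starting at E is rejected, but the engine then retries at S, which is preceded by E rather than *, so SCC matches. Here is a minimal sketch on a plain character vector. The `tokens` vector and the `[*A-Z]` variant are illustrative and not taken from the question; the variant assumes it is acceptable to also reject any match that starts in the middle of a run of uppercase letters.

```r
library(stringr)

# Hypothetical tokens mirroring a few of the split values from the question
tokens <- c("[<*ESCCOL", "AMPR", "AUGR")

# Original pattern: the match may start at "S", because the character
# before "S" is "E", not "*".
lookbehind <- str_extract(tokens, "(?<!\\*)[A-Z]{3}")
lookbehind
#> [1] "SCC" "AMP" "AUG"

# Variant: also forbid an uppercase letter before the match, so the engine
# cannot slide one character to the right inside "ESCCOL".
lookbehind_strict <- str_extract(tokens, "(?<![*A-Z])[A-Z]{3}")
lookbehind_strict
#> [1] NA    "AMP" "AUG"
```

This keeps the lookbehind approach from the question; whether it is preferable to the ifelse() test depends on how varied the real result_text values are.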

qpgpyjmq2#

Use str_extract. Note that I modified the example to add lowercase sens codes to the string.

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  separate_longer_delim(result_text, "/") %>% 
  mutate(org = str_extract(result_text, "\\*.*([A-Z]{6})", group = 1), 
         antimicrobial = str_extract(result_text, "^[A-Z]{3}"),
         sens = str_extract(result_text, "^[A-Z]{3}([RrSsIiPpNn])", group = 1))
# A tibble: 15 × 6
   sample_date_time    sample_no result_text org    antimicrobial sens 
   <dttm>              <chr>     <chr>       <chr>  <chr>         <chr>
 1 2021-01-01 03:00:00 BC11001   "[<*ESCCOL" ESCCOL NA            NA   
 2 2021-01-01 03:00:00 BC11001   "2"         NA     NA            NA   
 3 2021-01-01 03:00:00 BC11001   "I   "      NA     NA            NA   
 4 2021-01-01 03:00:00 BC11001   "AMPR"      NA     AMP           R    
 5 2021-01-01 03:00:00 BC11001   ">=32"      NA     NA            NA   
 6 2021-01-01 03:00:00 BC11001   "AUGR"      NA     AUG           R    
 7 2021-01-01 03:00:00 BC11001   "16"        NA     NA            NA   
 8 2021-01-01 03:00:00 BC11001   "><*ESCCOL" ESCCOL NA            NA   
 9 2021-01-01 03:00:00 BC11001   "2"         NA     NA            NA   
10 2021-01-01 03:00:00 BC11001   "I   "      NA     NA            NA   
11 2021-01-01 03:00:00 BC11001   "AMPs"      NA     AMP           s    
12 2021-01-01 03:00:00 BC11001   ">=32"      NA     NA            NA   
13 2021-01-01 03:00:00 BC11001   "AUGr"      NA     AUG           r    
14 2021-01-01 03:00:00 BC11001   "16"        NA     NA            NA   
15 2021-01-01 03:00:00 BC11001   ">]"        NA     NA            NA
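
If every antimicrobial token is exactly three uppercase letters followed by a single sensitivity letter (an assumption; the trailing `$` would reject anything longer), both pieces could also be captured in one pass with str_match(). A rough sketch, using a hypothetical `tokens` vector that mirrors some of the split values above:

```r
library(stringr)

# Hypothetical tokens, as they would look after separate_longer_delim()
tokens <- c("[<*ESCCOL", "2", "I   ", "AMPR", ">=32", "AUGr", "16", ">]")

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups.
m <- str_match(tokens, "^([A-Z]{3})([RrSsIiPpNn])$")

antimicrobial <- m[, 2]
sens <- m[, 3]

antimicrobial
#> [1] NA    NA    NA    "AMP" NA    "AUG" NA    NA
sens
#> [1] NA  NA  NA  "R" NA  "r" NA  NA
```

Because the pattern is anchored at both ends, the organism token never matches, so no lookbehind or ifelse() is needed for these inputs.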

Data

df <- structure(list(sample_date_time = structure(1609470000, class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), sample_no = "BC11001", 
result_text = "[<*ESCCOL/2/I   /AMPR/>=32/AUGR/16/><*ESCCOL/2/I   /AMPs/>=32/AUGr/16/>]"), 
row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
