regex: Extracting semicolon-separated strings in R

1szpjjfi posted on 2023-08-08 in Other

I have a column of researcher affiliations and want to extract only the affiliations from Brazil.
Here is an example:

"Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; School of Health Sciences, Faculty of Medicine and Health Sciences, University of East Anglia, Norwich, United Kingdom; Human Cognitive Neuroscience, Psychology, University of Edinburgh, Edinburgh, United Kingdom; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

The expected result is:

"Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

bgtovc5b  #1

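You can split the string on the semicolons, keep the pieces that contain "Brazil", and paste them back together: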

str <- "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; School of Health Sciences, Faculty of Medicine and Health Sciences, University of East Anglia, Norwich, United Kingdom; Human Cognitive Neuroscience, Psychology, University of Edinburgh, Edinburgh, United Kingdom; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

# Split on ";", keep the pieces that mention Brazil, then rejoin them with ";".
strsplit(str, ";", fixed = TRUE)[[1]] |>
  grep(pattern = "Brazil", value = TRUE, fixed = TRUE) |>
  paste(collapse = ";")
#> [1] "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

Created on 2023-08-05 with reprex v2.0.2
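
Since the question mentions a whole column of such strings, the same split, filter and paste idea can be wrapped in a function and applied element by element. The following is only a sketch in base R: `keep_country`, the data frame `df` and its `affiliation` column are hypothetical names, and the example values are made up. Unlike the snippet above, it trims each piece and rejoins with "; ", and it passes `NA` through unchanged.

```r
# Hypothetical helper: keep only the ";"-separated entries that mention `country`.
keep_country <- function(x, country = "Brazil") {
  vapply(
    strsplit(x, ";", fixed = TRUE),
    function(parts) {
      # A missing input stays missing.
      if (length(parts) == 1L && is.na(parts)) return(NA_character_)
      parts <- trimws(parts)
      paste(parts[grepl(country, parts, fixed = TRUE)], collapse = "; ")
    },
    character(1)
  )
}

# Made-up example column (not the data from the question).
df <- data.frame(
  affiliation = c(
    "Dept A, University of Campinas, Campinas, Brazil; Dept B, University of Edinburgh, Edinburgh, United Kingdom",
    "Dept C, University of East Anglia, Norwich, United Kingdom",
    NA
  ),
  stringsAsFactors = FALSE
)

df$brazil_only <- keep_country(df$affiliation)
df$brazil_only
```

Rows without any matching entry come back as an empty string, so they are easy to filter out afterwards.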

lztngnrs  #2

You can also use a regular expression to remove the parts of the string that do not contain 'Brazil':

gsub("(^|;)((?![^;]+Brazil).)+", "\\1", str, perl = TRUE)

[1] "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

where:

str <- "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; School of Health Sciences, Faculty of Medicine and Health Sciences, University of East Anglia, Norwich, United Kingdom; Human Cognitive Neuroscience, Psychology, University of Edinburgh, Edinburgh, United Kingdom; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"
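
One thing to watch with the `gsub` pattern above: when the non-Brazil entry is the first or the last one in the string, a stray leading space or trailing ";" is left behind. The sketch below shows this on two made-up inputs and cleans up with a small helper; `pat`, `a`, `b` and `tidy` are names introduced here for illustration only.

```r
# Pattern from the answer above, stored in a variable for reuse.
pat <- "(^|;)((?![^;]+Brazil).)+"

# Made-up inputs: the non-Brazil entry comes first in `a` and last in `b`.
a <- "Dept X, University of Edinburgh, Edinburgh, United Kingdom; Dept Y, University of Campinas, Campinas, Brazil"
b <- "Dept Y, University of Campinas, Campinas, Brazil; Dept X, University of Edinburgh, Edinburgh, United Kingdom"

raw_a <- gsub(pat, "\\1", a, perl = TRUE)  # result starts with a space
raw_b <- gsub(pat, "\\1", b, perl = TRUE)  # result ends with ";"

# Hypothetical cleanup: strip leftover separators and whitespace at both ends.
tidy <- function(x) gsub("^[;\\s]+|[;\\s]+$", "", x, perl = TRUE)

tidy(raw_a)
tidy(raw_b)
```

In the example from the question the string both starts and ends with a Brazil entry, so this cleanup step is not needed there.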

Edit

To keep only the part of each affiliation from the university name onward:

gsub("(^|;)((?!University[^;]+Brazil).)+", "\\1", str, perl = TRUE)

[1] "University of São Paulo, São Paulo, Brazil;University of São Paulo, São Paulo, Brazil;University Center of United Metropolitan Colleges, São Paulo, Brazil;University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil;University of Campinas (UNICAMP), Campinas, Brazil;University of Campinas, Campinas, Brazil"


Using stringr:

library(stringr)
str_c(unlist(str_extract_all(str, "University[^;]+Brazil")), collapse=';')
[1] "University of São Paulo, São Paulo, Brazil;University of São Paulo, São Paulo, Brazil;University Center of United Metropolitan Colleges, São Paulo, Brazil;University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil;University of Campinas (UNICAMP), Campinas, Brazil;University of Campinas, Campinas, Brazil"


The stringr equivalent of the first expression:

str_c(unlist(str_extract_all(str, "[^;]+Brazil")), collapse=';')
[1] "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

3lxsmp7m  #3

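We can use the stringr package. First, split the string on the ";" character with str_split_1. Second, filter the resulting character vector with str_subset. Finally, "flatten" the pieces back into a single string with str_flatten.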

library(stringr)  # str_split_1() needs stringr >= 1.5.0

authors <- "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; School of Health Sciences, Faculty of Medicine and Health Sciences, University of East Anglia, Norwich, United Kingdom; Human Cognitive Neuroscience, Psychology, University of Edinburgh, Edinburgh, United Kingdom; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

authors |> 
    str_split_1('; *') |> 
    str_subset('Brazil') |>
    str_flatten(collapse = ';')

[1] "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil;Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil;Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil;Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil;Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil;Neurology Department, University of Campinas, Campinas, Brazil"
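
A plain substring match on "Brazil" also keeps entries that merely mention the word, for instance an institute with "Brazilian" in its name located elsewhere. If the country is always the last comma-separated field, anchoring the pattern to the end of each piece is stricter. The sketch below uses base R so it runs without extra packages, and the input `x` is made up; the same regular expression should also work as the pattern for `str_subset`.

```r
# Made-up input: the first entry mentions "Brazilian" but is located in Portugal.
x <- "Brazilian Studies Centre, University of Lisbon, Lisbon, Portugal; Neurology Department, University of Campinas, Campinas, Brazil"

# Split on ";" and drop the surrounding whitespace of each piece.
parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])

# Plain substring match: keeps both entries.
loose <- parts[grepl("Brazil", parts, fixed = TRUE)]

# Anchored match: "Brazil" must be the last comma-separated field.
strict <- parts[grepl(",\\s*Brazil$", parts, perl = TRUE)]

paste(strict, collapse = ";")
```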

cbwuti44  #4

Try scan + grep, as shown below:

paste0(
    grep("Brazil",
         # quiet = TRUE suppresses scan()'s "Read N items" message
         scan(text = s, sep = ";", what = "", quiet = TRUE),
         value = TRUE),
    collapse = ";"
)

which gives:

[1] "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"

Data

s <- "Department of Neurology, Faculty of Medicine, University of São Paulo, São Paulo, Brazil; Gerontology, School of Arts, Sciences, and Humanities, University of São Paulo, São Paulo, Brazil; Graduate Program in Applied Statistics, University Center of United Metropolitan Colleges, São Paulo, Brazil; School of Health Sciences, Faculty of Medicine and Health Sciences, University of East Anglia, Norwich, United Kingdom; Human Cognitive Neuroscience, Psychology, University of Edinburgh, Edinburgh, United Kingdom; Neurology Division, University Hospital, Federal University of Minas Gerais, Belo Horizonte, Brazil; Neuroimaging Laboratory, School of Medical Sciences, University of Campinas (UNICAMP), Campinas, Brazil; Neurology Department, University of Campinas, Campinas, Brazil"
