regex: Check whether a string appears in another string in R

j2qf4p5b posted on 2023-01-14 in Other

I have a tibble containing sentences like these:

df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."))

And another tibble containing a long list of names:

names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))

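I want to check whether each sentence contains one of the names in the list, and add a column indicating whether it does, so that I end up with the following tibble: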

wanted_df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."), check = c(TRUE, FALSE, TRUE))

So far I have tried this, without success:

df <- df %>%
mutate(check = grepl(pattern = names$names, x = df$sentences, fixed = TRUE))

And:

check <- str_detect(names$names %in% df$sentences)

Thank you very much for your help :)


jhkqcmku1#

You should build a single regular expression for grepl by joining the names with "|":

df %>% 
  mutate(check = grepl(paste(names$names, collapse = "|"), sentences))

# A tibble: 3 × 2
  sentences                    check
  <chr>                        <lgl>
1 Bob is looking for something TRUE 
2 Adriana has an umbrella      FALSE
3 Michael is looking at...     TRUE
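
One caveat with a pasted pattern like the one above: each name is interpreted as a regular expression, so the "." in "Etc." matches any character, and "Bob" also matches inside a longer word such as "Bobcat". The following is a minimal sketch, assuming you want literal, whole-word matches. The helper `escape_regex` and the fourth sentence are hypothetical additions for illustration, not part of the original question.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...",
               "Bobcat sightings are up")   # extra row, added for illustration
nms <- c("Bob", "Mary", "Michael", "John", "Etc.")

# Hypothetical helper: backslash-escape regex metacharacters
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Plain alternation: "Bob" also matches inside "Bobcat"
naive <- grepl(paste(nms, collapse = "|"), sentences)

# Escaped names, wrapped in lookarounds so a match cannot sit inside a word
pattern <- paste0("(?<!\\w)(", paste(escape_regex(nms), collapse = "|"), ")(?!\\w)")
check <- grepl(pattern, sentences, perl = TRUE)

print(data.frame(sentences, naive, check))
```

The lookarounds need `perl = TRUE`. They are used here instead of `\\b` because a name ending in punctuation, such as "Etc.", has no word boundary after its final character.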

euoag5mw2#

Here is a base R solution.

inx <- sapply(names$names, \(pat) grepl(pat, df$sentences))
inx
#>        Bob  Mary Michael  John  Etc.
#> [1,]  TRUE FALSE   FALSE FALSE FALSE
#> [2,] FALSE FALSE   FALSE FALSE FALSE
#> [3,] FALSE FALSE    TRUE FALSE FALSE

inx <- rowSums(inx) > 0L
df$check <- inx
df
#> # A tibble: 3 × 2
#>   sentences                    check
#>   <chr>                        <lgl>
#> 1 Bob is looking for something TRUE 
#> 2 Adriana has an umbrella      FALSE
#> 3 Michael is looking at...     TRUE

Created on 2023-01-11 with reprex v2.0.2
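
A possible variation on the same idea, sketched under the assumption that the names should be matched literally: combining the per-name results with `Reduce` avoids the intermediate matrix, so it also works when there is only one sentence (where `sapply` would return a plain vector and `rowSums` would fail). The function name `contains_any` is hypothetical.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...")
nms <- c("Bob", "Mary", "Michael", "John", "Etc.")

# Hypothetical helper: TRUE where a sentence contains at least one name.
# fixed = TRUE treats each name as a literal string, not a regex.
contains_any <- function(x, patterns) {
  hits <- lapply(patterns, grepl, x = x, fixed = TRUE)  # one logical vector per name
  Reduce(`|`, hits)                                     # element-wise OR across names
}

check <- contains_any(sentences, nms)
single <- contains_any("Mary sings", nms)  # a single sentence works too
print(check)
```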


wqsoz72f3#

grep and family expect pattern= to be of length 1. Similarly, str_detect needs *strings* of the same length, not a logical vector, so neither attempt works as written.
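
To make that concrete, here is a small sketch of what the two attempts from the question actually compute, using plain character vectors in place of the tibble columns. The variable names are illustrative.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...")
nms <- c("Bob", "Mary", "Michael", "John", "Etc.")

# grepl() uses only the first pattern ("Bob") and warns about the rest,
# so "Michael" is never tested against the third sentence.
first_only <- suppressWarnings(grepl(nms, sentences, fixed = TRUE))

# %in% tests whole-string equality, and no sentence is exactly equal to a name.
# The result is a logical vector, which is not a string for str_detect to search.
exact <- nms %in% sentences

print(first_only)
print(exact)
```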
We have a few options:

  • sapply over the names (which yields a matrix) and check whether each row has one or more matches:
df %>%
  mutate(check = rowSums(sapply(names$names, grepl, sentences)) > 0)
# # A tibble: 3 × 2
#   sentences                    check
#   <chr>                        <lgl>
# 1 Bob is looking for something TRUE 
# 2 Adriana has an umbrella      FALSE
# 3 Michael is looking at...     TRUE

(I now see that this is the same as Rui Barradas's answer.)

  • Use fuzzyjoin to perform a fuzzy join on the data:
df %>%
  fuzzyjoin::regex_left_join(names, by = c(sentences = "names")) %>%
  mutate(check = !is.na(names))
# # A tibble: 3 × 3
#   sentences                    names   check
#   <chr>                        <chr>   <lgl>
# 1 Bob is looking for something Bob     TRUE 
# 2 Adriana has an umbrella      NA      FALSE
# 3 Michael is looking at...     Michael TRUE

One advantage of this approach is that it tells you which pattern (in names) produced the match.
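
If installing fuzzyjoin is not an option, a rough base R equivalent can recover the matched name with `regexpr` and `regmatches`. This is only a sketch: it reports the first match per sentence rather than one row per match, and it treats the names as regular expressions, just as the join above does.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...")
nms <- c("Bob", "Mary", "Michael", "John", "Etc.")

pattern <- paste(nms, collapse = "|")
m <- regexpr(pattern, sentences)          # position of first match, -1 if none

matched <- rep(NA_character_, length(sentences))
matched[m > 0] <- regmatches(sentences, m)  # regmatches() returns matched text only

result <- data.frame(sentences, names = matched, check = !is.na(matched))
print(result)
```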


bxpogfeg4#

Perhaps we can try adist + colSums, as shown below:

df %>%
  mutate(check = colSums(adist(names$names, sentences, fixed = FALSE) == 0) > 0)

which gives

# A tibble: 3 × 2
  sentences                    check
  <chr>                        <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella      FALSE
3 Michael is looking at...     TRUE
