R: Extract matching words from strings in order

Posted by 9jyewag0 on 2023-02-14 in Other

Suppose I have two strings like these:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

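Is there a simple way to compare the words from left to right, build a new string from the words that match, and stop at the first word that no longer matches, so that the output looks like this: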

> "Here is a"

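I don't want every word the two strings have in common, only the words that match in sequence from the start. So although "words and stuff." appears in both strings, I don't want it selected.

Source: https://stackoverflow.com/questions/75440694/extract-matching-words-from-strings-in-order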

4 Answers

#1 tvz2xvvm

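Split the strings, compute the minimum length of the two splits, and take that many words from the head of each. Compare the corresponding words and append FALSE so that a non-match is guaranteed to exist even when every compared word matches. Then use which.min to find the position of the first non-match, take that number minus 1 words, and paste them back together.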

L <- strsplit(c(x, y), " +")
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"
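To see why the trailing FALSE matters, it may help to wrap the same steps in a function and try the case where one string is a word-for-word prefix of the other. This is only a sketch: the name `common_prefix_words` is hypothetical and not part of the answer.

```r
# Hypothetical wrapper around the steps above; the function name is an assumption.
common_prefix_words <- function(x, y) {
  L <- strsplit(c(x, y), " +")
  n <- min(lengths(L))                       # compare only the overlapping part
  eq <- head(L[[1]], n) == head(L[[2]], n)   # word-by-word equality
  # Appending FALSE guarantees which.min() finds a FALSE even if all words match
  first_mismatch <- which.min(c(eq, FALSE))
  paste(head(L[[1]], first_mismatch - 1), collapse = " ")
}

common_prefix_words("Here is a test", "Here is a better test")
## [1] "Here is a"
common_prefix_words("Here is", "Here is a test")   # all overlapping words match
## [1] "Here is"
common_prefix_words("No", "match at all")
## [1] ""
```

Without the appended FALSE, the second call would compare only TRUE values, which.min would return 1, and the result would be an empty string instead of "Here is".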

Posted 2023-02-14

#2 70gysomp

You can write a helper function to do the check for you:

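common_start <- function(x, y) {
  i <- 1
  last <- NA  # position of the last matching separator seen so far
  # Walk both strings in step; stop at the end of the shorter one
  while (i <= nchar(x) && i <= nchar(y)) {
    if (substr(x, i, i) == substr(y, i, i)) {
      # Remember word boundaries (whitespace or punctuation) that still match
      if (grepl("[[:space:][:punct:]]", substr(x, i, i), perl = TRUE)) {
        last <- i
      }
    } else {
      break  # first mismatch: no need to scan further
    }
    i <- i + 1
  }
  if (!is.na(last)) {
    substr(x, 1, last - 1)  # everything before the last matching separator
  } else {
    NA
  }
}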

Applying it to the sample strings:

common_start(x,y)
# [1] "Here is a"

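The idea is to check each character in turn, keeping track of the last non-word character that still matched. A while loop may not be fancy, but it does mean you can break out as soon as a mismatch is found instead of processing the whole string.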
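The same early-exit idea can also be applied to whole words rather than single characters. The following is a minimal sketch under that assumption; `common_start_words` is a hypothetical name, and unlike the function above it returns an empty string rather than NA when nothing matches.

```r
# Sketch: early-exit loop over words instead of characters.
# `common_start_words` is a hypothetical name, not part of the answer.
common_start_words <- function(x, y) {
  xw <- strsplit(x, "[[:space:]]+")[[1]]
  yw <- strsplit(y, "[[:space:]]+")[[1]]
  n <- min(length(xw), length(yw))
  i <- 0
  # Advance while the next pair of words is identical; stop at the first mismatch
  while (i < n && xw[i + 1] == yw[i + 1]) {
    i <- i + 1
  }
  paste(head(xw, i), collapse = " ")
}

common_start_words("Here is a test of words and stuff.",
                   "Here is a better test of words and stuff.")
# [1] "Here is a"
```

Comparing whole words avoids having to track separators, at the cost of splitting both strings up front.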

Posted 2023-02-14

#3 cnjp1d6j

This shows the first n words that match:

xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here"   "is"     "a"      "test"   "of"     "words"  "and"    "stuff."
# [[2]]
# [1] "Here"   "is"     "a"      "better" "test"   "of"     "words"  "and"   
cumsum(head(xvec, len) != head(yvec, len))
# [1] 0 0 0 1 2 3 4 5
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is"   "a"   
# [[2]]
# [1] "Here" "is"   "a"

From here, we can easily derive the leading string:

paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"

And the remaining string:

paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
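One case the snippet above does not cover is when every compared word matches: the comparison vector then contains no TRUE, `which.max` returns 1, and `1:(i-1)` becomes `1:0`, which counts down instead of selecting nothing. A possible guard is sketched below, assuming the whole overlap should be returned in that case; the variable names `mismatch`, `first_bad` and `n_common` are illustrative only.

```r
xvec <- strsplit("Here is a", " +")[[1]]
yvec <- strsplit("Here is a test", " +")[[1]]
len <- min(length(xvec), length(yvec))
mismatch <- head(xvec, len) != head(yvec, len)

# match(TRUE, ...) gives the position of the first mismatch, or NA if there is none
first_bad <- match(TRUE, mismatch)
n_common <- if (is.na(first_bad)) len else first_bad - 1

# seq_len() is safe when n_common is 0, unlike 1:n_common
paste(xvec[seq_len(n_common)], collapse = " ")
# [1] "Here is a"
```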

Posted 2023-02-14

#4 jmo0nnb3

I wrote a function that checks the strings and returns the desired output:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"

library(purrr)

check_str <- function(inp, pat, delimiter = "\\s") {

  inp <- unlist(strsplit(inp, delimiter))
  pat <- unlist(strsplit(pat, delimiter))
  ln_diff <- length(inp) - length(pat)
  
  if (ln_diff < 0) {
    inp <- append(inp, rep("", abs(ln_diff)))
  }
  if (ln_diff > 0) {
    pat <- append(pat, rep("", abs(ln_diff)))
  }
  
  idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
  rle_idx <- rle(idx)
  
  if (rle_idx$values[1]) {
    idx2 <- seq_len(rle_idx$lengths[1])
  } else {
    idx2 <- 0
  }
  
  paste0(inp[idx2], collapse = delimiter)
}

check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""

Created on 2023-02-13 with reprex v2.0.2
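The key step in `check_str` is `rle()`, which collapses the logical comparison vector into runs; the first run tells how many leading words agree. A small base R illustration follows. The vector `idx` is written out by hand here to mirror the padded word-by-word comparison of x and y, and `runs` and `n_leading` are illustrative names.

```r
# Word-by-word comparison of x and y after padding: 3 matches, then mismatches
idx <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

runs <- rle(idx)
runs$lengths
#> [1] 3 6
runs$values
#> [1]  TRUE FALSE

# The leading run counts only if it is a run of TRUE values
n_leading <- if (runs$values[1]) runs$lengths[1] else 0L
n_leading
#> [1] 3
```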

Posted 2023-02-14
