R语言 按顺序从字符串中提取匹配的单词

9jyewag0  于 2023-02-14  发布在  其他
关注(0)|答案(4)|浏览(185)

如果我有两个这样的字符串:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

是否有一种简单的方法可以从左到右检查单词,并创建一个新的匹配单词字符串,然后在单词不再匹配时停止,这样输出将如下所示:

> "Here is a"

我不想找到两个字符串之间所有匹配的单词,而只想找到按顺序匹配的单词,所以“words and stuff.”在两个字符串中,但我不想选择它。

tvz2xvvm

tvz2xvvm1#

拆分字符串,计算两个拆分的最小长度,从每个拆分的开头取相应的单词数,并附加FALSE以确保在匹配相应的单词时可能出现不匹配,然后使用which.min查找第一个不匹配的单词,并将该数字减去1,然后粘贴在一起。

L <- strsplit(c(x, y), " +")
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"
70gysomp

70gysomp2#

您可以编写一个helper函数来为您执行检查

common_start<-function(x, y) {
  i <- 1
  last <- NA
  while (i <= nchar(x) & i <= nchar(x)) {
    if (substr(x,i,i) == substr(y,i,i)) {
      if (grepl("[[:space:][:punct:]]", substr(x,i,i), perl=T)) {
        last <- i
      }
    } else {
      break;
    }
    i <- i + 1
  }
  if (!is.na(last)) {
    substr(x, 1, last-1)
  } else {
    NA
  }
}

用它来搅拌样品

common_start(x,y)
# [1] "Here is a"

其思想是检查每一个字符,跟踪最后一个仍然匹配的非单词字符。使用while循环可能并不花哨,但它确实意味着一旦发现不匹配,你可以提前中断,而不必处理整个字符串。

cnjp1d6j

cnjp1d6j3#

这将显示匹配的前n个单词:

xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here"   "is"     "a"      "test"   "of"     "words"  "and"    "stuff."
# [[2]]
# [1] "Here"   "is"     "a"      "better" "test"   "of"     "words"  "and"   
cumsum(head(xvec, len) != head(yvec, len))
# [1] 0 0 0 1 2 3 4 5
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is"   "a"   
# [[2]]
# [1] "Here" "is"   "a"

从这里,我们可以很容易地推导出前导字符串:

paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"

剩下的字符串

paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
jmo0nnb3

jmo0nnb34#

我写了一个函数来检查字符串并返回所需的输出:

x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"

library(purrr)

check_str <- function(inp, pat, delimiter = "\\s") {

  inp <- unlist(strsplit(inp, delimiter))
  pat <- unlist(strsplit(pat, delimiter))
  ln_diff <- length(inp) - length(pat)
  
  if (ln_diff < 0) {
    inp <- append(inp, rep("", abs(ln_diff)))
  }
  if (ln_diff > 0) {
    pat <- append(pat, rep("", abs(ln_diff)))
  }
  
  idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
  rle_idx <- rle(idx)
  
  if (rle_idx$values[1]) {
    idx2 <- seq_len(rle_idx$length[1])
  } else {
    idx2 <- 0
  }
  
  paste0(inp[idx2], collapse = delimiter)
}

check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""

创建于2023年2月13日,使用reprex v2.0.2

相关问题