regex: How to find the longest consecutive run of characters in a string, based on a given vector

Posted by f2uvfpb9 on 2023-02-05 in Other

I have the following string in my R code.

aas <- "QAWDIIKRIDKK"

I want to find the longest consecutive fragments of the string that consist only of characters from the vector below:

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

The expected answer is:

AW, II

Another example:

QFILVMD -> FILVM

I need this to be very fast, because many strings have to be tested. How can I do this in R?


8ljdwjyq1#

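One option: split the string into characters, replace the elements that do not match the key vector with NA, group the characters by the runs that the NAs create, paste each group back together, and keep the elements whose character count equals the max.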

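f1 <- function(str1, matchvec) {
  # split into single characters and blank out the non-matching ones
  v1 <- strsplit(str1, "")[[1]]
  v1[!v1 %in% matchvec] <- NA
  # label each run of NA / non-NA values, then paste each run back together
  v2 <- tapply(v1, with(rle(!is.na(v1)),
                        rep(seq_along(values), lengths)),
               FUN = function(x) paste(x[!is.na(x)], collapse = ""))
  # keep the fragment(s) of maximal length
  unname(v2[nchar(v2) == max(nchar(v2))])
}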
  • Test
> f1(aas, hydrophobic_res)
[1] "AW" "II"
> f1("QFILVMD", hydrophobic_res)
[1] "FILVM"
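
To make the grouping step easier to follow, here is a minimal trace of the intermediate values that f1 builds for the example string. The variable names runs, grp and pieces are introduced only for this illustration and do not appear in the answer.

```r
aas <- "QAWDIIKRIDKK"
hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")

# characters outside the allowed set become NA
v1 <- strsplit(aas, "")[[1]]
v1[!v1 %in% hydrophobic_res] <- NA
# v1: NA "A" "W" NA "I" "I" NA NA "I" NA NA NA

# run-length encode the "is allowed" flag
runs <- rle(!is.na(v1))
# runs$values:  FALSE TRUE FALSE TRUE FALSE TRUE FALSE
# runs$lengths: 1 2 1 2 2 1 3

# one group id per run, repeated once per character in that run
grp <- rep(seq_along(runs$values), runs$lengths)
# grp: 1 2 2 3 4 4 5 5 6 7 7 7

# paste the non-NA characters of each group; NA-only groups become ""
pieces <- tapply(v1, grp, function(x) paste(x[!is.na(x)], collapse = ""))
# pieces: "" "AW" "" "II" "" "I" ""
```

The final line of f1 then keeps only the pieces whose length equals the maximum, which is 2 here.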

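A regex-based option: build a pattern that matches every character not in matchvec, use gsub to replace those characters with a delimiter, split on that delimiter, and subset by character count.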

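f2 <- function(str1, matchvec) {
  # negated character class: anything that is NOT in matchvec
  pat <- sprintf("[^%s]", paste(matchvec, collapse = ""))
  # turn every non-matching character into a comma, then split on commas
  v1 <- strsplit(gsub(pat, ",", str1), ",")[[1]]
  # keep the fragment(s) of maximal length
  v1[nchar(v1) == max(nchar(v1))]
}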
  • Test
> f2(aas, hydrophobic_res)
[1] "AW" "II"
> f2("QFILVMD", hydrophobic_res)
[1] "FILVM"
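
Since many strings have to be tested, a variant that takes a whole character vector at once may be convenient. The sketch below is not part of the answer: it swaps the delete-and-split idea for extracting the matching runs directly with base R's gregexpr and regmatches. The name longest_runs is hypothetical, and the code assumes matchvec contains only plain letters (no regex metacharacters such as ] or ^).

```r
# Hypothetical helper, not from the answer: vectorized over `strs`.
# Assumes `matchvec` holds plain letters that are safe inside [...].
longest_runs <- function(strs, matchvec) {
  # character class matching one or more allowed characters in a row
  pat <- sprintf("[%s]+", paste(matchvec, collapse = ""))
  # one character vector of matching runs per input string
  runs <- regmatches(strs, gregexpr(pat, strs))
  lapply(runs, function(x) {
    # a string without any allowed character yields an empty result
    if (length(x) == 0) return(character(0))
    x[nchar(x) == max(nchar(x))]
  })
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
res <- longest_runs(c("QAWDIIKRIDKK", "QFILVMD", "KKK"), hydrophobic_res)
```

The result is a list with one element per input string, so ties such as "AW" and "II" are kept together.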

ozxc1zmp2#

Here is an alternative approach. For me, it is easier to solve this kind of task by thinking in terms of tibbles or data frames:

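library(data.table)  # rleid()
library(dplyr)
library(stringr)     # str_split()
str_split(aas, "")[[1]] %>% 
  as_tibble() %>% 
  mutate(flag = grepl(paste(hydrophobic_res, collapse = "|"), value)) %>% 
  group_by(group = rleid(flag == TRUE)) %>% 
  # keep only the runs of matching characters
  filter(flag == TRUE) %>% 
  # one row per run, with its characters pasted together
  summarise(string = paste(value, collapse = "")) %>% 
  # keep the run(s) of maximal length
  filter(nchar(string) == max(nchar(string))) %>% 
  pull(string)
[1] "AW" "II"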

vmdwslir3#

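Since you mentioned that speed matters, consider using stringi, which is optimized for this kind of task. An advantage is that it is also easy to vectorize: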

library(stringi)

find_longest <- function(strng, pat) {
  pats <- if (is.list(pat)) {
    sapply(pat, \(x) stri_join(c("[", x, "]+"), collapse = ""))
  } else {
    stri_join(c("[", pat, "]+"), collapse = "")
  }
  res <- stri_extract_all(strng, regex = pats)
  lapply(res, \(x) {
    nc <- nchar(x)
    x[nc == max(nc)]
  })
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
aas <- "QAWDIIKRIDKK"
aas2 <- "QFILVMD"

find_longest(c(aas, aas2), hydrophobic_res)

[[1]]
[1] "AW" "II"

[[2]]
[1] "FILVM"

dkqlctbz4#

I would suggest the following. I have not tested it, but since it uses vectorized operations, it should be fast.

library(stringr)

get_longest_fragment <- function(aa, res) {
  aa_vec <- str_split_1(aa, "")
  # pad with FALSE on both sides so a run at either end of the string is detected
  delta <- diff(c(FALSE, aa_vec %in% res, FALSE))
  
  # find start and end of TRUE stretches
  starts <- which(delta == 1)
  ends   <- which(delta == -1) - 1
  
  len <- ends - starts + 1
  longest <- len == max(len)
  
  # index the aa sequence 
  str_sub(aa, starts[longest], ends[longest])
}

get_longest_fragment(aas, hydrophobic_res)
#> [1] "AW" "II"
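
The start/end bookkeeping can also be expressed with base R's rle, which handles a run at the very end of the string without extra padding. This is only a sketch under stated assumptions, not the answerer's method: the name longest_runs_rle is hypothetical, and it takes one string at a time.

```r
# Hypothetical helper, not from the answer: works on a single string `s`.
longest_runs_rle <- function(s, res) {
  # TRUE where the character belongs to the allowed set
  hit <- strsplit(s, "")[[1]] %in% res
  r <- rle(hit)
  # no allowed character at all: return an empty result
  if (!any(r$values)) return(character(0))
  # end and start position of every run (TRUE and FALSE runs alike)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1L
  # keep TRUE runs whose length equals the longest TRUE run
  keep <- r$values & r$lengths == max(r$lengths[r$values])
  substring(s, starts[keep], ends[keep])
}

hydrophobic_res <- c("W", "F", "I", "L", "V", "M", "C", "A", "G")
longest_runs_rle("QAWDIIKRIDKK", hydrophobic_res)
longest_runs_rle("KKAW", hydrophobic_res)  # run at the end of the string
```

To process many strings, the function can be applied with lapply over a character vector. Whether this is faster than the regex-based answers would have to be measured on the real data.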
