How to optimize a function that uses nested for loops in R

nhjlsmyf  posted on 2023-04-09  in Other

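I'm trying to write a program that compares a subsequence of nodes against a sequence of nodes, to see how often the subsequence appears in each sequence with roughly 80% accuracy. To do this, I wrote the following function: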

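lcs_length = function(code, subs){
  n = length(code)
  m = length(subs)
  # mat[i, j] holds the length of the longest common subsequence
  # of code[1:(i-1)] and subs[1:(j-1)]; row 1 and column 1 stay 0
  mat = matrix(0, n+1, m+1)
  for(i in 1:(n+1)){
    for(j in 1:(m+1)){
      previ = i - 1
      prevj = j - 1
      # `&&` is the scalar (short-circuit) operator meant for `if` conditions
      if(previ != 0 && prevj != 0){
        if(code[[previ]] == subs[[prevj]]){
          mat[i,j] = mat[previ,prevj] + 1
        } else {
          mat[i,j] = max(mat[previ,j], mat[i,prevj])
        }
      }
    }
  }
  print(mat)
  # `return()` only works inside a function, so the loop is wrapped in one
  return(mat[n+1,m+1])
}

code = c("A", "B", "C", "D", "E")
subs = c("A", "C", "D")
lcs_length(code, subs)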

But it gets very slow when I work with larger datasets. Is there a way to optimize this loop, or to do the analysis without loops altogether?

iszxjhcz  #1

If you only care about exact matches:

  • Convert `code` and `subs` to strings, like so:
code_string = paste(code, collapse = '')

(likewise for subs_string)

  • Use {stringr} to find the matches:
library(stringr)
str_locate_all(string = 'ABCXABCABC', pattern = 'ABC')
[[1]]
     start end
[1,]     1   3
[2,]     5   7
[3,]     8  10
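
The two steps above can also be sketched end to end without extra packages. This is only a minimal sketch: `count_exact` is a hypothetical helper name, base R's `gregexpr` is swapped in for `str_locate_all`, and every node label is assumed to be a single character (otherwise pasting with `collapse = ''` makes the positions ambiguous).

```r
# Hypothetical helper: count non-overlapping exact occurrences of `subs`
# inside `code`. Assumes single-character node labels.
count_exact <- function(code, subs) {
  code_string <- paste(code, collapse = "")
  subs_string <- paste(subs, collapse = "")
  # fixed = TRUE treats the pattern as a literal string, not a regex
  starts <- gregexpr(subs_string, code_string, fixed = TRUE)[[1]]
  # gregexpr() reports -1 when there is no match at all
  if (starts[1] == -1) 0L else length(starts)
}

# The question's data: "ACD" never occurs contiguously in "ABCDE"
count_exact(c("A", "B", "C", "D", "E"), c("A", "C", "D"))

# The string used in the {stringr} example above
count_exact(strsplit("ABCXABCABC", "")[[1]], c("A", "B", "C"))
```

Because the search runs inside a single vectorised call, no R-level loop over positions is needed.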

If you want to allow a certain amount of deviation, you can:

  • Split `code_string` into consecutive fragments of length `frame_size` (e.g. length 4 when looking for a 3-letter pattern):
frame_size = 4
nc = nchar(code_string)

fragments <- substring(code_string,
                       1:(nc - frame_size + 1),
                       frame_size:nc
                       )
> fragments
[1] "ABCD" "BCDE"

  • Use the Levenshtein distance (dissimilarity) between each fragment and your pattern:
library(comparator)

fragments |> 
  ## `Map` = do this for each item of fragments
  Map(f = \(fragment){ifelse(grepl(subs_string, fragment),
                             ## return 0 for exact match:
                             0,
                             ## penalize insertion or substitution of letters
                             ## in `fragment` for later filtering:
                             Levenshtein(insertion = 100,
                                         deletion = 1,
                                         substitution = 100
                                         )(fragment, subs_string)
                             )
  }
)

Output:

$ABCD
[1] 1 ## this score is OK, we only need to delete one letter from 'ABCD'
      ## to match 'ACD'

$BCDE
[1] 101 ## this score tells us that a letter would have to be inserted into
        ## and/or substituted in 'BCDE', which we don't want
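
  • Now you can filter the output for scores < 2 (at most one letter deleted from a 4-letter fragment to match your 3-letter pattern; this is roughly your 80% similarity)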
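
The three steps above can be put together roughly as follows. This is a sketch under stated assumptions, not the exact code from this answer: base R's `utils::adist` stands in for the {comparator} call (its `costs` argument accepts the same three penalties), `find_fuzzy` is a hypothetical helper name, and node labels are again assumed to be single characters.

```r
# Hypothetical helper: return the windows of `code_string` that contain
# `subs_string` exactly, or match it after deleting at most
# `max_deletions` letters from the window.
find_fuzzy <- function(code_string, subs_string, frame_size, max_deletions = 1) {
  nc <- nchar(code_string)
  if (nc < frame_size) return(character(0))  # no full window fits

  # Vectorised sliding windows of length `frame_size`
  fragments <- substring(code_string, 1:(nc - frame_size + 1), frame_size:nc)

  # Weighted edit distance from each fragment to the pattern: deletions are
  # cheap, insertions and substitutions are penalised so they are filtered out
  scores <- utils::adist(
    fragments, subs_string,
    costs = c(insertions = 100, deletions = 1, substitutions = 100)
  )[, 1]

  # A window that contains the pattern verbatim counts as an exact hit (score 0)
  scores[grepl(subs_string, fragments, fixed = TRUE)] <- 0

  fragments[scores <= max_deletions]
}

# Only "ABCD" is one deletion away from "ACD"; "BCDE" is rejected
find_fuzzy("ABCDE", "ACD", frame_size = 4)
```

All windows are scored in one vectorised call, so the nested R-level loop from the question is not needed for this kind of check.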
