How can I optimize my function that uses a nested for loop in R?

Posted by nhjlsmyf on 2023-04-09 in Other


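I'm trying to write a program that compares a subsequence of nodes against a sequence of nodes, to see how often the subsequence appears in each sequence with roughly 80% accuracy. To do this, I put together the following function: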

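# Length of the longest common subsequence (LCS) of `code` and `subs`.
# The wrapper name `lcs_length` is illustrative.
lcs_length = function(code, subs) {
  n = length(code)
  m = length(subs)
  # mat[i, j]: LCS length of the first i-1 items of code and the first j-1 of subs
  mat = matrix(0, n + 1, m + 1)
  for (i in 1:(n + 1)) {
    for (j in 1:(m + 1)) {
      previ = i - 1
      prevj = j - 1
      # row 1 and column 1 stay 0 (empty prefix)
      if (previ != 0 && prevj != 0) {
        if (code[[previ]] == subs[[prevj]]) {
          mat[i, j] = mat[previ, prevj] + 1
        } else {
          mat[i, j] = max(mat[previ, j], mat[i, prevj])
        }
      }
    }
  }
  print(mat)
  return(mat[n + 1, m + 1])
}

code = c("A", "B", "C", "D", "E")
subs = c("A", "C", "D")
lcs_length(code, subs)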

However, it is very slow when I work with larger datasets. Is there a way to optimize this loop, or to run the analysis without a loop at all?

r

Source: https://stackoverflow.com/questions/75922000/how-can-i-optimize-my-function-that-uses-a-nested-for-loop-in-r

1 Answer


iszxjhcz1#

If you only care about exact matches:

Collapse code and subs into strings like this:

code_string = paste(code, collapse = '')

(and likewise for subs_string)

Use {stringr} to locate the matches:

library(stringr)
str_locate_all(string = 'ABCXABCABC', pattern = 'ABC')

[[1]]
     start end
[1,]     1   3
[2,]     5   7
[3,]     8  10
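The exact-match steps above can be sketched end to end as follows. This is only a minimal sketch: it swaps in base R's `gregexpr()` with `fixed = TRUE` instead of {stringr}, the helper name `count_exact` is hypothetical, and collapsing with an empty separator assumes every node label is a single character.

```r
code <- c("A", "B", "C", "D", "E")
subs <- c("A", "C", "D")

# Collapse both vectors into plain strings (assumes one-character labels)
code_string <- paste(code, collapse = "")
subs_string <- paste(subs, collapse = "")

# Hypothetical helper: number of non-overlapping literal occurrences
count_exact <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  # gregexpr() reports -1 when there is no match at all
  if (hits[1] == -1) 0L else length(hits)
}

count_exact(subs_string, code_string)   # "ACD" is not contiguous in "ABCDE"
count_exact("ABC", "ABCXABCABC")
```

With multi-character node labels, a separator that cannot occur inside a label would be needed in `paste()` to avoid false matches across label boundaries.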

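If you want to allow a certain amount of deviation, you can:

Split code_string into consecutive fragments of length frame_size (e.g. length 4 when looking for a 3-letter pattern):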

frame_size = 4
nc = nchar(code_string)

fragments <- substring(code_string,
                       1:(nc - frame_size + 1),
                       frame_size:nc
                       )

> fragments
[1] "ABCD" "BCDE"
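The sliding-window split can also be wrapped in a small helper. The sketch below is one possible way to do it; the name `split_fragments` is hypothetical, and the early return is an added guard for strings shorter than the frame, where `1:(nc - frame_size + 1)` would otherwise count downwards.

```r
# Hypothetical helper: all consecutive substrings of length `frame_size`
split_fragments <- function(code_string, frame_size) {
  nc <- nchar(code_string)
  # No full-length fragment fits into a string shorter than the frame
  if (nc < frame_size) return(character(0))
  starts <- seq_len(nc - frame_size + 1)
  substring(code_string, starts, starts + frame_size - 1)
}

fragments <- split_fragments("ABCDE", 4)
fragments
```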

Compute the Levenshtein distance (dissimilarity) between each fragment and your pattern:

library(comparator)

fragments |> 
  ## `Map` = do this for each item of fragments
  Map(f = \(fragment){ifelse(grepl(subs_string, fragment),
                             ## return 0 for exact match:
                             0,
                             ## penalize insertion or substitution of letters
                             ## in `fragment` for later filtering:
                             Levenshtein(insertion = 100,
                                         deletion = 1,
                                         substitution = 100
                                         )(fragment, subs_string)
                             )
  }
)

Output:

$ABCD
[1] 1 ## this score is OK, we only need to delete one letter from 'ABCD'
      ## to match 'ACD'

$BCDE
[1] 101 ## this score tells us that a letter has been inserted and/or
        ## substituted in 'BCDE', which we don't want

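Now you can filter the output for scores < 2 (at most one letter deleted from a 4-letter string to match your 3-letter string; this is your 80% similarity).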
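The filtering step could look roughly like this. As an assumption for the sake of a dependency-free sketch, it uses base R's `utils::adist()` with weighted costs in place of {comparator}; `adist(x, y)` measures the edits needed to turn each fragment `x` into the pattern `y`, so a cheap deletion and expensive insertion or substitution mirror the weights used above. The variable names `scores` and `matches` are illustrative.

```r
code_string <- "ABCDE"
subs_string <- "ACD"
frame_size  <- 4
nc <- nchar(code_string)

fragments <- substring(code_string,
                       1:(nc - frame_size + 1),
                       frame_size:nc)

# Weighted edit distance from each fragment to the pattern:
# deleting a letter from the fragment is cheap, everything else is expensive
scores <- as.vector(utils::adist(fragments, subs_string,
                                 costs = c(insertions = 100,
                                           deletions = 1,
                                           substitutions = 100)))
names(scores) <- fragments

# Keep fragments that need at most one deletion to become the pattern
matches <- names(scores)[scores < 2]
length(matches)   # how often the pattern occurs within the tolerance
```

The threshold of 2 is tied to frame_size being exactly one longer than the pattern; a different frame size would need a different cutoff.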

2023-04-09
