R中最长的公共子串，查找两个字符串之间的非连续匹配

w80xi6nr 于 2023-05-20 发布在其他

关注(0)|答案(7)|浏览(137)

我有一个关于在R中寻找最长公共子串的问题。在搜索StackOverflow上的一些帖子时，我了解了qualV包。然而，我看到这个包中的LCS函数实际上找到了string1中存在于string2中的所有字符，即使它们不连续。
为了解释，如果字符串是string1：“hello”string2：“hel12345lo”我期望输出为hel，但我得到的输出为hello。我一定是做错了什么。请看下面的代码。

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

我也尝试了Rlibstree方法，但我仍然得到不连续的子字符串。此外，子字符串的长度也与我的预期不符。

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

来源：https://stackoverflow.com/questions/28261825/longest-common-substring-in-r-finding-non-contiguous-matches-between-the-two-str

7条答案

按热度按时间

vuktfyat1#

这里有三种可能的解决方案。

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

base R中还有adist()和agrep()，stringdist包中有一些运行LCS方法的函数。下面我们来看看stringsidt。它返回未配对的字符数。

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

现在我已经对此进行了更多的探讨，我认为adist()可能是一条可行的道路。如果我们设置counts=TRUE，我们得到一个匹配，插入等序列。所以如果你把它给stri_locate()，我们就可以用这个矩阵来得到从a到B的匹配。

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

因此，M值表示直接匹配。我们可以用stri_sub()得到子串

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

对不起，我没有解释得很好，因为我不太精通字符串距离算法。

赞(0）回复(0）举报 2023-05-20

gr8qqesn2#

利用@RichardScriven的洞察力adist could be used（它计算“近似字符串距离”。我做了一个功能更全面。请注意，"trafos"代表用于确定两个字符串之间的“距离”的“转换”（底部的示例）

编辑此答案可能会产生错误/意外结果;正如@wdkrnls所指出的：

我对“apple”和“big apple bagels”运行你的函数，它返回“appl”。我以为是“苹果”。
请参阅下面的错误结果解释。我们首先使用一个函数来获取列表中的longest_string：

longest_string <- function(s){return(s[which.max(nchar(s))])}

然后我们可以使用@RichardSriven的工作和stringi库：

library(stringi)
lcsbstr <- function(a,b) { 
  sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
   return(longest_cmn_sbstr) 
}

或者我们可以重写代码以避免使用任何外部库（仍然使用R的原生adist函数）：

lcsbstr_no_lib <- function(a,b) { 
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
    lengths<- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
    return(longest_cmn_sbstr ) 
}

以上两个函数都只将'hello '标识为最长的公共子串，而不是'hello r'（无论哪个参数是两者中较长的）：

identical('hello',
    lcsbstr_no_lib('hello', 'hello there'), 
    lcsbstr(       'hello', 'hello there'),
    lcsbstr_no_lib('hello there', 'hello'), 
    lcsbstr(       'hello there', 'hello'))

最后编辑*注意一些奇怪的行为 *，结果如下：

lcsbstr('hello world', 'hello')
#[1] 'hell'

我期待'hello'，但由于转换实际上移动（通过删除）world中的“o”成为hello中的“o”-根据M，只有 hell 部分被认为是匹配：

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

使用this Levenstein tool可以观察到这种行为，它给出了两种可能的解决方案，相当于这两种转换

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

我不知道我们是否可以将adist配置为优先选择一种解决方案？（转换具有相同的“权重”-相同数量的“M”和“D”-不知道如何选择具有更大数量的连续M的转换）
最后，不要忘记adist允许传入ignore.case = TRUE（默认值为FALSE）

adist的"trafos"属性的密钥;从一个字符串到另一个字符串的“转换”：

转换序列作为返回值的“trafos”属性返回，作为具有元素M、I、D和S的字符串，指示匹配、插入、删除和替换

赞(0）回复(0）举报 2023-05-20

iswrvxsc3#

我不知道你做了什么得到你的输出“你好”。基于下面的试错实验，LCS函数似乎将（a）如果一个字符跟在本来是LCS的字符后面，则不会将该字符串视为LCS;（B）找到多个等长的LCS（不像sub（）只找到第一个LCS）;（c）字符串中元素的顺序无关紧要--下面没有说明;以及（B）字符串在LCS调用中的顺序无关紧要--也没有显示。
所以，a的“hello”在B中没有LCS，因为b的“hel”后面跟着一个字符。这是我目前的假设
上面A点：

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

上面B点：

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

赞(0）回复(0）举报 2023-05-20

tyky79it4#

df <- data.frame(A. = c("Australia", "Network"),
                 B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)

 auxFun <- function(x) {

   a <- strsplit(x[[1]], "")[[1]]
   b  <- strsplit(x[[2]], "")[[1]]
   lastchar <- suppressWarnings(which(!(a == b)))[1] - 1

   if(lastchar > 0){
     out <- paste0(a[1:lastchar], collapse = "")
   } else {
     out <- ""
   }

   return(out)
 }

 df$C. <- apply(df, 1, auxFun)

 df
 A.         B.    C.
 1 Australia    Austria Austr
 2   Network Netconnect   Net

赞(0）回复(0）举报 2023-05-20

deikduxw5#

使用biostrings：

library(Biostrings)
a= "hello"
b="hel123l5678o"
astr= BString(a)
bstr=BString(b)

pmatchPattern(astr, bstr)

Views on a 12-letter BString subject
Subject: hel123l5678o
views:
      start end width
  [1]     1   3     3 [hel]
  Views on a 5-letter BString pattern
Pattern: hello
views:
      start end width
  [1]     1   3     3 [hel]

所以我做了一个基准测试，虽然我的答案确实做了这件事，实际上给了你更多的信息，它比@Rich Scriven慢500倍。

system.time({
a= "hello"
b="123hell5678o"
rounds=100
for (i in 1:rounds) {
astr= BString(a)
bstr=BString(b)
pmatchPattern(astr, bstr)
}
})

system.time({
  c= "hello"
  d="123hell5678o"
  rounds=100
  for (i in 1:rounds) {
ta <- drop(attr(adist(c, d, counts=TRUE), "trafos"))
stri_sub(d, stri_locate_all_regex(ta, "M+")[[1]])
}
})

   user  system elapsed 
  2.476   0.027   2.510 

   user  system elapsed 
  0.006   0.000   0.005

赞(0）回复(0）举报 2023-05-20

ubbxdtey6#

我根据我的目的改编了@Rich Scriven的答案。目标是在向量中找到最长的公共字符串，而不是两个字符串之间的字符串。最后，可以在数据表中按组使用它。

示例：

library(data.table)
library(stringi)

# create the function ------------------------------------

get.lcs.vector <- function(your.vector) {
  
  # get longest common string
  get.lcs <- function(x, y) {
    # get longest common string
    sb <- stri_sub(y, 1, 1:nchar(y))
    sstr <- na.omit(stri_extract_all_coll(x, sb, simplify=TRUE))
    result <- sstr[which.max(nchar(sstr))]
    return(result)
  }
  combi <- data.table(expand.grid(your.vector, your.vector, stringsAsFactors = F))[Var1 != Var2]
  combi.result <- unique(mapply(get.lcs, combi[[1]], combi[[2]]))
  lcs <- combi.result[which.min(nchar(combi.result))]
  return(lcs)
}

# example of data ------------------------------------

dt <- data.table(AN = c("APILCASERNB", "APILCASELNB", "APILCASEYHANB", 
                        "A15DPGY", "A15DPRD", "A15DPWH", "A15DPDB", "A15DPYW", "A15DPTL", 
                        "A15DP4PGY", "A15DP4PRD", "A15DP4PWH", "A15DP4PDB", "A15DP4PYW", 
                        "A15DP4PTL"),
                 Name = c("Example1", "Example1", "Example1", "Example2", 
                          "Example2", "Example2", "Example2", "Example2", "Example2", "Example3", 
                          "Example3", "Example3", "Example3", "Example3", "Example3"))

dt

##                AN     Name
##  1:   APILCASERNB Example1
##  2:   APILCASELNB Example1
##  3: APILCASEYHANB Example1
##  4:       A15DPGY Example2
##  5:       A15DPRD Example2
##  6:       A15DPWH Example2
##  7:       A15DPDB Example2
##  8:       A15DPYW Example2
##  9:       A15DPTL Example2
## 10:     A15DP4PGY Example3
## 11:     A15DP4PRD Example3
## 12:     A15DP4PWH Example3
## 13:     A15DP4PDB Example3
## 14:     A15DP4PYW Example3
## 15:     A15DP4PTL Example3

# smaller exmaple ------------------------------------

dt.ex <- dt[Name == unique(Name)[1]]
dt.ex

##               AN     Name
## 1:   APILCASERNB Example1
## 2:   APILCASELNB Example1
## 3: APILCASEYHANB Example1

get.lcs.vector(dt.ex$AN)

## [1] "APILCASE"

# you can also start from end like this
stri_reverse(get.lcs.vector(stri_reverse(dt.ex$AN)))


# Example on all data.table ------------------------------------

dt[, AN2 := get.lcs.vector(AN), Name]
dt

##                AN     Name      AN2
##  1:   APILCASERNB Example1 APILCASE
##  2:   APILCASELNB Example1 APILCASE
##  3: APILCASEYHANB Example1 APILCASE
##  4:       A15DPGY Example2    A15DP
##  5:       A15DPRD Example2    A15DP
##  6:       A15DPWH Example2    A15DP
##  7:       A15DPDB Example2    A15DP
##  8:       A15DPYW Example2    A15DP
##  9:       A15DPTL Example2    A15DP
## 10:     A15DP4PGY Example3  A15DP4P
## 11:     A15DP4PRD Example3  A15DP4P
## 12:     A15DP4PWH Example3  A15DP4P
## 13:     A15DP4PDB Example3  A15DP4P
## 14:     A15DP4PYW Example3  A15DP4P
## 15:     A15DP4PTL Example3  A15DP4P

赞(0）回复(0）举报 2023-05-20

rnmwe5a27#

LCSn函数（PTXQC包）可以找到输入向量中所有字符串的最长公共字符串。一个警告是，最短的字符串被用作基，因此在比较多个字符串时，它可能不会给予您想要的结果。但是，比较两个序列是一个很好的选择。

library(PTXQC)
LCSn(c("hello","hello world"))
#[1] "hello"
a= "hello"
b="hel123l5678o" 
LCSn(c(a,b))
#[1] "hel"

赞(0）回复(0）举报 2023-05-20

我来回答

R中最长的公共子串，查找两个字符串之间的非连续匹配

7条答案

相关问题

热门标签

最新问答