regex 查找字符串中最长的重复序列

4szc88ey 于 2023-02-25 发布在其他

关注(0)|答案(7)|浏览(163)

我需要找到一个字符串中最长的序列，但这个序列必须重复三次或更多次。例如，如果我的字符串是：

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

那么我希望返回值“helloworld”。
我知道有几种方法可以完成这个任务，但我面临的问题是，实际的字符串大得离谱，所以我真的在寻找一种方法，可以及时地完成它。

regex

来源：https://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string

7条答案

按热度按时间

0yg35tkg1#

此问题是longest repeated substring problem的一个变体，其中有一个O（n）-时间算法来解决它使用suffix trees。（如Wikipedia所建议的）是构造后缀树时间复杂度为O（n），需要在所有节点上添加子节点的数目（时间O（n），使用DFS），然后找到树中具有至少三个后代的最深节点（时间O（n），使用DFS）。
也就是说，后缀树是出了名的难以构造，所以在尝试这个实现之前，您可能希望找到一个实现后缀树的Python库，快速搜索一下Google就会找到this library，尽管我不确定这是否是一个好的实现。
另一种选择是将suffix arrays与LCP arrays结合使用。您可以迭代LCP数组中的相邻元素对，取每对元素的最小值，并存储通过这种方法找到的最大值。这将对应于重复至少三次的最长字符串的长度，然后您可以从那里读取字符串本身。
有几种简单的算法可以用来构建后缀数组（Manber-Myers算法运行时间为O（n log n），编写代码并不难），而Kasai的算法构建LCP数组的时间为O（n），编写代码相当简单。
希望这有帮助!

赞(0）回复(0）举报 2023-02-25

ttygqcqt2#

使用defaultdict来统计输入字符串中每个位置开始的每个子字符串。OP不清楚重叠匹配是否应该包括在内，这个暴力方法包括了它们。

from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

印刷品：

('helloworld', 3)

赞(0）回复(0）举报 2023-02-25

j8yoct9x3#

让我们从末尾开始，计算频率，一旦最频繁的元素出现3次或更多，就停止。

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1)[::-1]:
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]>=3:
        seq=freqs.most_common(1)[0][0]
        break
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)

结果：

>>> sequence 'helloworld' of length 10 occurs 3 or more times

- 编辑：**如果你觉得你在处理随机输入，并且公共子字符串应该是小长度的，你最好从小的子字符串开始（如果你需要速度的话），当你找不到至少出现3次的子字符串时就停止：

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
for n in range(1,len(a)/times+1):
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]<3:
        n-=1
        break
    else:
        seq=freqs.most_common(1)[0][0]
print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)

结果与上述相同。

赞(0）回复(0）举报 2023-02-25

nfg76nw04#

首先想到的是使用逐渐增大的正则表达式进行搜索：

import re

text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1

while 1:
    m = re.search("(" + ("\w" * i) + ").*\\1.*\\1", text)
    if not m:
        break
    largest = m.group(1)
    i += 1

print largest    # helloworld

时间复杂度至少为O（n^2）。

赞(0）回复(0）举报 2023-02-25

np8igboo5#

如果反转输入字符串，则将其提供给(.+)(?:.*\1){2}之类的正则表达式
它应该给你最长的字符串重复3次。（反向捕获组1的答案）

编辑：*

我不得不用这种方式取消，它依赖于第一个匹配，除非它测试了当前长度和最大长度，在迭代循环中，正则表达式不起作用。

赞(0）回复(0）举报 2023-02-25

qlckcl4x6#

在Python中，你可以使用字符串计数方法，我们还使用了一个额外的生成器，它将为我们的示例字符串生成给定长度的所有唯一子字符串。
代码很简单：

test_string2 = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

def generate_substrings_of_length(this_string, length):
    ''' Generates unique substrings of a given length for a given string'''
    for i in range(len(this_string)-2*length+1):
        yield this_string[i:i+length]

def longest_substring(this_string):
    '''Returns the string with at least two repetitions which has maximum length'''
    max_substring = ''
    for subs_length in range(2, len(this_string) // 2 + 1):
        for substring in generate_substrings_of_length(this_string, subs_length):
            count_occurences = this_string.count(substring)
            if count_occurences > 1 :
                if len(substring) > len(max_substring) :
                    max_substring = substring
    return max_substring

我必须在这里指出（这一点很重要），generate_substrings_of_length生成器并不生成所有特定长度的子字符串，它只会生成比较所需的子字符串，否则会产生一些人为的重复，例如下面的例子：

test_string = "banana"

GS = generate_substrings_of_length(test_string , 2)
for i in GS: print(i)

将导致：

ba
an
na

这就足够我们需要的了。

赞(0）回复(0）举报 2023-02-25

uqdfh47h7#

from collections import Counter

def Longest(string):

    b = []
    le = []

    for i in set(string):

        for j in range(Counter(string)[i]+1): 
            b.append(i* (j+1))

    for i in b:
        if i in string:
            le.append(i)

    return ([s for s in le if len(s)==len(max( le , key = len))])

赞(0）回复(0）举报 2023-02-25

我来回答

regex 查找字符串中最长的重复序列

7条答案

相关问题

热门标签

最新问答