regex: How can I do multiple substitutions using regular expressions?

oxcyiej7  posted on 2023-02-10  in Other

I can use the code below to create a new file in which every a is replaced with aa, using a regular expression:

import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)

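Do I have to repeat the line new_text = re.sub("a", "aa", text.read()) once for every other letter I want to change, substituting the relevant strings each time, in order to change more than one letter in the text?
That is, a-->aa, b-->bb, c-->cc.
So do I have to write one such line for every letter I want to change, or is there an easier way? Perhaps by creating a "dictionary" of translations. Should I put the letters into an array? If I did, I am not sure how I would call on them.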


pftdvrlh1#

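The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which uses code that is less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.
A dictionary of translations makes sense in this context. In fact, that is how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/):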

import re

def multiple_replace(replacements, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

if __name__ == "__main__":

    text = "Larry Wall is the creator of Perl"

    replacements = {
        "Larry Wall": "Guido van Rossum",
        "creator": "Benevolent Dictator for Life",
        "Perl": "Python",
    }

    print(multiple_replace(replacements, text))

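So in your case, you could create a dict trans = {"a": "aa", "b": "bb"} and pass it to multiple_replace along with the text you want translated. All the function does is build one large regex that contains every string to be translated; whenever one of them is found, the lambda function passed to regex.sub looks up its replacement in the translation dictionary.
You can use this function while reading from your file, for example: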

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)
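
Applied to the letters from the question, this could look like the minimal sketch below. The sample string and the trans dictionary are illustrative assumptions, and the function is restated so that the snippet runs on its own.

```python
import re

def multiple_replace(replacements, text):
    # One alternation of all (escaped) keys, e.g. "a|b|c"
    regex = re.compile("|".join(map(re.escape, replacements)))
    # Each match is looked up in the dictionary exactly once
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

# Hypothetical translation table for the question's a->aa, b->bb, c->cc case
trans = {"a": "aa", "b": "bb", "c": "cc"}
result = multiple_replace(trans, "abc cab xyz")
print(result)
```

Because the text is scanned only once, the "aa" produced for an "a" is never looked at again, so no replacement can feed into another one.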

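I have actually used this exact method in production, in a web-scraping task where I needed to translate the months of the year from Czech into English.
As @nhahtdh pointed out, one downside of this approach is that it is not prefix-free: a dictionary key that is a prefix of another dictionary key will cause the method to break.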
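
One possible way around the prefix problem, sketched here as an assumption rather than as part of the cookbook recipe, is to sort the keys by length so that longer keys come first in the alternation. Python's re module tries alternatives from left to right, so a key can then no longer lose to one of its own prefixes. The function name multiple_replace_longest_first is made up for this illustration.

```python
import re

def multiple_replace_longest_first(replacements, text):
    if not replacements:
        # An empty alternation would match the empty string everywhere
        return text
    # Longer keys first, so "ab" is tried before its prefix "a"
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

print(multiple_replace_longest_first({"a": "1", "ab": "2"}, "ab a"))
```

With the keys in their original order, the pattern would be "a|ab" and the "ab" in the input would be consumed as "a" followed by an unmatched "b".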


s71maibg2#

You can use a capturing group and a backreference:

re.sub(r"([characters])", r"\1\1", text.read())

Put the characters that you want to double between the []. For lowercase a, b and c:

re.sub(r"([abc])", r"\1\1", text.read())

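In the replacement string, you can refer to whatever a capturing group () matched with the notation \n, where n is a positive integer (0 excluded). \1 refers to the first capturing group. There is another notation, \g<n>, where n can be any non-negative integer (0 allowed); \g<0> refers to the whole text matched by the expression.
If you want to double every character except the newline: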

re.sub(r"(.)", r"\1\1", text.read())

If you want to double every character, newline included:

re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)
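
The two backreference notations can be compared side by side in a short sketch. The sample string is an assumption, and the variable names are only illustrative.

```python
import re

text = "abc\nxyz"

# \1 refers to the first capturing group
doubled_abc = re.sub(r"([abc])", r"\1\1", text)

# \g<0> refers to the whole match, so no capturing group is needed
doubled_whole = re.sub(r"[abc]", r"\g<0>\g<0>", text)

# With re.S the dot also matches the newline, so it is doubled as well
doubled_all = re.sub(r".", r"\g<0>\g<0>", text, flags=re.S)

print(repr(doubled_abc), repr(doubled_whole), repr(doubled_all))
```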

chhkpiq43#

You can use the pandas library and its replace function. Here is an example with five replacements:

import pandas as pd

df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})

# Each pattern in to_replace is paired with the entry at the same position in replace_with
to_replace = ['Billy', 'Rome', 'January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with = ['name', 'city', 'month', 'time', 'date']

print(df.text.replace(to_replace, replace_with, regex=True))

The modified text is:

0    name is going to visit city in month
1                      I was born in date
2                 I will be there at time

You can find the example here.


f0brbegy4#

None of the other solutions work if your patterns are themselves regular expressions.
For that, you need:

import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)

Which can be used as:

>>> multi_sub([
...     ('a+b', 'Ab'),
...     ('b', 'B'),
...     ('a+', 'A.'),
... ], "aabbaa")  # matches as (aab)(b)(aa)
'AbBA.'

Note that this solution does not allow you to put capturing groups in your regexes, or to use them in the replacements.
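
Under the same restriction (no capturing groups inside the individual patterns), the lookup can also be written with the match object's lastindex attribute, which holds the index of the last capturing group that matched. This is a sketch of a variation, not the answer's own code; the name multi_sub_lastindex is made up here.

```python
import re

def multi_sub_lastindex(pairs, s):
    # Wrap every pattern in its own group: (p1)|(p2)|...
    pattern = "|".join("({})".format(patt) for patt, _ in pairs)
    # Exactly one top-level group takes part in each match, so
    # m.lastindex is its 1-based position in `pairs`
    return re.sub(pattern, lambda m: pairs[m.lastindex - 1][1], s)

print(multi_sub_lastindex([("a+b", "Ab"), ("b", "B"), ("a+", "A.")], "aabbaa"))
```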


bnl4lu3b5#

Using a trick from "how to make a 'stringy' class", we can make an object that behaves like a string but has an extra sub method:

import re
class Substitutable(str):
  def __new__(cls, *args, **kwargs):
    newobj = str.__new__(cls, *args, **kwargs)
    newobj.sub = lambda fro,to: Substitutable(re.sub(fro, to, newobj))
    return newobj

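This allows the builder pattern, which looks nicer, but it only works for a predetermined number of substitutions. If you use it in a loop, there is no longer any point in creating the extra class.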

>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'

n3h0vuf26#

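I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1), mo.group(1)). Using myDict.get() also has the benefit of providing a default value when a key is not found.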

import re

OIDNameContraction = {
    'Fucntion': 'Func',
    'operated': 'Operated',
    'Asist': 'Assist',
    'Detection': 'Det',
    'Control': 'Ctrl',
    'Function': 'Func',
}

replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))

# oidDescriptionStr is assumed to already hold the text to process
oidDescriptionStr = replacementDictRegex.sub(lambda mo: OIDNameContraction.get(mo.group(1), mo.group(1)), oidDescriptionStr)
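
One situation where the default matters, shown here as an assumed scenario rather than the answerer's actual case, is case-insensitive matching: the regex can then match text that is not literally a dictionary key, and .get() leaves that text unchanged instead of raising KeyError. The dictionary and input below are illustrative.

```python
import re

contractions = {"Function": "Func", "Control": "Ctrl"}
regex = re.compile(
    "(%s)" % "|".join(map(re.escape, contractions)),
    re.IGNORECASE,
)

def contract(text):
    # Fall back to the matched text itself when it is not an exact key
    return regex.sub(lambda mo: contractions.get(mo.group(1), mo.group(1)), text)

result = contract("Function control")
print(result)
```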

0sgqnhkj7#

If you are working with files, I have some simple Python code for this problem. More info here.

import re

def multiple_replace(dictionary, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dictionary[mo.group(0)], text)

if __name__ == "__main__":

    dictionary = {
        "Wiley Online Library": "Wiley",
        "Chemical Society Reviews": "Chem. Soc. Rev.",
    }

    with open('LightBib.bib', 'r') as bib_read, open('Abbreviated.bib', 'w') as bib_write:
        # Rewrite the file line by line, applying every replacement to each line
        for line in bib_read:
            bib_write.write(multiple_replace(dictionary, line))

sg24os4d8#

Based on Eric's great answer, I came up with a more general solution that can handle capturing groups and backreferences:

import re
from itertools import islice

def multiple_replace(s, repl_dict):
    groups_no = [re.compile(pattern).groups for pattern in repl_dict]

    def repl_func(m):
        all_groups = m.groups()

        # Use 'i' as the index within 'all_groups' and 'j' as the main
        # group index.
        i, j = 0, 0

        while i < len(all_groups) and all_groups[i] is None:
            # Skip the inner groups and move on to the next group.
            i += (groups_no[j] + 1)

            # Advance the main group index.
            j += 1

        # Extract the pattern and replacement at the j-th position.
        pattern, repl = next(islice(repl_dict.items(), j, j + 1))

        return re.sub(pattern, repl, all_groups[i])

    # Create the full pattern using the keys of 'repl_dict'.
    full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    return re.sub(full_pattern, repl_func, s)

Example. Calling the above with:

s = 'This is a sample string. Which is getting replaced. 1234-5678.'

REPL_DICT = {
    r'(.*?)is(.*?)ing(.*?)ch': r'\3-\2-\1',
    r'replaced': 'REPLACED',
    r'\d\d((\d)(\d)-(\d)(\d))\d\d': r'__\5\4__\3\2__',
    r'get|ing': '!@#'
}

gives:

>>> multiple_replace(s, REPL_DICT)
'. Whi- is a sample str-Th is !@#t!@# REPLACED. __65__43__.'

For a more efficient solution, you can create a simple wrapper that precomputes groups_no and full_pattern, for example:

import re
from itertools import islice

class ReplWrapper:
    def __init__(self, repl_dict):
        self.repl_dict = repl_dict
        self.groups_no = [re.compile(pattern).groups for pattern in repl_dict]
        self.full_pattern = '|'.join(f'({pattern})' for pattern in repl_dict)

    def get_pattern_repl(self, pos):
        return next(islice(self.repl_dict.items(), pos, pos + 1))

    def multiple_replace(self, s):
        def repl_func(m):
            all_groups = m.groups()

            # Use 'i' as the index within 'all_groups' and 'j' as the main
            # group index.
            i, j = 0, 0

            while i < len(all_groups) and all_groups[i] is None:
                # Skip the inner groups and move on to the next group.
                i += (self.groups_no[j] + 1)

                # Advance the main group index.
                j += 1

            return re.sub(*self.get_pattern_repl(j), all_groups[i])

        return re.sub(self.full_pattern, repl_func, s)

Use it like this:

>>> ReplWrapper(REPL_DICT).multiple_replace(s)
'. Whi- is a sample str-Th is !@#t!@# REPLACED. __65__43__.'

n3h0vuf29#

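I don't know why most of the solutions try to compose a single regex pattern instead of replacing multiple times. This answer is just for the sake of completeness.
That said, the output of this approach differs from the output of the combined-regex approach: with repeated substitutions, the text can evolve from one step to the next. However, the following function returns the same output as a call to Unix sed would: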

import re

def multi_replace(rules, data: str) -> str:
    ret = data
    for pattern, repl in rules:
        ret = re.sub(pattern, repl, ret)
    return ret

Usage:

RULES = [
    (r'a', r'b'),
    (r'b', r'c'),
    (r'c', r'd'),
]
multi_replace(RULES, 'ab')  # output: dd

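With the same input and rules, the other solutions output "bc". Depending on your use case, you may or may not want strings to be replaced consecutively. In my case, I wanted to rebuild the sed behavior. Also note that the order of the rules matters: if you reverse the rule order, this example returns "bc" as well.
This solution is faster than combining the patterns into a single regex (by a factor of 100), so if your use case allows it, the repeated-replacement method should be preferred.
Of course, you can compile the regex patterns: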

class Sed:
    def __init__(self, rules) -> None:
        self._rules = [(re.compile(pattern), sub) for pattern, sub in rules]

    def replace(self, data: str) -> str:
        ret = data
        for regx, repl in self._rules:
            ret = regx.sub(repl, ret)
        return ret
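
The difference between consecutive and single-pass replacement can be reproduced with a small side-by-side sketch. Both helper functions below are assumptions written for this comparison; the single-pass variant treats the patterns as literal strings, as the dictionary-based answers do.

```python
import re

RULES = [("a", "b"), ("b", "c"), ("c", "d")]

def sequential_replace(rules, data):
    # Each rule sees the output of the previous one (sed-like behavior)
    for pattern, repl in rules:
        data = re.sub(pattern, repl, data)
    return data

def single_pass_replace(rules, data):
    # All keys are combined into one alternation and the text is scanned once
    mapping = dict(rules)
    regex = re.compile("|".join(map(re.escape, mapping)))
    return regex.sub(lambda mo: mapping[mo.group(0)], data)

sequential_result = sequential_replace(RULES, "ab")
single_pass_result = single_pass_replace(RULES, "ab")
print(sequential_result, single_pass_result)
```

In the sequential version the "b" produced by the first rule is picked up by the second rule and so on, while the single-pass version never revisits text it has already replaced.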
