regex: Converting quotation marks to LaTeX format with Python

Posted by 2ul0zpep on 2023-03-09 in Python
tl;dr version

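I have a paragraph that may contain quotation marks (e.g. "blah blah", "this one also", etc.). I need to replace them with LaTeX-style quotes (e.g. ``blah blah'', ``this one also'', etc.) using Python 3.0.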

Background

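I have a lot of plain text files (more than 100). I have to build a single LaTeX document from their content after doing some text processing on them, and I am using Python 3.0 for this. I can get everything else working (escaping characters, sections, etc.), but I cannot get the quotation marks right.
I can find the pattern with a regex (as described here), but how do I replace it with the given pattern? I don't know how to use the re.sub() function in this case, because my string may contain multiple instances of quotes. There is this related question, but how do I implement it in Python?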

Answer #1 (anauzrmj)

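Design considerations

1. I have only considered the regular "double-quotes" and 'single-quotes'; there may be other quotation marks (see this question).
2. A LaTeX closing quote is also a single quote. We do not want to capture a LaTeX double closing quote (e.g. ``LaTeX double-quote'') and mistake it for a single quote (around nothing).
3. Contractions and the possessive 's contain single quotes (e.g. don't, John's). These are characterised by alphabetic characters on both sides of the quote.
4. Regular nouns (plural possession) have a single quote after the word (e.g. the actresses' roles).

Solution
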
import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string #Hack (see explanations)
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

Testing

with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:

        # Test for commutativity: the order of the two conversions must not matter
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))

        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)

Input file (test.txt):

# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'

Output (output.txt):

# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'

(* The comment markers were prepended to stop the post from formatting the output! *)
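
The listed input/output pairs can also be checked in memory, without test.txt. The following is a minimal sketch that repeats the two functions from the solution; the wrapper name texify and the selection of sample lines are assumptions made for illustration.

```python
import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string  # pad so the look-behind also sees a line-initial quote
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

def texify(line):
    # Hypothetical convenience wrapper: single quotes first, then double quotes
    return texify_double_quote(texify_single_quote(line))

# (input, expected) pairs taken from the listing, without the leading "# "
cases = [
    ("I'm a 'single' person", "I'm a `single' person"),
    ('I\'m a "double" person?', "I'm a ``double'' person?"),
    ("Ownership for plural words; the peoples' 'rights'",
     "Ownership for plural words; the peoples' `rights'"),
]

for given, expected in cases:
    assert texify(given) == expected, (given, texify(given))
```

Keeping the pairs in a list makes it easy to add further lines from the input file as regression cases.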

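Explanation

We will break down the regex pattern (?<=\s)'(?!')(.*?)'

Summary: (?<=\s)'(?!') handles the opening single quote, while (.*?) handles the content inside the quotes.

  • (?<=\s)' is a positive look-behind and only matches a single quote that is preceded by whitespace (\s). This is important to prevent matching contractions such as can't (considerations 3 and 4).
  • '(?!') is a negative look-ahead and only matches a single quote that is not followed by another single quote (consideration 2).
  • As described in this answer, the pattern (.*?) captures the content between the quotes, and \1 contains the captured content.
  • The reason for the "Hack" in_string = ' ' + in_string is that the positive look-behind does not catch a single quote at the very start of a line. Prepending a space to every line (and removing it again on return with slicing, return re.sub(...)[1:]) solves this problem!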
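
As an alternative to the padding hack, the look-behind can be written as a negative look-behind for a non-whitespace character, (?<!\S), which also succeeds at the very start of the string. This is a sketch of that variant, not the answer's original code; the function name texify_single_quote_nopad is an assumption.

```python
import re

def texify_single_quote_nopad(in_string):
    # (?<!\S) means "not preceded by a non-whitespace character":
    # it holds after whitespace and also at the very start of the string,
    # so no leading space has to be added and sliced off again.
    return re.sub(r"(?<!\S)'(?!')(.*?)'", r"`\1'", in_string)

print(texify_single_quote_nopad("'single' and don't"))
```

The apostrophe in don't is still skipped, because it is preceded by a letter.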

Answer #2 (wn9m85ua)

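Regular expressions are great for some tasks, but they are still limited (read this for more information). Writing a parser for this task seems less error-prone.
I created a simple function for this task and added comments. If you still have questions about the implementation, please ask.
Code (online version here):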

the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''

def convert_quotes(txt, quote_type):
    # find all quotes
    quotes_pos = []
    idx = -1

    while True:
        idx = txt.find(quote_type, idx+1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace each opening quote with ``
    new_txt = []
    last_pos = -1

    for i, pos in enumerate(quotes_pos):
        # skip the quotes at odd indices (the closing quotes) - they are not replaced
        if i % 2 == 1:
            continue
        new_txt += txt[last_pos+1:pos]
        new_txt += '``'
        last_pos = pos

    # append the last part of the string
    new_txt += txt[last_pos+1:]

    return ''.join(new_txt)

print(convert_quotes(convert_quotes(the_text, '\''), '"'))

Printed output:

This is my ``test" String
This is my ``test' String
This is my ``test' String
This is my ``test" String which has ``two" quotes
This is my ``test' String which has ``two' quotes
This is my ``test' String which has ``two" quotes
This is my ``test" String which has ``two' quotes
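
Note: parsing nested quotes is ambiguous.

For example, the string "bob said: "alice said: hello"" is nested when read as natural language,
but the string "bob said: hi" and "alice said: hello" is not nested.
If this is your case, you may want to first parse those nested quotes into different quote characters, or use parentheses () to disambiguate the nested quotes.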
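
In the printed output only the opening quotes are converted; the closing quotes stay as they were. If the closing double quote should also become '', one option is to pass the opening and closing replacements explicitly. The sketch below assumes that quotes strictly alternate between opening and closing and that the text contains no apostrophes; the name convert_quote_pairs and its parameters are hypothetical.

```python
def convert_quote_pairs(txt, quote_char, opening, closing):
    # Splitting on the quote character yields one more part than there are quotes,
    # so an even number of parts means an odd (unbalanced) number of quotes.
    parts = txt.split(quote_char)
    if len(parts) % 2 == 0:
        raise ValueError('bad number of quotes of type %s' % quote_char)

    out = [parts[0]]
    for i, part in enumerate(parts[1:]):
        # quotes alternate: even index = opening, odd index = closing
        out.append(opening if i % 2 == 0 else closing)
        out.append(part)
    return ''.join(out)

text = 'This is my "test" String which has \'two\' quotes'
# Single quotes go first: the '' produced for closing double quotes
# would otherwise be counted as single quotes in the second pass.
text = convert_quote_pairs(text, "'", "`", "'")
text = convert_quote_pairs(text, '"', "``", "''")
print(text)
```

The nesting ambiguity described in the note applies to this variant as well, since it only counts quotes and does not look at their context.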

Answer #3 (6mzjoqzu)

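I searched countless web pages trying to find a simple answer. Almost every solution I have seen assumes that quotes come in pairs. That can be a problem when writing long-form prose: an extended quotation may have only an opening quote and no closing quote. With single quotes there is also the apostrophe problem. On top of that, my quotations can span multiple lines. Maybe I am being naive, but here is my breakdown of the approach. It only works for English.
1. Replace each double quote at the start of a word with a double backtick.
2. After that replacement, change all remaining double quotes to two single quotes.
3. Replace each single quote at the start of a word with a single backtick. All other single quotes, including apostrophes inside words, can stay unchanged.
I think this is simple and elegant, but have I missed any edge cases?
Here is my fix_quotes function, which seems to work in all the cases raised above.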

import re

def fix_quotes(s):
    """
    Replace single and double quotes with their corresponding LaTeX-style
    equivalents, except for apostrophes which are left unchanged.
    """
    # Replace opening and closing double quotes with LaTeX-style equivalents
    s = re.sub(r'\B"\b', '``', s).replace('"',"''")
    # Replace opening single quote with LaTeX-style equivalents
    s = re.sub(r"\B'\b", '`', s)
    
    return s
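
A few sample inputs make the behaviour of the two \B...\b patterns concrete. This sketch repeats fix_quotes so that it runs on its own; the sample sentences are made up for illustration.

```python
import re

def fix_quotes(s):
    # Same logic as the function in the answer
    s = re.sub(r'\B"\b', '``', s).replace('"', "''")
    s = re.sub(r"\B'\b", '`', s)
    return s

samples = [
    '"Hello," she said. "It\'s John\'s dog."',  # paired quotes plus apostrophes
    '"An opening quote that is never closed',    # unpaired opening quote
    "He said 'yes' twice",                       # single quotes
    '"(see above)"',                             # opening quote followed by punctuation
]
for sample in samples:
    print(fix_quotes(sample))
```

The last sample shows one edge case: an opening quote that is directly followed by a non-word character such as ( is not recognised as opening and becomes ''. A leading apostrophe, as in 'tis, is turned into a backtick because it looks the same as an opening single quote to the pattern.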
