regex: How to detect sections and subsections with a regular expression

a7qyws3x  posted on 2023-08-08  in  Other

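I want to extract the sections and subsections of an article along with their content. The code below uses a regular expression, but it does not detect the subsections. How can I fix this?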

import re

def extract_sections_and_content(text):
    pattern = r'(?P<section>^\d+(?:\.\d+)*(?:\s+\S+)?\s+[A-Za-z]+)\n(?P<content>(?:(?!\d+(?:\.\d+)*(?:\s+\S+)?\s+[A-Za-z]+|\n\d+(?:\.\d+)*(?:\s+\S+)?\s+[A-Za-z]+).)*)'
    matches = re.finditer(pattern, text, re.DOTALL | re.MULTILINE)
    section_content_pairs = [(re.sub(r'^\d+(?:\.\d+)*(?:\s+\S+)?\s+', '', match.group('section').strip()), match.group('content').strip()) for match in matches]
    return dict(section_content_pairs)

# Example usage:
text = """
1 Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2 Background
Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.
2.1.2 Introduction to Topic
Vestibulum nec lorem eu ligula faucibus cursus.
3 Conclusion
Sed sed malesuada magna, at dignissim quam.
"""

result = extract_sections_and_content(text)
print(result)

The result I get:

{'Introduction': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 
 'Background': 'Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.', 
 'Conclusion': 'Sed sed malesuada magna, at dignissim quam.'}


The result I want:

{'Introduction': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
 'Background': 'Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.',
 'Introduction to Topic': 'Vestibulum nec lorem eu ligula faucibus cursus.',
 'Conclusion': 'Sed sed malesuada magna, at dignissim quam.'}

nr7wwzry1#

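Your code is close, but the regex pattern has problems. The named capture group for the section heading is too restrictive, and the negative lookahead assertion stops the pattern from matching the content of sections that have subsections.
Try this: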

import re

text = """
1 Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2 Background
Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.
2.1.2 Introduction to Topic
Vestibulum nec lorem eu ligula faucibus cursus.
3 Conclusion
Sed sed malesuada magna, at dignissim quam.
4 addendum
the details of part 1 can be found below.
"""

pattern = r'(?:(?P<SectionTitle>\d+(\.\d+)* [^\n]+)\n(?P<SectionText>.*?))(?=\n\d+(\.\d+)* [^\n]+|\Z)'

matches = re.finditer(pattern, text, re.DOTALL | re.MULTILINE)

# Split only on the first space so multi-word titles such as "Introduction to Topic" stay whole
result = {match.group('SectionTitle').split(' ', 1)[1]: match.group('SectionText').strip() for match in matches}

print(result)
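
As a hedged variant of the pattern above, the sketch below wraps the same idea in a function, anchors every heading at the start of a line with `^`, and captures the title text in its own group so no splitting is needed. The names `extract_sections`, `SECTION`, `title` and `body` are illustrative and not part of the original answer. It assumes that any line starting with a dotted number followed by a space or tab is a heading.

```python
import re

# Heading line: a number such as "2" or "2.1.2", spaces/tabs, then the title.
# Body: everything up to the next heading line or the end of the text.
SECTION = re.compile(
    r'^\d+(?:\.\d+)*[ \t]+(?P<title>[^\n]+)\n'
    r'(?P<body>.*?)(?=^\d+(?:\.\d+)*[ \t]+\S|\Z)',
    re.DOTALL | re.MULTILINE,
)

def extract_sections(text):
    """Return a dict mapping each section title to its stripped body text."""
    return {m.group('title').strip(): m.group('body').strip()
            for m in SECTION.finditer(text)}

sample = """
1 Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2 Background
Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.
2.1.2 Introduction to Topic
Vestibulum nec lorem eu ligula faucibus cursus.
3 Conclusion
Sed sed malesuada magna, at dignissim quam.
4 addendum
the details of part 1 can be found below.
"""

sections = extract_sections(sample)
print(sections)
```

Because the lookahead only fires at the start of a line, a number inside running text (the "part 1" in the addendum) does not end a section early. Duplicate titles still overwrite each other, since the result is a plain dict.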


brgchamk2#

You can replace every match of

r'^\d+(?:\.\d+)* (.*)\r?\n(.*)'

(with the g and m flags set) with '\1': '\2', and then wrap the resulting string in braces ({...}).
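
In Python the same idea can be sketched without assembling a string by hand: `re.findall` with `re.MULTILINE` (the counterpart of the `m` flag; `findall` already returns every match, so no `g` flag is needed) yields (title, content) pairs that can be turned into a dict directly. Like the pattern above, this assumes each heading is followed by exactly one line of content. The names `PAIR` and `sections_to_dict` are illustrative.

```python
import re

# Group 1 is the title after the section number; group 2 is the next line.
PAIR = re.compile(r'^\d+(?:\.\d+)* (.*)\r?\n(.*)', re.MULTILINE)

def sections_to_dict(text):
    """Map each heading's title to the single line that follows it."""
    # strip() also removes a trailing '\r' that '.' may have captured.
    return {title.strip(): content.strip()
            for title, content in PAIR.findall(text)}

text = """
1 Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2 Background
Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.
2.1.2 Introduction to Topic
Vestibulum nec lorem eu ligula faucibus cursus.
3 Conclusion
Sed sed malesuada magna, at dignissim quam.
"""

pairs = sections_to_dict(text)
print(pairs)
```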
