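I want to extract the sections and subsections of an article, together with their content. The code below uses a regular expression, but it fails to detect subsections. How can I fix this?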
import re
def extract_sections_and_content(text):
    pattern = r'(?P<section>^\d+(?:\.\d+)*(?:\s+\S+)?\s+[A-Za-z]+)\n(?P<content>(?:(?!\d+(?:\.\d+)*(?:\s+\S+)?\s+[A-Za-z]+|\n\d+(?:\.\d+)*(?:\s+\S+)?\s+[A-Za-z]+).)*)'
    matches = re.finditer(pattern, text, re.DOTALL | re.MULTILINE)
    section_content_pairs = [(re.sub(r'^\d+(?:\.\d+)*(?:\s+\S+)?\s+', '', match.group('section').strip()), match.group('content').strip()) for match in matches]
    return dict(section_content_pairs)
# Example usage:
text = """
1 Introduction
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
2 Background
Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.
2.1.2 Introduction to Topic
Vestibulum nec lorem eu ligula faucibus cursus.
3 Conclusion
Sed sed malesuada magna, at dignissim quam.
"""
result = extract_sections_and_content(text)
print(result)
The result I get:
{'Introduction': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
'Background': 'Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.',
'Conclusion': 'Sed sed malesuada magna, at dignissim quam.'}
The result I want:
{'Introduction': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
'Background': 'Praesent euismod, arcu quis fermentum pulvinar, urna ex euismod ex.',
'Introduction to Topic': 'Vestibulum nec lorem eu ligula faucibus cursus.',
'Conclusion': 'Sed sed malesuada magna, at dignissim quam.'}
2 Answers
Answer 1 (nr7wwzry)
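Your code is close, but the regular expression has two problems. The named capture group for the section heading is too restrictive: it only accepts a title of one or two words before the newline, so a heading such as "2.1.2 Introduction to Topic" never matches. In addition, the negative lookahead stops the content at the next numbered line, so a section that is followed by a subsection loses that subsection's text instead of yielding it as a separate entry.
Try this: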
Answer 2 (brgchamk)
You can replace each match of the regular expression (with the g and m flags set) with '\1': '\2', and then enclose the resulting string in braces ({...}).