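I have a long text, about 10k characters, that contains many sections. I need to split the text into chunks by section, so that each chunk contains exactly one section. In the text template, a section is marked by a heading that starts with "SECTION n" or "RUBRIQUE n", where n is the section number.
Here is my attempt: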
import re

def get_text_chunks(text):
    section_pattern = r"(SECTION|RUBRIQUE) \d+: .+"
    section_headings = re.findall(section_pattern, text)
    chunks = re.split(section_pattern, text)
    return chunks

long_text = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.2. Another Classification
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
"""
chunks = get_text_chunks(long_text)
for chunk in chunks:
    print(chunk)
    print("-----------------------")

But I get this output:
This text should be ignored.
-----------------------
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
-----------------------
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
-----------------------
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------
instead of this output:
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------
PS: My input text does not start with a SECTION/RUBRIQUE heading on its first line, so the text before the first heading should be ignored.
1 Answer
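You can use a look-ahead so that the delimiter has zero width: the heading is then not consumed by the split and stays with its chunk. Also, don't use a capturing group, because re.split adds the text captured by each group to the output list as extra elements: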
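A minimal sketch of that approach, assuming every heading starts at the beginning of a line and that Python 3.7 or later is used (earlier versions do not split on zero-width matches). The multiline anchor and the shortened sample text are illustrative choices, not necessarily the exact code originally posted:

```python
import re

def get_text_chunks(text):
    # Zero-width delimiter: match the position at the start of a line that is
    # followed by a heading. (?:...) is a non-capturing group, so re.split
    # adds no extra elements to the result.
    section_pattern = r"^(?=(?:SECTION|RUBRIQUE) \d+: )"
    # [1:] drops whatever precedes the first heading.
    return re.split(section_pattern, text, flags=re.MULTILINE)[1:]

# Shortened sample in the same shape as the question's input.
sample = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION
1.1. Product identifier
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
SECTION 3: COMPOSITION
3.1. Substances
"""

chunks = get_text_chunks(sample)
for chunk in chunks:
    print(chunk, end="")
    print("-----------------------")
```

Each chunk begins with its own heading because the look-ahead only checks for the heading without consuming it.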
With [1:], the text before the first occurrence of the delimiter is ignored.