Chunking text based on a regex expression

4ioopgfo posted 12 months ago in Other

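I have a long text, about 10k characters, that contains many sections. I need to split the text into chunks based on those sections, so that each chunk contains exactly one section. In the text, every section begins with a heading of the form "SECTION n" or "RUBRIQUE n", where n is the section number.
Here is my attempt: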

import re

def get_text_chunks(text):
    section_pattern = r"(SECTION|RUBRIQUE) \d+: .+"
    section_headings = re.findall(section_pattern, text)
    chunks = re.split(section_pattern, text)

    return chunks

long_text = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.2. Another Classification
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
"""

chunks = get_text_chunks(long_text)
for chunk in chunks:
    print(chunk)
    print("-----------------------")

But I get this output:

This text should be ignored.
-----------------------
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
-----------------------
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
-----------------------
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------

instead of this output:

RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION

2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------

P.S. My input text does not begin with a SECTION or RUBRIQUE heading on its first line, so that leading part should be ignored.

rlcwz9us #1

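You can use a look-ahead so that the delimiter has zero width and the heading is not consumed by the split. Also, don't use a capturing group, because re.split adds the captured text to the output list as extra elements: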

    # Zero-width look-ahead with a non-capturing group: the heading stays in its chunk
    section_pattern = r"(?=(?:SECTION|RUBRIQUE) \d+: .+)"
    chunks = re.split(section_pattern, text)[1:]

The [1:] slice drops the text that comes before the first occurrence of the delimiter.
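
For reference, here is how the two lines above might fit into the complete function, as a minimal sketch rather than a definitive implementation. The name SECTION_START, the shortened sample text, and the strip() call that trims the blank lines around each chunk are my own additions and not part of the question. The with_group line is only there to show the extra elements a capturing group produces.

```python
import re

# Zero-width look-ahead: matches the position just before a heading,
# so the heading text stays in the chunk that follows it.
# SECTION_START is a hypothetical name chosen for this sketch.
SECTION_START = re.compile(r"(?=(?:SECTION|RUBRIQUE) \d+: .+)")


def get_text_chunks(text):
    """Return one chunk per section, dropping any text before the first heading."""
    # Splitting on an empty (zero-width) match requires Python 3.7 or later.
    chunks = SECTION_START.split(text)[1:]
    # Assumption: surrounding blank lines are not wanted in the chunks.
    return [chunk.strip() for chunk in chunks]


# Shortened stand-in for the text in the question.
sample = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION
1.1. Product identifier

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
"""

# With a capturing group, the captured keyword shows up as its own list element
# and the rest of the heading line is lost from the chunks.
with_group = re.split(r"(SECTION|RUBRIQUE) \d+: .+", sample)

chunks = get_text_chunks(sample)
for chunk in chunks:
    print(chunk)
    print("-----------------------")
```

Note that the look-ahead is not anchored to the start of a line, so a heading-like phrase in the middle of a line would also start a new chunk. If that can happen in your text, prefixing the pattern with ^ and compiling it with re.MULTILINE should restrict the split to line starts.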
