regex: Regular expression for report section headings and subheadings

q43xntqr  posted on 2023-08-08  in Other

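Edited with corner cases.
I need to write a regular expression that matches section headings and subheadings in the PDF text below. A heading is a number followed by a period, then a title of one or more words in all capital letters.
(1. SCOPE) or (2. APPLICABLE DOCUMENTS) would be considered a heading.
The body paragraphs under a heading contain subheadings: a number followed by further numbers in ascending order, then a title that usually ends with a period.
(1.1 Scope), (1.1.1 Classification), (1.2 Definitions), and (1.3 Abbreviations, Acronyms, and Symbols) are subheadings.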

100
07482
1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.1.1 Classification. This specification contains the following ...
1.2 Definitions....
Acceptance Test Procedure – The procedure which defines those tests 
1.3 Abbreviations, Acronyms, and Symbols.

I need a regular expression that matches the headings and, if possible, the subheadings.
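When I run the code above, it matches the headings and subheadings, but it also matches any text that starts with a number and ends with a space or with a word after a space.
What I am using now is shown below. It matches the section number and the first word. It works for sections 1.1, 1.1.1, and 1.2, but not for a section whose title has several words, such as 1.3.
^(\d+(?:.\d+))\s([a-zA-Z0-9_ ]
Any suggestions are welcome.
I tried:
^(\d+(?:.\d+))\s([a-zA-Z0-9_ ]
I expected it to match "1.3 Abbreviations, Acronyms, and Symbols." for section 1.3, but it only matched "1.3 Abbreviations".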

6qfn3psc  #1

Try (regex101):

^\d+(?:\.\d+)*\.?\s*[^.\n\r]+\.?


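  • \d+(?:\.\d+)* - matches a number followed by any number of dot+number groups
  • \.? - optionally matches a dot
  • \s* - matches optional whitespace
  • [^.\n\r]+\.? - matches everything up to the next dot or newline, then an optional dot

import re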

text = '''\
1. SCOPE
1.1 Scope. This specification establishes general requirements for Supplier-designed components. A comprehensive definition of the item to be developed will be contained in the Component Detail Design Specification invoking this general specification.
1.1.1 Classification. This specification contains the following class(es). Unless otherwise specified, the requirements herein apply to all Classes.
CLASS A
1.2 Definitions. For the purpose of this specification, the following definitions shall apply:
Acceptance Test Procedure – The procedure which defines those tests performed on the component after manufacture to demonstrate conformance to the specified acceptance requirements and to verify manufacture quality.
Ambient Bay – Atmosphere inside the propulsion system bay.
Blade Out – Used to describe the event when a fan, compressor, or turbine blade liberates from the rotor.
1.3 Abbreviations, Acronyms, and Symbols.'''

for title in re.findall(r'^\d+(?:\.\d+)*\.?\s*[^.\n\r]+\.?', text, flags=re.M):
    print(title)


Prints:

1. SCOPE
1.1 Scope.
1.1.1 Classification.
1.2 Definitions.
1.3 Abbreviations, Acronyms, and Symbols.
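
If the section number and the title are needed separately, the same pattern can be given named groups. The following is only a sketch: the names `HEADING_RE` and `find_headings` are hypothetical and not part of the answer, and the nesting level is assumed to be the number of dot-separated parts in the section number.

```python
import re

# Same pattern as above, with named groups added (group names are illustrative).
HEADING_RE = re.compile(
    r'^(?P<num>\d+(?:\.\d+)*)\.?\s*(?P<title>[^.\n\r]+)\.?',
    flags=re.M,
)

def find_headings(text):
    """Return (level, number, title) for every line that looks like a heading."""
    results = []
    for m in HEADING_RE.finditer(text):
        number = m.group('num')
        level = number.count('.') + 1  # "1" -> 1, "1.1" -> 2, "1.1.1" -> 3
        results.append((level, number, m.group('title').strip()))
    return results

sample = '''\
1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.1.1 Classification. This specification contains the following ...
CLASS A
1.2 Definitions. For the purpose of this specification ...
1.3 Abbreviations, Acronyms, and Symbols.
'''

# Indent each entry by its level to show the outline structure.
for level, number, title in find_headings(sample):
    print('  ' * (level - 1) + number, title)
```

Entries with level 1 correspond to headings and deeper levels to subheadings, so the two can be separated with a simple comparison on `level`.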


Edit: with the updated input (regex101):

import re

text = '''\
100
07482
1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.1.1 Classification. This specification contains the following ...
1.2 Definitions....
Acceptance Test Procedure – The procedure which defines those tests
1.3 Abbreviations, Acronyms, and Symbols.
'''

for title in re.findall(r'^\d+\.[\d.]* [^.\n\r]+\.?', text, flags=re.M):
    print(title)


Prints:

1. SCOPE
1.1 Scope.
1.1.1 Classification.
1.2 Definitions.
1.3 Abbreviations, Acronyms, and Symbols.
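
To see why the pattern changed for the updated input, the two patterns can be run side by side. In the first pattern everything after the leading digits is optional, and `\s*` can also consume a newline, so bare number lines such as `100` are picked up. The second pattern requires a literal dot after the first number and a space before the title. A small sketch of that comparison (the names `FIRST` and `REVISED` are illustrative only):

```python
import re

FIRST = re.compile(r'^\d+(?:\.\d+)*\.?\s*[^.\n\r]+\.?', re.M)
REVISED = re.compile(r'^\d+\.[\d.]* [^.\n\r]+\.?', re.M)

text = '''\
100
07482
1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.2 Definitions....
1.3 Abbreviations, Acronyms, and Symbols.
'''

first_matches = FIRST.findall(text)
revised_matches = REVISED.findall(text)

# Matches produced only by the first pattern are the unwanted ones.
stray = [m for m in first_matches if m not in revised_matches]

print('only matched by the first pattern:', stray)
print('matched by the revised pattern:', revised_matches)
```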

hgtggwj0  #2

On Windows, given the text output of pdftotext,
where the txt is

100
07482
1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.1.1 Classification. This specification contains the following ...
1.2 Definitions....
Acceptance Test Procedure – The procedure which defines those tests 
1.3 Abbreviations, Acronyms, and Symbols.
20 milliamperes

and so on,
type pdfout.txt |findstr /R "^[0-9]\." >headings.txt
gives a clean result:

1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.1.1 Classification. This specification contains the following ...
1.2 Definitions....
1.3 Abbreviations, Acronyms, and Symbols.


This is crude but workable. For a wider range, use type pdfout.txt |findstr /R "^[0-9]*\."

1. SCOPE
1.1 Scope. This specification establishes general requirements ...
1.1.1 Classification. This specification contains the following ...
1.2 Definitions....
1.3 Abbreviations, Acronyms, and Symbols.
10. Ten
10.1 more..
100. Hundreds too


So, to handle a PDF by drag and drop, a shortcut command like this will do:
pdftotext -raw "%~1" pdfout.txt&&type pdfout.txt |findstr /R "^[0-9]*\.">headings.txt
Crude, but it works for most cases, since we only need the raw entries.
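
For readers not on Windows, the same line filter can be approximated in Python. This is only a sketch of the idea, not part of the answer: `LINE_RE` and `heading_lines` are hypothetical names, and the pattern is assumed to mirror the findstr expression `^[0-9]*\.` (optional leading digits, then a literal dot).

```python
import re

# Assumed equivalent of findstr /R "^[0-9]*\." ; re.match anchors at line start.
LINE_RE = re.compile(r'[0-9]*\.')

def heading_lines(lines):
    """Keep only the lines that start with optional digits followed by a dot."""
    return [line.rstrip('\n') for line in lines if LINE_RE.match(line)]

sample = [
    "100\n",
    "07482\n",
    "1. SCOPE\n",
    "1.1 Scope. This specification establishes general requirements ...\n",
    "Acceptance Test Procedure - The procedure which defines those tests\n",
    "20 milliamperes\n",
    "10. Ten\n",
    "100. Hundreds too\n",
]

for line in heading_lines(sample):
    print(line)
```

Because `[0-9]*` also accepts zero digits, a line that begins with a dot would pass this filter too; `[0-9]+` is the stricter choice if that matters. The function takes any iterable of lines, so an open file such as pdfout.txt could be passed in place of `sample`.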
