Returning lines with regex in Python: some lines are skipped

4szc88ey  posted on 2023-02-14 in Python

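I am new to Python and to programming in general. My code extracts data from a PDF, then iterates over the returned lines and pulls out only those that match a regular expression. Most lines have data on the following line that needs to be appended to the current line being iterated. The code I wrote does this successfully, except when a line has no data on the following line to append: the current line is returned, but the next line is skipped.
What can I add or change to make sure it does not skip lines?
An example of the lines to iterate over is shown below: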

#Input text

CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05
MUTUAL/IBS /GAL /0000010318908
IB TRANSFER TO 2,000.00- 0323 130.05
578441575425   10H32 28662338
FEE-INTER ACCOUNT TRANSFER ## 5.50- 0323 124.55
8419752
IB PAYMENT FROM 9,000.00 0325 9,124.55
JENNIFER LIVINGSTONE
IB PAYMENT FROM 1,000.00 0401 10,124.55
JENNIFER LIVINGSTONE
MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55 (This line has no description in the following line)
CREDIT TRANSFER 9,000.00 0401 19,103.55 (This line gets skipped)
ABSA BANK rent
IB TRANSFER TO 19,000.00- 0403 103.55
578441575425   11H45 286623383

import re
import pdfplumber

bpdf = 'test pdf.pdf'

with pdfplumber.open(bpdf) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

new_trn_line = re.compile(r'(\D+)(\d.*) (\d.*) (\d.*\.\d{2})')

def transactions(sentences):
    for lines in sentences.split('\n'):
        yield lines

my_list = transactions(text)

my_data = []

for each_line in my_list:
    if new_trn_line.search(each_line):
        my_next_line = next(my_list)
        if not new_trn_line.search(my_next_line):
            my_data.append(new_trn_line.search(each_line).group(1) + my_next_line + " " +
            new_trn_line.search(each_line).group(2) + " " + new_trn_line.search(each_line).group(3))

    elif re.search(new_trn_line,text):
            my_data.append(each_line)
    else:
        continue

my_data
#Output
['CATS THIRD PARTY PAYMENT MUTUAL/IBS /GAL /0000010318908 1,664.58 0320',
'IB TRANSFER TO 578441575425   10H32 286623383 2,000.00- 0323',
'FEE-INTER ACCOUNT TRANSFER ## 8419752 5.50- 0323',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 9,000.00 0325',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 1,000.00 0401',
'MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55',
'IB TRANSFER TO 578441575425   11H45 286623383 19,000.00- 0403']

If you compare this with the input, you will see that CREDIT TRANSFER 9,000.00 0401 19,103.55 was skipped.


cwtwac6a1#

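The pattern written as (\D+)(\d.*) (\d.*) (\d.\*.\d{2}) does not currently match, because \* matches a literal asterisk, which does not occur in the example data.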
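As a quick check of the point about \*, the sketch below runs both spellings of the pattern against the first sample line. The variable names are only illustrative; the unescaped spelling is the one that appears in the question's code listing.

```python
import re

line = "CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05"

# Spelling with an escaped asterisk: \* requires a literal "*" in the text.
escaped = re.compile(r'(\D+)(\d.*) (\d.*) (\d.\*.\d{2})')
# Spelling from the question's code: .* is "any character, repeated".
unescaped = re.compile(r'(\D+)(\d.*) (\d.*) (\d.*\.\d{2})')

escaped_match = escaped.search(line)
unescaped_groups = unescaped.search(line).groups()

print(escaped_match)     # None
print(unescaped_groups)  # ('CATS THIRD PARTY PAYMENT ', '1,664.58', '0320', '2,130.05')
```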
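If you have already read the whole content of the .pdf, you could use a single pattern with capture groups and then combine the results.
The pattern matches: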

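  • ^ Start of string (start of each line with re.MULTILINE)
  • ([^\d\n]+) Capture group 1, match 1+ characters other than a digit or a newline
  • (\d[\d,.-]* \d+) Capture group 2, match a digit followed by optional repetitions of digits, commas, dots, or hyphens, then match a space and 1+ digits
  • \d[\d,.]*\.\d{2} Match a digit, then optional digits, commas, and dots, then match a . and 2 digits
  • ( Capture group 3
  • (?: Non-capture group, repeated as a whole part
  • \n(?!.*\b\d[\d,.]*\.\d{2}$) Match a newline and assert that the line does not end with a number whose decimal part is a dot and two digits
  • .* Match the whole line
  • )* Close the non-capture group and optionally repeat it, so that a line without a following description line still matches on its own
  • ) Close group 3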

See a regex101 demo.
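To make the group breakdown concrete, here is a small sketch that applies the pattern to three of the sample lines and prints the captured groups. The `sample` and `groups` names are only illustrative.

```python
import re

pattern = r"^([^\d\n]+)(\d[\d,.-]* \d+) \d[\d,.]*\.\d{2}((?:\n(?!.*\b\d[\d,.]*\.\d{2}$).*)*)"

# One transaction with a description line, one without.
sample = (
    "IB PAYMENT FROM 9,000.00 0325 9,124.55\n"
    "JENNIFER LIVINGSTONE\n"
    "MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55"
)

# Collect (group 1, group 2, group 3) for every transaction found.
groups = [m.groups() for m in re.finditer(pattern, sample, re.MULTILINE)]
for g in groups:
    print(g)
# ('IB PAYMENT FROM ', '9,000.00 0325', '\nJENNIFER LIVINGSTONE')
# ('MONTHLY MANAGEMENT FEE ## ', '21.00- 0331', '')
```

Group 3 keeps its leading newline, and it is an empty string when no description line follows; the balance at the end of the line is matched but not captured.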

import re

pattern = r"^([^\d\n]+)(\d[\d,.-]* \d+) \d[\d,.]*\.\d{2}((?:\n(?!.*\b\d[\d,.]*\.\d{2}$).*)*)"
my_data = []
text = ("CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05\n"
            "MUTUAL/IBS /GAL /0000010318908\n"
            "IB TRANSFER TO 2,000.00- 0323 130.05\n"
            "578441575425   10H32 28662338\n"
            "FEE-INTER ACCOUNT TRANSFER ## 5.50- 0323 124.55\n"
            "8419752\n"
            "IB PAYMENT FROM 9,000.00 0325 9,124.55\n"
            "JENNIFER LIVINGSTONE\n"
            "IB PAYMENT FROM 1,000.00 0401 10,124.55\n"
            "JENNIFER LIVINGSTONE\n"
            "MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55\n"
            "CREDIT TRANSFER 9,000.00 0401 19,103.55\n"
            "ABSA BANK rent\n"
            "IB TRANSFER TO 19,000.00- 0403 103.55\n"
            "578441575425   11H45 286623383")

matches = re.finditer(pattern, text, re.MULTILINE)

for match in matches:
    my_data.append(f"{match.group(1)}{match.group(3).strip()} {match.group(2)}")

print(my_data)

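Note that it is unclear why, in the example output, only the MONTHLY MANAGEMENT FEE line ends with 10,103.55, since all the other output lines end with the 4-digit value.
Output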

[
'CATS THIRD PARTY PAYMENT MUTUAL/IBS /GAL /0000010318908 1,664.58 0320',
'IB TRANSFER TO 578441575425   10H32 28662338 2,000.00- 0323',
'FEE-INTER ACCOUNT TRANSFER ## 8419752 5.50- 0323',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 9,000.00 0325',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 1,000.00 0401',
'MONTHLY MANAGEMENT FEE ##  21.00- 0331',
'CREDIT TRANSFER ABSA BANK rent 9,000.00 0401',
'IB TRANSFER TO 578441575425   11H45 286623383 19,000.00- 0403'
]
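As for why the original loop drops a line: next(my_list) takes the following line out of the generator, and when that line is itself a transaction line (as CREDIT TRANSFER is after MONTHLY MANAGEMENT FEE), it is neither appended nor seen again by the for loop. If you prefer to keep the line-by-line approach instead of a single multi-line pattern, a minimal sketch could look like the following. It reuses the pattern from the question's code; `merge_transactions` is a hypothetical helper name, and the balance is dropped from every row, which is an assumption about the desired output.

```python
import re

# Pattern from the question: description, amount, date, balance.
new_trn_line = re.compile(r'(\D+)(\d.*) (\d.*) (\d.*\.\d{2})')

def merge_transactions(text):
    """Attach each description line to the transaction line before it."""
    rows = []  # each row: [description, extra lines, amount, date]
    for line in text.split('\n'):
        match = new_trn_line.search(line)
        if match:
            # A new transaction starts; nothing is consumed ahead of time.
            rows.append([match.group(1).strip(), [], match.group(2), match.group(3)])
        elif rows and line.strip():
            # Not a transaction line: treat it as extra description text.
            rows[-1][1].append(line.strip())
    return [' '.join([desc, *extra, amount, date]) for desc, extra, amount, date in rows]

sample = (
    "MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55\n"
    "CREDIT TRANSFER 9,000.00 0401 19,103.55\n"
    "ABSA BANK rent"
)
merged = merge_transactions(sample)
print(merged)
# ['MONTHLY MANAGEMENT FEE ## 21.00- 0331', 'CREDIT TRANSFER ABSA BANK rent 9,000.00 0401']
```

Because every line is inspected exactly once and only then classified, a transaction line can never be swallowed as the description of the previous one.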
