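I'm new to Python and to programming in general. My code extracts the text from a PDF, then iterates over the returned lines and pulls out only the lines that match a regular expression. Most of those lines have extra data on the following line that needs to be appended to the current line. The code I wrote does this successfully, except when a matching line has no extra data on the next line: in that case the current line is returned, but the next line is skipped.
What can I add or change to make sure it doesn't skip lines?
Here is an example of the lines that need to be iterated over: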
#Input text
CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05
MUTUAL/IBS /GAL /0000010318908
IB TRANSFER TO 2,000.00- 0323 130.05
578441575425 10H32 28662338
FEE-INTER ACCOUNT TRANSFER ## 5.50- 0323 124.55
8419752
IB PAYMENT FROM 9,000.00 0325 9,124.55
JENNIFER LIVINGSTONE
IB PAYMENT FROM 1,000.00 0401 10,124.55
JENNIFER LIVINGSTONE
MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55 (This line has no description in the following line)
CREDIT TRANSFER 9,000.00 0401 19,103.55 (This line gets skipped)
ABSA BANK rent
IB TRANSFER TO 19,000.00- 0403 103.55
578441575425 11H45 286623383
import re

import pdfplumber

bpdf = 'test pdf.pdf'
with pdfplumber.open(bpdf) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
print(text)

new_trn_line = re.compile(r'(\D+)(\d.*) (\d.*) (\d.*\.\d{2})')

def transactions(sentences):
    for lines in sentences.split('\n'):
        yield lines

my_list = transactions(text)
my_data = []
for each_line in my_list:
    if new_trn_line.search(each_line):
        # Fetch the following line, which is expected to hold the description
        my_next_line = next(my_list)
        if not new_trn_line.search(my_next_line):
            my_data.append(new_trn_line.search(each_line).group(1) + my_next_line + " " +
                           new_trn_line.search(each_line).group(2) + " " + new_trn_line.search(each_line).group(3))
        elif re.search(new_trn_line, text):
            my_data.append(each_line)
    else:
        continue
my_data
#Output
['CATS THIRD PARTY PAYMENT MUTUAL/IBS /GAL /0000010318908 1,664.58 0320',
'IB TRANSFER TO 578441575425 10H32 286623383 2,000.00- 0323',
'FEE-INTER ACCOUNT TRANSFER ## 8419752 5.50- 0323',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 9,000.00 0325',
'IB PAYMENT FROM JENNIFER LIVINGSTONE 1,000.00 0401',
'MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55',
'IB TRANSFER TO 578441575425 11H45 286623383 19,000.00- 0403']
If you compare this with the input, you will see that CREDIT TRANSFER 9,000.00 0401 19,103.55 was skipped.
1 Answer
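The current pattern `(\D+)(\d.*) (\d.*) (\d.\*.\d{2})` does not match, because `\*` matches a literal asterisk, which does not occur in the example data.
If you have already read the whole content of the `.pdf`, you could use a single pattern with capture groups and then combine the results. The pattern matches:
- `^` Start of string
- `([^\d\n]+)` Capture group 1: match 1+ characters other than a digit or a newline
- `(\d[\d,.-]* \d+)` Capture group 2: match a digit followed by optional repetitions of digits, commas, dots or hyphens, then match a space and 1+ digits
- `\d[\d,.]*\.\d{2}` Match a digit, then optional digits, commas and dots, then match `.` and 2 digits
- `(` Capture group 3
  - `(?:` Non-capturing group, repeated as a whole
    - `\n(?!.*\b\d[\d,.]*\.\d{2}$)` Match a newline and assert that the line does not end with a number whose decimal part is a dot followed by 2 digits
    - `.*` Match the whole line
  - `)*` Close the non-capturing group and optionally repeat it, so that only lines belonging to the preceding transaction line are matched
- `)` Close group 3

See the regex101 demo.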
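As a rough sketch of how the pieces listed above could be put together and used: the full pattern below is assembled from those pieces, and both the single space before the balance amount and the `re.MULTILINE` flag are assumptions. The names `sample_text` and `combine_transactions` are hypothetical, and the sample is the question's input without the parenthetical remarks.

```python
import re

# Assembled from the pieces described above. The space before the balance
# amount and the MULTILINE flag are assumptions, not taken from the answer.
pattern = re.compile(
    r"^([^\d\n]+)(\d[\d,.-]* \d+) \d[\d,.]*\.\d{2}"
    r"((?:\n(?!.*\b\d[\d,.]*\.\d{2}$).*)*)",
    re.MULTILINE,
)

# Hypothetical stand-in for the text returned by page.extract_text()
sample_text = """CATS THIRD PARTY PAYMENT 1,664.58 0320 2,130.05
MUTUAL/IBS /GAL /0000010318908
IB TRANSFER TO 2,000.00- 0323 130.05
578441575425 10H32 28662338
FEE-INTER ACCOUNT TRANSFER ## 5.50- 0323 124.55
8419752
IB PAYMENT FROM 9,000.00 0325 9,124.55
JENNIFER LIVINGSTONE
IB PAYMENT FROM 1,000.00 0401 10,124.55
JENNIFER LIVINGSTONE
MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55
CREDIT TRANSFER 9,000.00 0401 19,103.55
ABSA BANK rent
IB TRANSFER TO 19,000.00- 0403 103.55
578441575425 11H45 286623383"""


def combine_transactions(text):
    """Return one string per transaction: description, extra lines, amount and date."""
    rows = []
    for m in pattern.finditer(text):
        description = m.group(1).strip()
        # Group 3 holds zero or more following lines; collapse them to one string
        extra = " ".join(m.group(3).split())
        amount_and_date = m.group(2)
        rows.append(" ".join(p for p in (description, extra, amount_and_date) if p))
    return rows


for row in combine_transactions(sample_text):
    print(row)
```

Because group 3 may be empty, a transaction without a description line (such as MONTHLY MANAGEMENT FEE) does not swallow the line that follows it, so the CREDIT TRANSFER line is kept.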
Note that it is unclear why, in the example output, only `MONTHLY MANAGEMENT FEE` ends with `10,103.55`, since all the other output lines appear to end with the 4 digits.
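If you would rather keep the line-by-line approach from the question, one possible alternative (not part of the answer above) is to look at the next line by index instead of calling `next()` on the same generator, so that nothing is consumed before it has been examined. This is a minimal sketch using the question's own regex; `merge_lines` and `sample` are hypothetical names.

```python
import re

# Regex taken from the question
new_trn_line = re.compile(r'(\D+)(\d.*) (\d.*) (\d.*\.\d{2})')


def merge_lines(text):
    """Append the following description line to each transaction line, if there is one."""
    lines = text.split('\n')
    merged = []
    for i, line in enumerate(lines):
        match = new_trn_line.search(line)
        if not match:
            # Description lines are handled through the look-ahead below
            continue
        # Peek at the next line without consuming it
        next_line = lines[i + 1] if i + 1 < len(lines) else ''
        if next_line and not new_trn_line.search(next_line):
            merged.append(match.group(1) + next_line + ' ' +
                          match.group(2) + ' ' + match.group(3))
        else:
            # No description follows: keep the transaction line as it is
            merged.append(line)
    return merged


sample = ("IB PAYMENT FROM 1,000.00 0401 10,124.55\n"
          "JENNIFER LIVINGSTONE\n"
          "MONTHLY MANAGEMENT FEE ## 21.00- 0331 10,103.55\n"
          "CREDIT TRANSFER 9,000.00 0401 19,103.55\n"
          "ABSA BANK rent")
for row in merge_lines(sample):
    print(row)
```

The loop visits every line exactly once, so a transaction line that directly follows another transaction line is still processed on its own iteration.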