regex: my script can't read a number

z31licg0  posted on 2023-01-18  in  Other

I am trying to read a 'cnpj', which is a number like "30.114.117/0001-64", from a PDF file. Here is my script:

import re
import PyPDF2
import PySimpleGUI as sg

#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))

while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            pdf = PyPDF2.PdfFileReader(f)

            # Collect matches from every page of the PDF
            lista = []
            for p in range(pdf.getNumPages()):

                # Extract the text of the current page
                texto = pdf.getPage(p).extractText()

                # Use regexes to look for CPF and CNPJ numbers
                # (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # Collect the CPF and CNPJ numbers found on this page
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)

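The script works, but the text returned in the variable 'texto' can contain line breaks, for example:
"Your number is 31.111.111
/0001-64"
When the line breaks, the regex cannot find the number. I also tried texto = texto.replace("\n", " "), but that did not find it either. Does anyone have an idea? Maybe another library that can read the PDF.
I want to extract CPF and CNPJ numbers from a PDF, but the text wraps across lines and I cannot extract the numbers.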

c3frrgcw1#

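I recommend using PyMuPDF. It has a number of flags for text extraction, one of which detects hyphenation. If you extract with it, your problem should go away: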

import fitz # PyMuPDF import
doc = fitz.open("your.file")
page = doc[0]  # page 0

text = page.get_text(flags=fitz.TEXT_DEHYPHENATE)

By the way, none of the above depends on the file being a PDF; it also works for XPS, EPUB, etc.
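
If the extracted text still has a line break inside a number, it may help to see why texto.replace("\n", " ") does not fix the match: it turns "31.111.111\n/0001-64" into "31.111.111 /0001-64", and the pattern allows no space before the slash. The sketch below removes the whitespace instead of replacing it. It is only an illustration: the helper name `close_gaps` and the sample string are made up here, and it assumes the break always falls right next to a '.', '/' or '-' separator.

```python
import re

# Whitespace (including newlines) that sits between a digit and a separator,
# or between a separator and a digit.
_GAP = re.compile(r'(?<=\d)\s+(?=[./-])|(?<=[./-])\s+(?=\d)')

def close_gaps(texto):
    """Remove whitespace that splits a number around '.', '/' or '-'."""
    return _GAP.sub('', texto)

# Hypothetical sample text with both numbers broken across lines
texto = "Your number is 31.111.111\n/0001-64 and the CPF is 123.456.789\n-10"
limpo = close_gaps(texto)

# The original patterns from the question now match on the cleaned text
cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', limpo)
cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', limpo)
```

Note that this is fairly aggressive: it also joins unrelated text such as "3 - 4" into "3-4", so it is best applied to a copy of the text used only for matching.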

fae0ux8s2#

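The PyPDF2 you are using is deprecated. Please move to pypdf.
You have not shared the PDF file, so I cannot check whether this actually solves your problem. However, compared with PyPDF2<2.0.0, pypdf does have many improvements in text extraction.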

import re
import pypdf
import PySimpleGUI as sg

#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))

while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            reader = pypdf.PdfReader(f)

            # Collect matches from every page of the PDF
            lista = []
            for page in reader.pages:

                # Extract the text of the current page
                texto = page.extract_text()

                # Use regexes to look for CPF and CNPJ numbers
                # (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # Collect the CPF and CNPJ numbers found on this page
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)
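
Since the PDF was not shared, it is hard to say how any library will lay out the text. One way to make the matching less dependent on the extractor is to let the patterns themselves tolerate whitespace between the parts of a number. The following is a minimal sketch under that assumption; `find_numbers`, `CPF` and `CNPJ` are names invented for this example and are not part of pypdf.

```python
import re

# Same patterns as in the script, but with optional whitespace (\s*) allowed
# around every separator, so a match can span a line break.
CNPJ = re.compile(r'\d{2}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*/\s*\d{4}\s*-\s*\d{2}')
CPF = re.compile(r'\d{3}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*-\s*\d{2}')

def find_numbers(texto):
    """Return CPF matches, then CNPJ matches, with inner whitespace removed."""
    found = []
    for pattern in (CPF, CNPJ):
        for match in pattern.finditer(texto):
            # Strip the whitespace the pattern tolerated to get the usual form
            found.append(re.sub(r'\s+', '', match.group()))
    return found

# Hypothetical sample: the CNPJ is broken by a newline, the CPF by a space
exemplo = "CNPJ 30.114.117\n/0001-64, CPF 123.456.789 -10"
numeros = find_numbers(exemplo)
```

In the loop above, `find_numbers(texto)` could stand in for the two `re.findall` calls, and joining the results into one string (for example with `", ".join(...)`) before updating the output field avoids displaying a Python list.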
