regex: My script can't read a number that breaks across lines

Posted by z31licg0 on 2023-01-18 in Other


I'm trying to read a 'cnpj', which is a number like "30.114.117/0001-64", from a PDF file. Here is my script:

import re
import PyPDF2
import PySimpleGUI as sg

#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))

while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            pdf = PyPDF2.PdfFileReader(f)

            # Iterates by every page from PDF
            lista = []
            for p in range(pdf.getNumPages()):

                # extract the text of the current page
                texto = pdf.getPage(p).extractText()

                # use regexes to find CPF (e.g. "123.456.789-10") and CNPJ (e.g. "30.114.117/0001-64") numbers
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # collect the CPF and CNPJ numbers found on this page
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)

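The script works, but the text in the variable 'texto' may come back with a line break in the middle of the number, like:
"Your number is 31.111.111
/0001-64"
When the line breaks like this, the regex can't find the number. I also tried texto = texto.replace("\n", " "), but that didn't find it either. Does anyone have an idea? Maybe another library that can read it.
I want to extract CPF and CNPJ numbers from a PDF, but the text wraps and I can't extract the number.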

regex

Source: https://stackoverflow.com/questions/75139859/my-script-cant-read-a-number-that-jump-line

2 Answers


c3frrgcw1#

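I recommend using PyMuPDF. It has a number of flags for text extraction, one of which detects hyphenation. If you extract the text with it, your problem should go away: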

import fitz # PyMuPDF import
doc = fitz.open("your.file")
page = doc[0]  # page 0

text = page.get_text(flags=fitz.TEXT_DEHYPHENATE)

By the way, none of the above is specific to PDF files; it also works for XPS, EPUB, and so on.
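To apply this to the whole document rather than only page 0, the extraction can be looped over every page and fed to the regexes from the question. This is only a sketch: the helper names `page_texts` and `find_ids_in_pages` are hypothetical, and whether `TEXT_DEHYPHENATE` actually rejoins the number depends on how the PDF breaks the line, which can't be checked without the file.

```python
import re

# Patterns copied from the question.
CNPJ_RE = re.compile(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}')
CPF_RE = re.compile(r'\d{3}\.\d{3}\.\d{3}-\d{2}')


def page_texts(path):
    """Return the text of every page, extracted with the dehyphenation flag."""
    import fitz  # PyMuPDF; imported lazily so find_ids_in_pages works without it
    with fitz.open(path) as doc:
        return [page.get_text(flags=fitz.TEXT_DEHYPHENATE) for page in doc]


def find_ids_in_pages(texts):
    """Collect CPF and CNPJ matches from an iterable of page texts."""
    found = []
    for texto in texts:
        found.extend(CPF_RE.findall(texto))
        found.extend(CNPJ_RE.findall(texto))
    return found
```

Usage would look like `find_ids_in_pages(page_texts("your.file"))`, with the result passed to the GUI output field.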

Answered 2023-01-18

fae0ux8s2#

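The PyPDF2 library you are using is deprecated. Please move to pypdf.
You haven't shared the PDF file, so I can't check whether this really solves your problem. However, pypdf does have many text-extraction improvements compared with PyPDF2<2.0.0.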

import re
import pypdf
import PySimpleGUI as sg

#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))

while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            reader = pypdf.PdfReader(f)

            # Iterates by every page from PDF
            lista = []
            for page in reader.pages:

                # extract the text of the current page
                texto = page.extract_text()

                # use regexes to find CPF (e.g. "123.456.789-10") and CNPJ (e.g. "30.114.117/0001-64") numbers
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # collect the CPF and CNPJ numbers found on this page
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)
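
If the extracted text still splits the number across lines, the break can also be absorbed by the pattern itself. Note that replacing "\n" with a space, as tried in the question, leaves "31.111.111 /0001-64", which the original regex still rejects because of the space. The sketch below allows optional whitespace around each separator and then strips it from the match. It assumes the break only falls next to a separator, never inside a group of digits, and the name `find_ids_across_breaks` is hypothetical.

```python
import re

# Same patterns as in the question, but with optional whitespace
# (spaces or newlines) allowed around every separator.
CNPJ_RE = re.compile(r'\d{2}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*/\s*\d{4}\s*-\s*\d{2}')
CPF_RE = re.compile(r'\d{3}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*-\s*\d{2}')


def find_ids_across_breaks(texto):
    """Return (cpfs, cnpjs) found in texto, with internal whitespace removed."""
    def strip_ws(match):
        return re.sub(r'\s+', '', match)

    cnpjs = [strip_ws(m) for m in CNPJ_RE.findall(texto)]
    cpfs = [strip_ws(m) for m in CPF_RE.findall(texto)]
    return cpfs, cnpjs
```

In the loop above, `cpfs, cnpjs = find_ids_across_breaks(texto)` would stand in for the two `re.findall` calls.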

Answered 2023-01-18
