python-3.x: How to avoid spreading text across multiple rows when storing PDF text in a CSV

Posted by mbzjlibv on 2023-05-19 in Python


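I am storing PDF text (extracted with pypdf) in a CSV file. The problem is that a few of the PDF files are very long, and for those long files the text spreads across multiple rows instead of staying in a single row. How can I keep the text in a single row? My output looks like this: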

column1    column2            
 long pdf    hello my
             name is jhone
 short pdf   hello my name is jhone. I haven't any problem for short pdf file

My code:

import csv
import io

import requests
from pypdf import PdfReader

pdf_url = 'https://www.snb.ch/en/mmr/speeches/id/ref_20230330_amrtmo/source/ref_20230330_amrtmo.en.pdf'
print("pdf_url: ", pdf_url)

# Download the PDF file from the URL
response = requests.get(pdf_url)

# Create an in-memory buffer from the PDF content
pdf_buffer = io.BytesIO(response.content)

# Read the PDF file from the in-memory buffer
pdf = PdfReader(pdf_buffer)
pdf_content = []
# Extract the text of each page and collect it in a list
for page in pdf.pages:
    pdf_content.append(str(page.extract_text()))

# filename, first_author, new_date_str and speech_title are not shown in the
# question; they are assumed to be defined earlier in the script.
with open(filename, "a", newline="", encoding='utf8') as f:
    writer = csv.writer(f)
    # Note: pdf_content is still a list of page strings at this point
    writer.writerow([first_author, new_date_str, speech_title, pdf_url, pdf_content])

pdf_content.clear()
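
One possible approach, offered as a sketch rather than a confirmed fix: join the per-page strings into one string and collapse every run of whitespace (including line breaks) into a single space before writing the row. The helper name `to_single_line` and the sample page strings are hypothetical and only stand in for the output of `extract_text()`; an in-memory `io.StringIO` replaces the real CSV file so the example runs on its own.

```python
import csv
import io
import re


def to_single_line(pages):
    """Join page texts and collapse each whitespace run (newlines included) to one space."""
    joined = " ".join(pages)
    return re.sub(r"\s+", " ", joined).strip()


# Hypothetical sample standing in for the per-page output of extract_text()
sample_pages = ["hello my\nname is jhone", "second page\r\ntext"]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerow(["long pdf", to_single_line(sample_pages)])

csv_text = buffer.getvalue()  # one physical line, since the field has no line breaks
```

In the script above this would mean passing `to_single_line(pdf_content)` to `writer.writerow` in place of `pdf_content`. If the rows still look split when the file is opened in a spreadsheet program, the cause may be that program's limit on characters per cell rather than the CSV itself; the question does not confirm which applies.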

python-3.x

Source: https://stackoverflow.com/questions/76207323/while-storing-pdf-text-in-csv-how-to-avoid-spreading-text-to-multiple-row