Error when trying to read a CSV file in Python

91zkwejq, posted on 2023-04-03 in Python

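I am trying to write a Python program that takes a CSV file containing one link per row (the link is the last field of each row). Every time I run the program I get this error: "Article `download()` failed with No connection adapters were found". The code is: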

import csv
from newspaper import Article
import spacy
import requests

nlp = spacy.load("en_core_web_sm")

URL_COLUMN_INDEX = 4

OUTPUT_FILE_PATH = "output.csv"

visited_urls = set()

with open("20230327113000.export.CSV", "r", encoding="ISO-8859-1") as infile, open(OUTPUT_FILE_PATH, "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    for row in reader:
        if reader.line_num == 1:
            continue

        url = row[-1]

        if url in visited_urls:
            continue

        try:
            article = Article(url)
            article.download()
            article.parse()
        except Exception as e:
            print(f"Error processing article: {url}")
            print(e)
            continue

        summary = article.summary
        doc = nlp(summary)
        entities = [ent.text for ent in doc.ents]

        output_row = [url] + entities
        writer.writerow(output_row)

        visited_urls.add(url)

        print(f"Processed article: {url}")

I have tried writing the program with both newspaper and BeautifulSoup, and the result is the same.

llmtgqce1#

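I use *newspaper3k* a lot. I also wrote a GitHub repository on how to use *newspaper3k*.
Based on your error:
"Article `download()` failed with No connection adapters were found"
I assume that Article(url) currently ends up looking like this:
Article(https://www.somesite.com)
It should look like this:
Article(f"{url}")
It is also possible that the URLs in your CSV file are not formatted correctly. URLs must be in the following format: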
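For context, newspaper3k downloads pages with requests, and requests reports "No connection adapters were found" when the string it receives does not start with http:// or https://. The following is a minimal sketch of a pre-check you could run on each value before passing it to Article. It uses only the standard library, and the helper name `looks_like_http_url` is hypothetical, not part of newspaper3k.

```python
from urllib.parse import urlparse


def looks_like_http_url(value: str) -> bool:
    """Return True if value has an http or https scheme and a host part."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# A complete URL passes; a bare host name without a scheme does not.
print(looks_like_http_url("https://www.somesite.com"))
print(looks_like_http_url("www.somesite.com"))
```

Printing `repr(url)` for any value that fails this check usually shows what is wrong with it, for example stray whitespace or text from a neighbouring column.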

Let me know if this solves your problem.
You should also consider passing some HTTP configuration parameters to newspaper.Article:
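One way the last field of a row can end up malformed is a delimiter mismatch: csv.reader splits on commas by default, so a tab-separated file will not yield the URL as a clean last field. Whether that applies to your file is an assumption, since the question does not say which delimiter the file uses. The sketch below, with a hypothetical helper `split_last_fields` and made-up sample rows, shows how to compare the two cases.

```python
import csv
import io
from urllib.parse import urlparse


def split_last_fields(text_stream, delimiter=","):
    """Return (good, bad): the last field of each row, split by URL validity."""
    good, bad = [], []
    for row in csv.reader(text_stream, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        candidate = row[-1].strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            good.append(candidate)
        else:
            bad.append(candidate)
    return good, bad


# Hypothetical tab-separated sample, not data from the question's file.
sample = (
    "1\tfoo, bar\thttps://www.somesite.com/a\n"
    "2\tbaz\thttps://www.somesite.com/b,c\n"
)

# Default comma delimiter: the last fields are fragments, not URLs.
print(split_last_fields(io.StringIO(sample), delimiter=","))
# Tab delimiter: the last field of each row is the full URL.
print(split_last_fields(io.StringIO(sample), delimiter="\t"))
```

If the `bad` list is non-empty for your real file, inspecting a few of its entries should tell you whether the problem is the delimiter, a header row, or the URLs themselves.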

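from newspaper import Config
from newspaper import Article

# A browser-like User-Agent string; some sites reject the library default.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10  # seconds

# `url` is the URL string read from the current CSV row.
article = Article(f'{url}', config=config)
article.download()
article.parse()
# <DO SOMETHING> with the parsed article, e.g. article.title or article.text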
