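I'm trying to write a Python program that takes a CSV file containing one link per row (the link is in the last column). When I run my program, I always get this error: "Article `download()` failed with No connection adapters were found". The code is: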
import csv
from newspaper import Article
import spacy
import requests

nlp = spacy.load("en_core_web_sm")
URL_COLUMN_INDEX = 4
OUTPUT_FILE_PATH = "output.csv"
visited_urls = set()

with open("20230327113000.export.CSV", "r", encoding="ISO-8859-1") as infile, open(OUTPUT_FILE_PATH, "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if reader.line_num == 1:
            continue
        url = row[-1]
        if url in visited_urls:
            continue
        try:
            article = Article(url)
            article.download()
            article.parse()
        except Exception as e:
            print(f"Error processing article: {url}")
            print(e)
            continue
        summary = article.summary
        doc = nlp(summary)
        entities = [ent.text for ent in doc.ents]
        output_row = [url] + entities
        writer.writerow(output_row)
        visited_urls.add(url)
        print(f"Processed article: {url}")
I've tried writing the program with both newspaper and BeautifulSoup, and the result is the same.
1 Answer
I use newspaper3k regularly. I wrote a GitHub repository on how to use newspaper3k.
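Based on your error:
"Article `download()` failed with No connection adapters were found"
I assume that
Article(url)
ends up looking like this: Article(https://www.somesite.com)
It should look like this:
Article("{url}")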
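It's also possible that the URLs in your CSV file are not formatted correctly. requests reports "No connection adapters were found" when a URL does not begin with a scheme it can handle, such as http:// or https://. URLs must be in the following format: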
Let me know if this solves your problem.
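If the CSV cells turn out to be the culprit, something along these lines could be used to clean each value before handing it to Article. This is only a rough sketch: clean_url is a hypothetical helper name, and it assumes the bad cells are either padded with whitespace or quotes, or are missing the scheme (in which case it guesses https).

```python
from urllib.parse import urlparse


def clean_url(raw):
    """Return a usable http(s) URL, or None if the cell cannot be one."""
    # Drop surrounding whitespace and stray quote characters left by the export.
    url = raw.strip().strip("'\"")
    if not url:
        return None

    parsed = urlparse(url)
    # Already a well-formed http(s) URL: use it as-is.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return url

    # No scheme but the first path segment looks like a host name,
    # e.g. "www.somesite.com/story". Assumption: https is acceptable.
    if not parsed.scheme and "." in url.split("/")[0]:
        return "https://" + url

    # Anything else (empty text, other schemes, free text) is rejected.
    return None
```

In the loop from the question this would be applied as url = clean_url(row[-1]), skipping the row when the result is None.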
You should also consider passing some HTTP configuration parameters to
newspaper.Article
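For example, a minimal sketch using newspaper3k's Config object might look like the following. The user agent string and the 10 second timeout are placeholder values I picked for illustration, and build_article is a hypothetical helper name, not something from the library.

```python
# Placeholder values: adjust both to suit the sites you are fetching.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"
REQUEST_TIMEOUT_SECONDS = 10


def build_article(url):
    """Create an Article that sends a browser-like user agent and uses a timeout."""
    # Imported here so this module can be loaded even where newspaper3k
    # is not installed; the import only happens when the helper is called.
    from newspaper import Article, Config

    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = REQUEST_TIMEOUT_SECONDS
    return Article(url, config=config)
```

You would then call article = build_article(url) in place of Article(url), followed by article.download() and article.parse() as before. Some sites reject the default user agent, so this can help with failed downloads, but it will not fix a URL that is malformed to begin with.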