Good day,
I'm trying to download PDF files from a specific website using Python with requests and BeautifulSoup. I used a script that successfully downloads files from an example site I found online, so the script itself works, but when I run it against this particular site it completes without downloading any files. The site is below.
https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Primary-Network/Family-Practitioners/REO-Family-Practitioners
Can anyone help?
I used a script I found online (shown below) to download the files from a test site.
# Import libraries
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Request the URL and get the response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')

i = 0

# From all links, check for a PDF link and
# if present download the file
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Downloading file: ", i)
        # Get the response object for the link
        response = requests.get(link.get('href'))
        # Write the content to a PDF file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
1 Answer
All of the links to the PDF files on that page end in '.asp?la=en', so you need to filter the links on that condition instead of '.pdf'. Also, none of the hrefs include the domain, so each link you find has to be joined with the site's domain before downloading. Working code is below.
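A minimal sketch along those lines (assuming, per the description above, that the file links end in '.asp?la=en'; urllib.parse.urljoin is used here to resolve the relative hrefs against the page URL):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Page that lists the downloadable files
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Primary-Network/Family-Practitioners/REO-Family-Practitioners"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

i = 0
for link in soup.find_all('a', href=True):
    href = link['href']
    # On this site the file links end in '.asp?la=en' rather than '.pdf'
    if href.endswith('.asp?la=en'):
        i += 1
        # The hrefs are relative, so join them with the site's domain
        full_url = urljoin(url, href)
        print("Downloading file: ", i)
        file_response = requests.get(full_url)
        with open("pdf" + str(i) + ".pdf", 'wb') as pdf:
            pdf.write(file_response.content)
        print("File ", i, " downloaded")

print("All PDF files downloaded")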