Download PDF files from an HTTPS website using pandas

anauzrmj · published 2023-06-04 in Other
Follow (0) | Answers (1) | Views (633)

Good day,
I am trying to download PDF files from a specific website using pandas and BeautifulSoup. I used a script that successfully downloads files from an example site I found online, so the script itself works, but when I run it against this particular site it completes without downloading any files. The site is below.
https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Primary-Network/Family-Practitioners/REO-Family-Practitioners
Can anyone help?
The script I found online and used to download files from a test website is shown below.

# Import libraries
import requests
from bs4 import BeautifulSoup

# URL from which pdfs to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Requests URL and get response object
response = requests.get(url)

# Parse text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on webpage
links = soup.find_all('a')

i = 0

# From all links check for pdf link and
# if present download file
for link in links:
    if ('.pdf' in link.get('href', [])):
        i += 1
        print("Downloading file: ", i)

        # Get response object for link
        response = requests.get(link.get('href'))

        # Write content in pdf file
        pdf = open("pdf"+str(i)+".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
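For context, the filter in the script above only matches hrefs that literally contain '.pdf'. A minimal standalone check with hypothetical hrefs shows why links served under a different extension (such as the '.ashx?la=en' URLs used on the GEMS site) are silently skipped:

```python
# Hypothetical hrefs illustrating the filter used in the script above.
hrefs = ["/docs/list.pdf", "/media/file.ashx?la=en", None]

# The script keeps only hrefs that literally contain ".pdf"; links
# served under another extension never match, so nothing is downloaded.
matches = [h for h in hrefs if h and ".pdf" in h]
print(matches)
```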
Answer 1 (yfwxisqw)

On this site, all links to the stored PDF files end in '.ashx?la=en', so you need to search for file links on that condition. Also, none of the links include the domain, so for each link found you need to join it with the domain. Below is working code:

import requests
from bs4 import BeautifulSoup

# URL from which the PDFs are to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Request the URL and get the response object
response = requests.get(url)

# Parse the HTML obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')
i = 0
base_url = "https://www.gems.gov.za"

# Check every link for a PDF URL and, if one is present, download the file
for link in links:
    href = link.get("href")
    # Skip anchors without an href and links that do not point to a PDF
    if not href or not href.endswith(".ashx?la=en"):
        continue
    i += 1
    print("Downloading file: ", i)

    # The hrefs are relative, so prepend the domain
    response = requests.get(f"{base_url}{href}")

    # Write the content to a PDF file
    with open("pdf" + str(i) + ".pdf", 'wb') as pdf:
        pdf.write(response.content)
    print("File ", i, " downloaded")
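As a side note, concatenating `base_url` with the href assumes every link is site-relative. The standard-library `urllib.parse.urljoin` handles both relative and absolute hrefs, which is a safer way to build the download URL. A minimal sketch with hypothetical hrefs:

```python
from urllib.parse import urljoin

base_url = "https://www.gems.gov.za"

# Relative hrefs are resolved against the base URL...
print(urljoin(base_url, "/media/example.ashx?la=en"))
# ...while hrefs that are already absolute pass through unchanged.
print(urljoin(base_url, "https://other.example/file.pdf"))
```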
