BeautifulSoup not extracting <a> tags from a dynamic website [duplicate]

Posted by mmvthczy on 2023-04-04 in Python


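This question already has answers here:

Can bs4 get the dynamic content of a webpage if requests can't? (2 answers)
Closed 16 hours ago.
As of 15 hours ago, the community is reviewing whether to reopen this question.
I have a page, https://dip.bundestag.de/aktivit%C3%A4t/Dr--Holger-Becker-MdB-SPD/1628877, from which I want to extract the HTML attached to "BT-Plenarprotokoll 20/86, S. 10313 C". The HTML block is: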

<a title="PDF Bundestags-Plenarprotokoll öffnen" aria-label="BT-Plenarprotokoll" href="https://dserver.bundestag.de/btp/20/20086.pdf#P.10313" target="_self" class="hsbfb4-0 sc-1xaeas4-1 hTYfHF FZiNn"><svg viewBox="0 0 10 12" class="sc-1c5ggr5-17 cYBAUx"><g stroke="currentColor" fill="none" fill-rule="evenodd"><path d="M6.14.5H.5v11h9V3.86z"></path><path d="M5.56 2.01v2.51H9.5"></path></g></svg><span class="sc-1xaeas4-3 iZuhXx">BT-Plenarprotokoll 20/86, S. 10313C</span></a>

For some reason, BeautifulSoup does not recognize any tags on this page. I tried two different snippets:

import requests
from bs4 import BeautifulSoup

url = "https://dip.bundestag.de/aktivit%C3%A4t/Dr--Holger-Becker-MdB-SPD/1628877"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the anchor tags that carry the expected title and aria-label attributes
a_tag = soup.find_all('a', {'title': 'PDF Bundestags-Plenarprotokoll öffnen', 'aria-label': 'BT-Plenarprotokoll'})

and

url = "https://dip.bundestag.de/aktivit%C3%A4t/Dr--Holger-Becker-MdB-SPD/1628877"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find every anchor tag on the page
a_tag = soup.find_all('a')

In both cases a_tag is an empty list, which I don't understand, because the page contains multiple links.
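A quick way to see what is likely going on, sketched under the assumption that the server returns only a JavaScript shell and the links are added later in the browser: count the <a> tags in the raw response text with the standard-library parser, which rules out BeautifulSoup itself as the cause. SHELL_HTML and RENDERED_HTML below are hypothetical stand-ins, not the real response from dip.bundestag.de, and AnchorCounter and count_anchors are illustrative names.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_anchors(html):
    """Return the number of <a> tags present in the given markup."""
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for response.text: an empty mount point plus a script
SHELL_HTML = """<!DOCTYPE html>
<html><head><title>DIP</title></head>
<body><div id="root"></div><script src="/static/js/main.js"></script></body>
</html>"""

# Hypothetical stand-in for the DOM the browser shows after scripts have run
RENDERED_HTML = (
    '<div id="root">'
    '<a href="https://dserver.bundestag.de/btp/20/20086.pdf#P.10313">'
    'BT-Plenarprotokoll 20/86, S. 10313C</a></div>'
)

print(count_anchors(SHELL_HTML))     # 0
print(count_anchors(RENDERED_HTML))  # 1
```

If count_anchors(response.text) also returns 0 for the real page, the links are not in the downloaded HTML at all, and no choice of parser or selector will find them.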

python

Source: https://stackoverflow.com/questions/75921167/beautifulsoup-not-extracting-a-a-tag-from-a-dynamic-website

1 Answer


ryevplcw1#

Use their API to get the data.
Note: the API key is valid until May and can be found here.
For example:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Authorization': 'ApiKey GmEPb1B.bfqJLIhcGAsH9fTJevTglhFpCoZyAAAdhp',
}

url = "https://search.dip.bundestag.de/api/v1/aktivitaet?f.id=1628877"
documents = requests.get(url, headers=headers).json()["documents"]
print(documents[0]["fundstelle"]["pdf_url"])

Output:
https://dserver.bundestag.de/btp/20/20086.pdf#P.10313
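The one-liner above assumes that the first document always has a fundstelle with a pdf_url. A slightly more defensive variant might look like the sketch below. It relies only on the three keys used in the answer (documents, fundstelle, pdf_url); the helper name extract_pdf_urls and the sample payload are illustrative, and the rest of the API's response shape is not assumed.

```python
def extract_pdf_urls(payload):
    """Return every fundstelle.pdf_url found in an API response dict."""
    urls = []
    for doc in payload.get("documents") or []:
        # A document may lack "fundstelle" or carry it as None
        fundstelle = doc.get("fundstelle") or {}
        pdf_url = fundstelle.get("pdf_url")
        if pdf_url:
            urls.append(pdf_url)
    return urls


# Illustrative payload containing only the fields used in the answer
sample = {
    "documents": [
        {"fundstelle": {"pdf_url": "https://dserver.bundestag.de/btp/20/20086.pdf#P.10313"}},
        {"fundstelle": {}},  # no PDF link
        {},                  # no fundstelle at all
    ]
}

print(extract_pdf_urls(sample))
# ['https://dserver.bundestag.de/btp/20/20086.pdf#P.10313']
```

With a live request, the parsed JSON from requests.get(url, headers=headers).json() can be passed in place of sample. An empty list then means no PDF link was returned, rather than raising a KeyError or IndexError.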

Answered 2023-04-04
