I'm trying to scrape the images from this website, but instead of the real image src I get the lazy-loading placeholder src attribute.
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
s = Service(r"M:\WebScraping\chromedriver.exe")
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get(url)
time.sleep(5)
driver.execute_script("window.scrollTo(0, 500);")
page = urllib.request.urlopen(url)
doc = BeautifulSoup(page, "html.parser")
teams = doc.find(class_="ds-p-0").find(class_="ds-mb-4")
for team in teams:
    print(team.img["src"])
    file_name = team.img["alt"]
    img_file = open(file_name + ".png", "wb")
    img_file.write(urllib.request.urlopen(team.img["src"]).read())
    img_file.close()
This is the output I'm getting (it's just the lazy-load placeholder image):
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
https://wassets.hscicdn.com/static/images/lazyimage-noaspect.svg
But I want the actual image source, like this:
https://img1.hscicdn.com/image/upload/f_auto,t_ds_square_w_160,q_50/lsci/db/PICTURES/CMS/333800/333885.png
1 Answer
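BeautifulSoup cannot execute JavaScript or load content that scripts inject later, which is why parsing the raw response from urllib.request.urlopen(url), as your code does, returns only the lazy-load placeholder src.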
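Selenium, on the other hand, drives a real browser and can load almost all of that content, so you can load the page with Selenium and then pass its page source (driver.page_source) to BeautifulSoup instead of the response fetched from the URL.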
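As a rough sketch of that idea (not necessarily the exact code from the original answer): the helper names fetch_rendered_html and image_sources are hypothetical, the fixed sleeps and the single scroll to the bottom are assumptions about what this page needs before its images load, and webdriver.Chrome() with no arguments assumes a Selenium version that can locate a matching chromedriver by itself.

```python
import time

from bs4 import BeautifulSoup


def fetch_rendered_html(driver, url, wait_seconds=5):
    """Return the page HTML after the browser has run the page's JavaScript."""
    driver.get(url)
    time.sleep(wait_seconds)  # crude wait; an explicit wait would be more robust
    # Scroll to the bottom so images that load on scroll get a chance to load.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(wait_seconds)
    return driver.page_source


def image_sources(html):
    """Return (alt, src) pairs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [(img.get("alt", ""), img["src"]) for img in soup.find_all("img", src=True)]


def main():
    # Imported here so image_sources() can be used without Selenium installed.
    from selenium import webdriver

    url = "https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/squads"
    driver = webdriver.Chrome()  # assumption: chromedriver is found automatically
    try:
        html = fetch_rendered_html(driver, url)
    finally:
        driver.quit()
    for alt, src in image_sources(html):
        print(alt, src)
```

Calling main() opens Chrome, waits, and prints one alt/src pair per image. If some entries still show the placeholder, the page may need to be scrolled in smaller steps so that each image actually enters the viewport.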
That way BeautifulSoup works with the page's full rendered HTML. The code below prints the URLs using both Selenium and BeautifulSoup, so you can see both techniques.
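The Selenium side of that comparison could look roughly like the following sketch, which reads each src straight from the live browser instead of parsing HTML. The names is_real_image and selenium_image_urls are hypothetical, and the placeholder file name is taken from the output shown in the question; it assumes a driver that has already loaded and scrolled the page.

```python
# File name of the lazy-load placeholder seen in the question's output.
PLACEHOLDER_NAME = "lazyimage-noaspect.svg"


def is_real_image(src):
    """True when src is set and does not point at the lazy-load placeholder."""
    return bool(src) and not src.endswith(PLACEHOLDER_NAME)


def selenium_image_urls(driver):
    """Collect the src of every <img> in the live page, skipping placeholders."""
    # Imported here so is_real_image() can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    urls = []
    for img in driver.find_elements(By.TAG_NAME, "img"):
        src = img.get_attribute("src")
        if is_real_image(src):
            urls.append(src)
    return urls
```

Filtering on the placeholder name also gives a quick check of whether the page needs more scrolling or a longer wait: if the returned list is much shorter than the number of img elements, many images had not loaded yet.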
Output