Selenium: selecting the first element of srcset with Python

eivgtgni  posted on 2022-11-10  in  Python

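Using Selenium in Python, I have been able to access some of the URLs of the images I want to download. However, the image links are stored in the img element's srcset attribute. When I call get_attribute('srcset'), it returns a single string containing 4 links. I only want the first one. How would I go about doing this? Could I just trim the string afterwards?
This is the site I'm scraping:
https://www.politicsanddesign.com/
Here is my code: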

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
import pyautogui
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options = chrome_options)
driver.get('https://www.politicsanddesign.com/')
img_url = driver.find_element(By.XPATH, "//div[@class = 'responsive-image-wrapper']/img").get_attribute("srcset")
driver.get(img_url)

The img_url value looks like this:

//images.ctfassets.net/00vgtve3ank7/6f38yjnNcU1d6dw0jt1Uhk/70dfbf208b22f7b1c08b7421f910bb36/2020_HOUSE_VA-04_D-MCEACHIN..jpg?w=400&fm=jpg&q=80 400w, //images.ctfassets.net/00vgtve3ank7/6f38yjnNcU1d6dw0jt1Uhk/70dfbf208b22f7b1c08b7421f910bb36/2020_HOUSE_VA-04_D-MCEACHIN..jpg?w=800&fm=jpg&q=80 800w, //images.ctfassets.net/00vgtve3ank7/6f38yjnNcU1d6dw0jt1Uhk/70dfbf208b22f7b1c08b7421f910bb36/2020_HOUSE_VA-04_D-MCEACHIN..jpg?w=1200&fm=jpg&q=80 1200w, //images.ctfassets.net/00vgtve3ank7/6f38yjnNcU1d6dw0jt1Uhk/70dfbf208b22f7b1c08b7421f910bb36/2020_HOUSE_VA-04_D-MCEACHIN..jpg?w=1800&fm=jpg&q=80 1800w

But I want it to be just:

//images.ctfassets.net/00vgtve3ank7/6f38yjnNcU1d6dw0jt1Uhk/70dfbf208b22f7b1c08b7421f910bb36/2020_HOUSE_VA-04_D-MCEACHIN..jpg?w=400&fm=jpg&q=80

jchrr9hc  #1

The image appears to have a property called currentSrc, which holds only the URL the browser is currently using.

img_url = driver.find_element(By.XPATH, "//div[@class = 'responsive-image-wrapper']/img").get_attribute("currentSrc")
driver.get(img_url)

lsmd5eda  #2

You can simply split the value extracted from that web element.
Like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
import pyautogui
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options = chrome_options)
driver.get('https://www.politicsanddesign.com/')
img_url = driver.find_element(By.XPATH, "//div[@class = 'responsive-image-wrapper']/img").get_attribute("srcset")
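# Each comma-separated entry looks like "//host/path.jpg?w=400&fm=jpg&q=80 400w":
# keep only the URL part and prepend the scheme, since the URLs are protocol-relative.
img_urls = ["https:" + entry.split()[0] for entry in img_url.split(",")]

Now img_urls is a list of 4 URLs (one per width listed in the srcset), which you can use as follows:

driver.get(img_urls[0]) # open the first URL (400w)
driver.get(img_urls[1]) # open the second URL (800w)
driver.get(img_urls[2]) # open the third URL (1200w)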
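
As a rough sketch of the same splitting idea, the helper below turns a srcset string into (URL, descriptor) pairs and resolves the protocol-relative URLs against the page address. The function name parse_srcset and the base_url default are assumptions made for illustration, and splitting on commas is a simplification that only holds when the URLs themselves contain no commas, as in the question's example.

```python
from urllib.parse import urljoin


def parse_srcset(srcset, base_url="https://www.politicsanddesign.com/"):
    """Split a srcset string into (absolute_url, descriptor) pairs.

    Hypothetical helper: assumes candidates are comma-separated and that
    no URL contains a comma.
    """
    candidates = []
    for entry in srcset.split(","):
        parts = entry.split()
        if not parts:
            # Skip empty entries, e.g. from a trailing comma or empty string.
            continue
        # urljoin fills in the scheme for "//host/path" style URLs.
        url = urljoin(base_url, parts[0])
        descriptor = parts[1] if len(parts) > 1 else ""
        candidates.append((url, descriptor))
    return candidates
```

With the value from the question, parse_srcset(img_url)[0][0] would be the 400w URL with the https: scheme filled in, which can then be passed to driver.get.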

hsgswve4  #3

My inefficient solution:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains
import pyautogui
import time

# WILL NEED TO EVENTUALLY FIGURE OUT HOW TO WRAP ALL OF THIS INTO A FUNCTION OR LOOP TO DO IT FOR ALL DIV OBJECTS

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options = chrome_options)
driver.get('https://www.politicsanddesign.com/')
img_url = driver.find_element(By.XPATH, "//div[@class = 'responsive-image-wrapper']/img").get_attribute("srcset")
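# Everything before the first ' 400w' descriptor is the first URL.
# It is protocol-relative (starts with //), so prepend the scheme.
img_url2 = 'https:' + img_url.split(' 400w',1)[0]
driver.get(img_url2)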
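
The comment in the code above mentions wrapping this in a function or loop for every image. One possible shape for that, as a minimal sketch: first_srcset_url and first_urls are hypothetical names, and the code assumes that protocol-relative URLs should get an https: scheme and that elements without a srcset should be skipped.

```python
def first_srcset_url(srcset, scheme="https:"):
    """Return the first URL in a srcset string, or None if it is empty."""
    first_entry = srcset.split(",", 1)[0].strip()
    if not first_entry:
        return None
    # Drop the width/density descriptor (e.g. "400w") that follows the URL.
    url = first_entry.split()[0]
    # Protocol-relative URLs start with "//" and need a scheme prepended.
    return scheme + url if url.startswith("//") else url


def first_urls(srcsets):
    """Map an iterable of srcset strings (or None) to their first URLs."""
    urls = []
    for srcset in srcsets:
        if not srcset:
            # get_attribute returns None when the attribute is missing.
            continue
        url = first_srcset_url(srcset)
        if url:
            urls.append(url)
    return urls


# Possible usage with the driver from the code above (not run here):
# imgs = driver.find_elements(By.XPATH, "//div[@class = 'responsive-image-wrapper']/img")
# urls = first_urls(img.get_attribute("srcset") for img in imgs)
```

Unlike splitting on ' 400w', this does not depend on the first candidate having a particular width.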
