Scraping web pages with Selenium and BeautifulSoup in Python takes too much time

l7wslrjt  posted on 2023-05-13  in  Python

I scrape the product links (<a href="">) from a product listing page and store them in the list hrefs:

from bs4 import BeautifulSoup
from selenium import webdriver
import os
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
service = webdriver.chrome.service.Service(executable_path=os.path.join(os.getcwd(), "chromedriver.exe"))
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.set_page_load_timeout(900)
link = 'https://www.catch.com.au/seller/vdoo/products.html?page=1'
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'lxml')
product_links = soup.find_all("a", class_="css-1k3ukvl")

hrefs = []
for product_link in product_links:
    href = product_link.get("href")
    if href.startswith("/"):
        href = "https://www.catch.com.au" + href
    hrefs.append(href)

The page lists 36 products, so about 36 links end up in the list. I then take each link from hrefs, visit it, and scrape further data from that product page:

products = []
for href in hrefs:
    driver.get(href)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    
    title = soup.find("h1", class_="e12cshkt0").text.strip()
    price = soup.find("span", class_="css-1qfcjyj").text.strip()
    image_link = soup.find("img", class_="css-qvzl9f")["src"]
    product = {
        "title": title,
        "price": price,
        "image_link": image_link
    }
    products.append(product)
driver.quit()
print(len(products))

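But this takes far too long. I set the page load timeout to 900 seconds and it still times out. Problems:
1. For now I only collect product links from the first page, but there are more pages, up to 40, with 36 products each. When I implement fetching data from all pages, it also times out.
2. In the second part, where I visit each of these links and scrape it, it takes even more time. How can I reduce the execution time of this program? Can I split the program into parts?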

2j4z5cfb #1

You can skip Selenium entirely and get the results directly from the site's AJAX API. For example:

import requests
from bs4 import BeautifulSoup

api_url = "https://www.catch.com.au/seller/vdoo/products.json"

params = {
    "page": 1,  # <-- to get other pages, increase this parameter
}

data = requests.get(api_url, params=params).json()

urls = []
for r in data['payload']['results']:
    urls.append(f"https://www.catch.com.au{r['product']['productPath']}")

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    price = soup.select_one('[itemprop=price]')['content']
    title = soup.h1.text
    print(f'{title:<100} {price:<5}')

Prints:

2x Pure Natural Cotton King Size Pillow Case Cover Slip - 54x94cm - White                            46.99
Fire Starter Lighter Waterproof Flint Match Metal Keychain Camping Survival - Gold                   20.89
Plain Solid Colour Cushion Cover Covers Decorative Pillow Case - Apple Green                         20.9 
2000TC 4PCS Bed Sheet Set Flat Fitted Pillowcase Single Double Queen King Bed - Black                57.18
All Size Bed Ultra Soft Quilt Duvet Doona Cover Set Bedding - Paris Eiffel Tower                     50.99

...and so on.
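
To cover all pages and speed up the per-product requests, something along these lines might work. This is only a rough sketch, not tested against the live site: the helper names (product_urls, collect_product_urls, fetch_many, main), the worker count of 8, the 30 second timeout, and the idea that a page past the last one returns an empty results list are all my assumptions. The 40 page limit comes from the question. The product pages are fetched on a thread pool, because the work is mostly waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://www.catch.com.au"
API_URL = f"{BASE_URL}/seller/vdoo/products.json"


def product_urls(data):
    """Build absolute product URLs from one page of the JSON response."""
    return [
        f"{BASE_URL}{r['product']['productPath']}"
        for r in data["payload"]["results"]
    ]


def collect_product_urls(get_page, last_page):
    """Call get_page(1), get_page(2), ... and gather all product URLs.

    Stops early at the first page without results (assumed behaviour
    of the API past the last page).
    """
    urls = []
    for page in range(1, last_page + 1):
        page_urls = product_urls(get_page(page))
        if not page_urls:
            break
        urls.extend(page_urls)
    return urls


def fetch_many(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def main():
    # Third-party imports live here so the helpers above stay usable
    # (and testable) without any network dependencies.
    import requests
    from bs4 import BeautifulSoup

    def get_page(page):
        response = requests.get(API_URL, params={"page": page}, timeout=30)
        response.raise_for_status()
        return response.json()

    def scrape_product(url):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        return {
            "title": soup.h1.text.strip(),
            "price": soup.select_one("[itemprop=price]")["content"],
        }

    urls = collect_product_urls(get_page, last_page=40)
    products = fetch_many(urls, scrape_product)
    print(len(products))
```

Calling main() runs the whole thing against the site. Keep max_workers modest so you do not hammer the server; if you start seeing errors or blocked requests, lower it or add a delay between requests.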
