Scrapy Playwright: scroll down and wait for the HTML to load

rn0zuynd  posted on 2022-11-09  in  Other

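I have a page:
https://www.tradeinn.com/runnerinn/en/mens-shoes-trail-running-shoes/10005/s#fq=id_familia=10002&sort=v30_sum;desc@tm10;asc&fe=&pf=id_subfamilia=10005@&start=0
Products load automatically as you scroll down (only 48 products are shown initially). There should be around 630 products in total.
Here is my spider code. I always get only 48 results instead of 630+. Any idea why it doesn't load everything?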

import scrapy
from scrapy_playwright.page import PageMethod

class PicturesSpider(scrapy.Spider):
    name = 'pictures'
    allowed_domains = ['www.tradeinn.com']
    start_urls = ['http://www.tradeinn.com/']

    def start_requests(self):
        yield scrapy.Request(url='https://www.tradeinn.com/runnerinn/en/mens-shoes-trail-running-shoes/10005/s#fq=id_familia=10002&sort=v30_sum;desc@tm10;asc&fe=&pf=id_subfamilia=10005&&start=144',
                             meta={'playwright': True,
                                   'playwright_include_page': True,
                                   'playwright_page_method': [PageMethod('wait_for_selector', 'div::boton_cargar_mas.color_runnerinn'),
                                                              PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)")]},
                             callback=self.parse)

    def parse(self, response):
        images = response.css('div.BoxImage')
        for image in images:
            image_link = image.css('img::attr(src)').get()
            image_description = image.css('img::attr(alt)').get()
            yield {
                'image_link': image_link,
                'image_description': image_description
            }

Any suggestions on what I should change to get the full content?

t98cgbkg #1

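Here is one way to get the images on that page. Apparently there are only 398 of them, not 600+; the breadcrumb at the top of the page even says so: Trail Running Shoes (398). The solution is based on Selenium. You are welcome to wrap it in functions, make it OOP, whatever you like; I am only giving you the actual method for getting the images.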

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time as t

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("--headless")
chrome_options.add_argument("window-size=1280,1080")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)
actions = ActionChains(browser)

url = 'https://www.tradeinn.com/runnerinn/en/mens-shoes-trail-running-shoes/10005/s#fq=id_familia=10002&sort=v30_sum;desc@tm10;asc&fe=&pf=id_subfamilia=10005@&start=0'
browser.get(url)
pbody = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
for x in range(14):
    pbody.send_keys(Keys.PAGE_DOWN)
    print('scrolled')
    t.sleep(1)
t.sleep(5)
images = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'img[class="imagen_buscador"]')))
print(len(images))
for i in images:
    print(i.get_attribute('src'))

Result in terminal:

scrolled
scrolled
scrolled
[...]
398
https://www.tradeinn.com/h/13855/138556150/nike-juniper-trail-running-shoes.jpg
https://www.tradeinn.com/h/13855/138552239/nike-react-pegasus-4-trail-running-shoes.jpg
https://www.tradeinn.com/h/13727/137276760/merrell-vapor-glove-3-shoes.jpg
https://www.tradeinn.com/h/13842/138429239/adidas-terrex-agravic-flow-2-goretex-trail-running-shoes.jpg
https://www.tradeinn.com/h/13855/138556425/nike-wildhorse-7-trail-running-shoes.jpg
https://www.tradeinn.com/h/13842/138429416/adidas-terrex-two-goretex-trail-running-shoes.jpg
https://www.tradeinn.com/h/13842/138429408/adidas-terrex-two-boa-trail-running-shoes.jpg
https://www.tradeinn.com/h/13855/138556423/nike-wildhorse-7-trail-running-shoes.jpg
https://www.tradeinn.com/h/13857/138574338/new-balance-410v7-all-terrain-trail-running-shoes.jpg
https://www.tradeinn.com/h/13857/138574792/new-balance-fresh-foam-x-hierro-v7-trail-running-shoes.jpg
https://www.tradeinn.com/h/13857/138574340/new-balance-410v7-all-terrain-trail-running-shoes.jpg
https://www.tradeinn.com/h/13789/137892027/adidas-terrex-two-boa-trail-running-shoes.jpg
https://www.tradeinn.com/h/13727/137276761/merrell-vapor-glove-3-shoes.jpg
https://www.tradeinn.com/h/13836/138368054/joma-trek-trail-running-shoes.jpg
https://www.tradeinn.com/h/13710/137107634/vibram-fivefingers-v-trail-2.0-trail-running-shoes.jpg
https://www.tradeinn.com/h/13789/137891690/adidas-terrex-agravic-flow-trail-running-shoes.jpg
https://www.tradeinn.com/h/13789/137892315/adidas-terrex-swift-r3-trail-running-shoes.jpg
https://www.tradeinn.com/h/13855/138552241/nike-react-pegasus-4-trail-running-shoes.jpg
https://www.tradeinn.com/h/13803/138030718/nike-wildhorse-7-trail-running-shoes.jpg
[..]
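
The fixed `range(14)` above happens to be enough for 398 products, but it depends on how long the page is. A minimal sketch of a more general loop, assuming the page keeps appending items as you scroll: keep scrolling until the item count stops growing. The names `scroll_until_stable`, `count_items` and `scroll_once` are made up for this example and are not part of Selenium; the pause and patience values are guesses.

```python
import time
from typing import Callable


def scroll_until_stable(
    count_items: Callable[[], int],
    scroll_once: Callable[[], None],
    pause: float = 1.0,
    max_rounds: int = 100,
    patience: int = 3,
) -> int:
    """Scroll until the item count has not grown for `patience` rounds in a row.

    Returns the last item count that was observed.
    """
    last_count = count_items()
    idle_rounds = 0
    for _ in range(max_rounds):
        if idle_rounds >= patience:
            break  # nothing new appeared for a while, assume we reached the end
        scroll_once()
        time.sleep(pause)  # give lazily loaded content time to arrive
        current = count_items()
        if current > last_count:
            last_count = current
            idle_rounds = 0
        else:
            idle_rounds += 1
    return last_count
```

With the objects from the code above, it could be called as `scroll_until_stable(lambda: len(browser.find_elements(By.CSS_SELECTOR, 'img[class="imagen_buscador"]')), lambda: pbody.send_keys(Keys.PAGE_DOWN))` in place of the `for x in range(14)` loop.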

For the Selenium documentation, see https://www.selenium.dev/documentation/
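
If you would rather stay with scrapy-playwright, two things in the request from the question look suspect. The meta key is written `playwright_page_method`, while scrapy-playwright reads `playwright_page_methods` (plural), so the listed methods are probably never run. And `div::boton_cargar_mas.color_runnerinn` is not a valid CSS selector. Below is a rough sketch of the request meta under some assumptions: the `img.imagen_buscador` selector is borrowed from the Selenium code above, the scroll script and its timings are guesses, `build_playwright_meta` is a made-up helper name, and none of this has been run against the site. `playwright_include_page` is left out because the spider never uses the page object.

```python
# JavaScript executed in the page: scroll to the bottom repeatedly until the
# number of product images stops growing. Round limits and delays are guesses.
SCROLL_JS = """async () => {
    const count = () => document.querySelectorAll('img.imagen_buscador').length;
    let previous = -1;
    let idle = 0;
    for (let i = 0; i < 100 && idle < 3; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(resolve => setTimeout(resolve, 1000));
        const current = count();
        idle = current > previous ? 0 : idle + 1;
        previous = current;
    }
    return previous;
}"""


def build_playwright_meta(page_method=None) -> dict:
    """Build the Request meta for a scroll-to-load page (hypothetical helper).

    `page_method` defaults to scrapy_playwright.page.PageMethod; it is a
    parameter only so the sketch can be tried without Scrapy installed.
    """
    if page_method is None:
        from scrapy_playwright.page import PageMethod as page_method
    return {
        "playwright": True,
        # Note the plural key name: "playwright_page_methods".
        "playwright_page_methods": [
            # Wait for the first batch of products (selector from the answer above).
            page_method("wait_for_selector", "img.imagen_buscador"),
            # Then scroll until no more products are appended.
            page_method("evaluate", SCROLL_JS),
        ],
    }
```

In the spider this would be used as `yield scrapy.Request(url=..., meta=build_playwright_meta(), callback=self.parse)`. The project also needs the scrapy-playwright download handlers and the asyncio Twisted reactor set up as described in the scrapy-playwright README.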
