How do I crawl a page and extract the href elements with Scrapy?

ifmq2ha2 · asked 2023-03-23 in Other

I'm trying to crawl the page and print the hrefs, but I'm not getting anything back. Here is the spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Shoes2Spider(CrawlSpider):
    name            = "shoes2"
    allowed_domains = ["stockx.com"]
    start_urls      = ["https://www.stockx.com/sneakers"]

    rules = (
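        # Follow every <a> inside the product tiles; 'css-pnc6ci' is an
        # auto-generated class name copied from the page source at the time.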
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='css-pnc6ci']/a"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        print(response.url)

Here is an example of the hrefs I'm trying to extract: [screenshot of the product-tile links omitted].
When I run the spider I expect to see 40 hrefs, but I get nothing at all. What am I doing wrong?
Here are also the terminal commands used to create the project:

scrapy startproject stockx
cd stockx
scrapy genspider -t crawl shoes2 www.stockx.com/sneakers
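
A quick way to sanity-check the XPath outside the spider is scrapy shell (a sketch; the class name is copied from the spider above and may no longer exist on the live site):

scrapy shell "https://www.stockx.com/sneakers"
>>> response.xpath("//div[@class='css-pnc6ci']/a/@href").getall()

If this returns an empty list, the LinkExtractor will silently extract no links; the class may have changed, or the links may only be rendered by JavaScript.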
Answer 1 (4c8rllxm):

So I just realized that start_urls was set to ["https://www.stockx.com"]. I changed it to ["https://www.stockx.com/sneakers"], and that seems to have fixed the problem.
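
That fits how CrawlSpider works: the rules are applied to each response the spider receives, so if the crawl starts on the bare homepage and that page has no matching product tiles, nothing is ever extracted. A minimal sketch of the fix, with the rest of the spider unchanged:

    # Start on the listing page itself, so the very first response
    # contains the product links the Rule is meant to match.
    start_urls = ["https://www.stockx.com/sneakers"]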

Answer 2 (2wnc66cl):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Shoes2Spider(CrawlSpider):
    name = "shoes2"
    allowed_domains = ["stockx.com"]
    start_urls = ["https://www.stockx.com/sneakers"]

    rules = (
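        # Match the product links by their data-testid attribute, which tends
        # to be more stable than auto-generated class names like 'css-pnc6ci'.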
        Rule(LinkExtractor(restrict_css='[data-testid="RouterSwitcherLink"]'), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        print(response.url)

This will work. I've updated the CSS selector to restrict_css='[data-testid="RouterSwitcherLink"]'.
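
To capture the URLs rather than just printing them, parse_item can yield items that a feed export will write out (a sketch; the "url" field name is my own choice, not part of the original answer):

    def parse_item(self, response):
        # Yielding a dict lets Scrapy's feed exports save the results,
        # e.g.: scrapy crawl shoes2 -O urls.json
        yield {"url": response.url}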
