How do I crawl a page and extract the href elements with Scrapy?

ifmq2ha2 · asked 2023-03-23 in Other

I'm trying to crawl the page and print the hrefs, but I'm not getting anything back. Here is the spider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Shoes2Spider(CrawlSpider):
    name            = "shoes2"
    allowed_domains = ["stockx.com"]
    start_urls      = ["https://www.stockx.com/sneakers"]

    rules = (
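        # Follow every <a> inside the product tiles; 'css-pnc6ci' is an
        # auto-generated class name copied from the page source at the time.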
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='css-pnc6ci']/a"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        print(response.url)

Here is an example of the hrefs I'm trying to extract: [screenshot of the product-tile links omitted].
When I run the spider I expect to see 40 hrefs, but I get nothing at all. What am I doing wrong?
Here are also the terminal commands used to create the project:

scrapy startproject stockx
cd stockx
scrapy genspider -t crawl shoes2 www.stockx.com/sneakers
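
A quick way to sanity-check the XPath outside the spider is scrapy shell (a sketch; the class name is copied from the spider above and may no longer exist on the live site):

scrapy shell "https://www.stockx.com/sneakers"
>>> response.xpath("//div[@class='css-pnc6ci']/a/@href").getall()

If this returns an empty list, the LinkExtractor will silently extract no links; the class may have changed, or the links may only be rendered by JavaScript.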
Answer 1 (4c8rllxm):

So I just realized that start_urls was set to ["https://www.stockx.com"]. I changed it to ["https://www.stockx.com/sneakers"], and that seems to have fixed the problem.
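
That fits how CrawlSpider works: the rules are applied to each response the spider receives, so if the crawl starts on the bare homepage and that page has no matching product tiles, nothing is ever extracted. A minimal sketch of the fix, with the rest of the spider unchanged:

    # Start on the listing page itself, so the very first response
    # contains the product links the Rule is meant to match.
    start_urls = ["https://www.stockx.com/sneakers"]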

Answer 2 (2wnc66cl):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Shoes2Spider(CrawlSpider):
    name = "shoes2"
    allowed_domains = ["stockx.com"]
    start_urls = ["https://www.stockx.com/sneakers"]

    rules = (
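        # Match the product links by their data-testid attribute, which tends
        # to be more stable than auto-generated class names like 'css-pnc6ci'.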
        Rule(LinkExtractor(restrict_css='[data-testid="RouterSwitcherLink"]'), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        print(response.url)

This will work. I've updated the CSS selector to restrict_css='[data-testid="RouterSwitcherLink"]'.
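
To capture the URLs rather than just printing them, parse_item can yield items that a feed export will write out (a sketch; the "url" field name is my own choice, not part of the original answer):

    def parse_item(self, response):
        # Yielding a dict lets Scrapy's feed exports save the results,
        # e.g.: scrapy crawl shoes2 -O urls.json
        yield {"url": response.url}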
