Web scraping pagination with Scrapy

qhhrdooz · asked 2022-11-09

I'm trying to scrape a site with Scrapy, but I keep getting a "duplicate" warning and the spider never moves on to the next page.
How can I crawl all of the numbered pages?
Example site: teknosa.com
Crawl URL: https://www.teknosa.com/bilgisayar-tablet-c-116
Pagination structure: ?s=%3Arelevance&page=0 (1, 2, 3, 4, 5, and so on)
My pagination code:

next_page = soup.find('button', {'title': 'Daha Fazla Ürün Gör'})['data-gotohref']
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
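
Two things are worth noting about the snippet above. First, soup.find() returns None when the button is missing, so the ['data-gotohref'] lookup would raise a TypeError before the None check ever runs. Second, the "duplicate" warning comes from Scrapy's built-in dupefilter, which silently drops any request whose URL fingerprint was already seen; the button is most likely pointing back at a URL that was already crawled. A minimal diagnostic sketch (DebugSpider is a hypothetical name, and dont_filter=True is for debugging only, since it can loop forever if the button always yields the same URL):

import scrapy

class DebugSpider(scrapy.Spider):
    name = 'debug'  # hypothetical spider name
    start_urls = ['https://www.teknosa.com/bilgisayar-tablet-c-116']

    def parse(self, response):
        # Use Scrapy's own selectors; .attrib returns {} when nothing
        # matches, so the .get() below cannot raise like the BS4 version.
        next_page = response.css(
            'button[title="Daha Fazla Ürün Gör"]').attrib.get('data-gotohref')
        if next_page:
            # dont_filter=True bypasses the dupefilter, so the repeated URL
            # shows up in the log instead of being silently dropped.
            yield response.follow(next_page, callback=self.parse,
                                  dont_filter=True)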

kgsdhlau · answer #1

You can do the pagination directly in start_urls and widen or narrow the page-number range as needed.

import scrapy
from scrapy.crawler import CrawlerProcess

class CarsSpider(scrapy.Spider):
    name = 'car'
    # Generate the first ten page URLs up front.
    start_urls = ['https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=' + str(x)
                  for x in range(1, 11)]

    def parse(self, response):
        print(response.url)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(CarsSpider)  # the spider class must be passed to crawl()
    process.start()
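
Running the file directly (e.g. python spider.py) starts the crawl through CrawlerProcess. Scrapy fetches the requests concurrently, which is why the pages in the output below come back out of order.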

Output:

https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=1
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10
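
If the number of pages is not known up front, you can also stop dynamically: keep following the next page until one comes back empty. A minimal sketch of that idea; the 'div.prd' product selector is an assumption, not taken from the site's actual markup:

import scrapy

class CarsSpider(scrapy.Spider):
    name = 'car'
    base_url = 'https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page={}'
    start_urls = [base_url.format(1)]

    def parse(self, response):
        # 'div.prd' is an assumed selector; replace with the real product markup.
        products = response.css('div.prd')
        if not products:
            return  # an empty page means we ran past the last one
        for product in products:
            yield {'page_url': response.url}  # placeholder item
        # Increment the page parameter and keep going.
        page = int(response.url.split('page=')[-1])
        yield scrapy.Request(self.base_url.format(page + 1), callback=self.parse)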

For multiple URLs, paginate with a for loop in start_requests:

import scrapy

class CarsSpider(scrapy.Spider):
    name = 'car'

    def start_requests(self):
        urls = ['url_1', 'url_2', 'url_3', ...]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ...
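
The two patterns combine naturally: build the paginated URLs for every category inside start_requests. A sketch with hypothetical category slugs (only bilgisayar-tablet-c-116 appears in the question; the second slug is invented for illustration):

import scrapy

class CarsSpider(scrapy.Spider):
    name = 'car'

    def start_requests(self):
        # 'telefon-c-100' is a made-up example slug; substitute real categories.
        categories = ['bilgisayar-tablet-c-116', 'telefon-c-100']
        for category in categories:
            for page in range(1, 11):
                url = f'https://www.teknosa.com/{category}?s=%3Arelevance&page={page}'
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.info(response.url)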
