Scrapy spider runs, but doesn't scrape the page

ct2axkht  posted 2022-11-09 in Other

I'm new to web scraping and am trying to run a simple spider to collect name, brand, and price information from a website that sells mountain bikes. I'm trying to build and run the spider entirely in one script, because I figured that would be simpler for someone at my level. The spider runs, but the .csv file it produces is empty. The terminal output after running the spider reads INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min). I'm not sure why this spider won't crawl or collect anything from the site.
I've tested the URL in my code, and I've tested the CSS selectors that pick out the information I want from the site's HTML (I also tried XPath, with no luck). I've also tried writing the loop that scrapes all of the site's pages in several different ways.
The only ideas I have left are that some syntax error near the top of my code is making the spider fail, or that something is wrong with the loop that's supposed to crawl all the subsequent pages rather than just the first. Also, the site uses infinite scrolling. Could that be the problem?
What am I doing wrong?!?!?!
All of my code / error messages / info is included below.

  • Code:
import scrapy
import requests
from scrapy.crawler import CrawlerProcess

class BikeSpider(scrapy.Spider):
    name='mountianbikespider'

    def start_requests(self):
        yield scrapy.Request('https://www.incycle.com/pages/search-results-page?collection=mountain-bikes&page=1')

    def parse(self, response):
        products = response.css('li.snize-product-in-stock')
        for item in products:      
            yield {
                'name' : item.css('span.snistrong textze-title::text').extract(),
                'description' : item.css('span.snize-description::text').extract(),
                'price' : item.css('span.snize-price::text').extract()    
            }

# this loop will make the spider not only crawl the first page of bikes, but also continue to all pages afterwards, collecting the same info as on page 1

# to do this you must change the url to include page={x} in place of page=1

        for x in range(2,10):
            yield(scrapy.Request(f'https://www.incycle.com/pages/search-results-page?collection=mountain-bikes&page={x}', callback=self.parse))

# this is what saves the data in a separate place (in this case a csv named bikes.csv)

process = CrawlerProcess(settings={
    "FEEDS":{"bikes.csv":{"format": "csv"}} 
})    

# this is what actually runs the spider

process.crawl(BikeSpider)
process.start()

Thank you!!!

dsekswqp 1#

1. Your selector is wrong.
2. Even if it were correct, the page is loaded with JavaScript, so it wouldn't work anyway.
3. Not an error as such, but you can replace the pagination loop with start_urls = [f'https://www.incycle.com/pages/search-results-page?collection=mountain-bikes&page={i}' for i in range(1, 10)] at the top of the class; it reads better (see the sketch after this list).
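
A minimal sketch of point 3 (my reconstruction, not the asker's code; I've also assumed the intended title selector was span.snize-title, and per point 2 this will still come back empty on this site):

import scrapy

class BikeSpider(scrapy.Spider):
    name = 'mountianbikespider'
    # pagination handled declaratively: Scrapy requests every URL in this list
    start_urls = [
        f'https://www.incycle.com/pages/search-results-page?collection=mountain-bikes&page={i}'
        for i in range(1, 10)
    ]

    def parse(self, response):
        # parse() no longer needs to yield follow-up page requests
        for item in response.css('li.snize-product-in-stock'):
            yield {
                'name': item.css('span.snize-title::text').extract(),
                'description': item.css('span.snize-description::text').extract(),
                'price': item.css('span.snize-price::text').extract(),
            }
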
To scrape the page's content, you can either dig it out of the script tag (I didn't try that, since it's harder) or get it from the API:

import scrapy
import logging

class BikeSpider(scrapy.Spider):
    name = 'mountianbikespider'
    start_urls = ['https://www.incycle.com/pages/search-results-page?collection=mountain-bikes&page=1']
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Host": "searchserverapi.com",
        "Pragma": "no-cache",
        "Referer": "https://www.incycle.com/",
        "Sec-Fetch-Dest": "script",
        "Sec-Fetch-Mode": "no-cors",
        "Sec-Fetch-Site": "cross-site",
        "Sec-GPC": "1",
        "TE": "trailers",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    }
    custom_settings = {'DOWNLOAD_DELAY': 0.6}
    API_key = ''
    start_index = 0
    max_results = 50
    total_pages_to_scrape = 3

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.get_API_key)

    def get_API_key(self, response):
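        # the key appears after "a=" in the inline script that loads the search widget,
        # up to the next backslash (an escaped character inside the script source)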
        API_key = response.xpath('//script[contains(text(), "searchserverapi")]/text()').re_first(r'a=(.+?)\\')
        if not API_key:
            self.log('Could not find the API key', logging.ERROR)
            return

        self.API_key = API_key
        url = f'https://searchserverapi.com/getresults?api_key={self.API_key}&q=&sortBy=created&sortOrder=desc&startIndex={self.start_index}&maxResults={self.max_results}&items=true&pages=true&categories=true&suggestions=true&queryCorrection=true&suggestionsMaxResults=3&pageStartIndex=0&pagesMaxResults=20&categoryStartIndex=0&categoriesMaxResults=20&facets=true&facetsShowUnavailableOptions=false&ResultsTitleStrings=2&ResultsDescriptionStrings=2&collection=mountain-bikes&action=moreResults&output=jsonp'
        yield scrapy.Request(url=url, headers=self.headers)

    def parse(self, response):
        json_data = response.json()

        for item in json_data['items']:
            yield {
                'name': item['title'],
                'description': item['description'],
                'price': item['price']
            }

        # next page
        self.start_index += self.max_results
        if self.start_index > self.total_pages_to_scrape*self.max_results:
            self.log('Finished scraping')
            return

        url = f'https://searchserverapi.com/getresults?api_key={self.API_key}&q=&sortBy=created&sortOrder=desc&startIndex={self.start_index}&maxResults={self.max_results}&items=true&pages=true&categories=true&suggestions=true&queryCorrection=true&suggestionsMaxResults=3&pageStartIndex=0&pagesMaxResults=20&categoryStartIndex=0&categoriesMaxResults=20&facets=true&facetsShowUnavailableOptions=false&ResultsTitleStrings=2&ResultsDescriptionStrings=2&collection=mountain-bikes&action=moreResults&output=jsonp'
        yield scrapy.Request(url=url, headers=self.headers)

If you want to get other products, you'll need to change collection in the URL to another value; for convenience you may want to pull it out into a variable.
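
For example (a sketch only: the collection attribute and build_url helper are my names, not part of the answer's code, and the query string is abbreviated from the full URL above):

class BikeSpider(scrapy.Spider):
    name = 'mountianbikespider'
    collection = 'mountain-bikes'  # change this value to scrape a different collection

    def build_url(self):
        # same searchserverapi.com endpoint as above, shortened for readability;
        # reuse this in get_API_key() and parse() instead of repeating the URL
        return (f'https://searchserverapi.com/getresults?api_key={self.API_key}'
                f'&q=&startIndex={self.start_index}&maxResults={self.max_results}'
                f'&items=true&collection={self.collection}&action=moreResults&output=jsonp')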
