Scrapy: passing your own arguments to a callback function from a separate method

hgb9j2n6  posted on 2022-11-09  in Other

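I am refactoring my code and deliberately breaking things in order to learn. I have now broken something and hope you can help me understand why.
I have a working scraper that runs across multiple pages, as follows: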

import json

import scrapy


class someSpider(scrapy.Spider):
    name = 'spider_name'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com&page=1']

    def parse(self, response):
        result_parsed = json.loads(response.text)
        results = result_parsed.get('results')  # the actual results
        current_page_number = result_parsed.get('currentPage')  # page number reported in the API response

        count = 0
        for result in results:
            count += 1

            yield {

                ...  # gives me the results as desired

            }

        go_to_nextpage(self, current_page_number)  #### THIS DOES NOT WORK: no error, it just stops after one page ####

        #### THIS WORKS ####
        # next_page_number = result_parsed.get('currentPage') + 1
        # yield scrapy.Request(
        #     url=f'https://www.immoweb.be/en/search-results/house-and-apartment/for-sale/brussels/district?countries=BE&hasRecommendationActivated=true&page={next_page_number}&orderBy=relevance&searchType=similar',
        #     callback=self.parse
        # )

with go_to_nextpage() defined as:

def go_to_nextpage(self, current_page_number):
    next_page_number = current_page_number +1
    yield scrapy.Request(
        url=f'https://www.example.com&page={next_page_number}',
        callback=self.parse
    )

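I think there are two things I do not fully understand:

  • what the self keyword does
  • how the callback and the parse method work and interact

Any help is much appreciated.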

mec1mxoz  #1

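There are a couple of issues here, which I hope I can help clarify.

  • You are not using the self parameter correctly.
      • In Python, when you call an instance method like this: myclass.method(), myclass is the variable that holds the class instance.
      • When you call a method of the same object from inside another instance method, you go through the self variable, and Python passes the instance in as the first argument automatically: self.method().
      • In the context of your code, the call should look like this: self.go_to_nextpage(current_page_number)
  • Scrapy can only process requests that are returned or yielded from its parser callbacks.
      • You correctly yield the items, as you point out, but the request created by go_to_nextpage never reaches Scrapy, because the current code does nothing with that method's return value.
      • A further issue is that you use yield inside go_to_nextpage, which automatically turns that method into a generator. Calling it only creates a generator object; its body does not run until something iterates over it.
      • The simplest solution is to return the request directly instead of yielding it.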
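
Both points can be seen without Scrapy at all. The following is a minimal sketch in plain Python; the Pager class, its method names and the calls list are hypothetical stand-ins invented for this illustration, not part of Scrapy or of the original spider.

```python
class Pager:
    """Hypothetical stand-in for a spider; no Scrapy required."""

    def __init__(self):
        self.calls = []  # records which method bodies have actually run

    def next_page(self, current_page_number):
        # Plain method: the body runs as soon as the method is called.
        self.calls.append("next_page")
        return current_page_number + 1

    def next_page_gen(self, current_page_number):
        # Contains `yield`, so calling it only builds a generator object.
        self.calls.append("next_page_gen")
        yield current_page_number + 1


pager = Pager()

# These two calls are equivalent: Python passes `pager` as `self`.
bound_result = pager.next_page(1)
explicit_result = Pager.next_page(pager, 1)

pager.calls.clear()
discarded = pager.next_page_gen(1)           # nothing has run yet
calls_before_iteration = list(pager.calls)   # still empty
consumed = list(discarded)                   # iterating runs the body
calls_after_iteration = list(pager.calls)    # now contains "next_page_gen"
```

This is the same situation as in the question: the call creates a generator, nobody iterates over it, so the request inside it is never built, and Python has no reason to raise an error.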

Here is an example of how it should look:

import json

import scrapy


class someSpider(scrapy.Spider):
    name = 'spider_name'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com&page=1']

    def parse(self, response):
        result_parsed = json.loads(response.text)
        results = result_parsed.get('results')
        current_page_number = result_parsed.get('currentPage')
        count = 0
        for result in results:
            count += 1
            yield {'result': result}  # placeholder for the real item fields
        # go_to_nextpage(self, current_page_number) <- this line is the issue,
        # because nothing handles its return value.
        # Yield the returned request so that Scrapy can schedule it.
        yield self.go_to_nextpage(current_page_number)

    def go_to_nextpage(self, current_page_number):
        next_page_number = current_page_number + 1
        # Return the request instead of yielding it, so this stays a plain method.
        # Note: the host must be covered by allowed_domains, or Scrapy's offsite filter drops the request.
        return scrapy.Request(
            url=f'https://www.immoweb.be/en/search-results/house-and-apartment/for-sale/brussels/district?countries=BE&hasRecommendationActivated=true&page={next_page_number}&orderBy=relevance&searchType=similar',
            callback=self.parse,
        )

If you would rather keep using yield inside your go_to_nextpage method, you can write it like this:

class someSpider(scrapy.Spider):
    def parse(self, response):
        ...
        # go_to_nextpage is a generator here, so iterate over it
        # and re-yield every request it produces.
        for request in self.go_to_nextpage(current_page_number):
            yield request

    def go_to_nextpage(self, current_page_number):
        next_page_number = current_page_number + 1
        yield scrapy.Request(
            url=f'https://www.immoweb.be/en/search-results/house-and-apartment/for-sale/brussels/district?countries=BE&hasRecommendationActivated=true&page={next_page_number}&orderBy=relevance&searchType=similar',
            callback=self.parse,
        )
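
To compare what Scrapy would actually receive from parse in each variant, here is a rough sketch that runs without Scrapy. Plain dicts stand in for items and for scrapy.Request objects, and every function name below is made up for this illustration. It also uses yield from, which is standard Python shorthand for the for loop shown above.

```python
def go_to_nextpage_return(current_page_number):
    # Variant 1: build the "request" and return it.
    return {"page": current_page_number + 1}


def go_to_nextpage_yield(current_page_number):
    # Variant 2: generator version of the same helper.
    yield {"page": current_page_number + 1}


def parse_with_return(current_page_number):
    yield {"item": "a"}
    # Yield the single object that the helper returned.
    yield go_to_nextpage_return(current_page_number)


def parse_with_yield_from(current_page_number):
    yield {"item": "a"}
    # Re-yield everything the helper generator produces.
    yield from go_to_nextpage_yield(current_page_number)


def parse_broken(current_page_number):
    yield {"item": "a"}
    # The generator is created and thrown away; the caller never sees it.
    go_to_nextpage_yield(current_page_number)


# Consuming each generator mimics Scrapy iterating over the callback output.
out_return = list(parse_with_return(1))
out_yield_from = list(parse_with_yield_from(1))
out_broken = list(parse_broken(1))
```

The first two variants hand over both the item and the next-page object, while the broken one hands over only the item, which matches the "stops after one page" behaviour described in the question.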
