Python Scrapy web scraping: problem getting the URL inside an onclick element that contains AJAX content

xsuvu9jc  posted on 2022-11-09 in Python

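I'm a beginner with Scrapy. I'm trying to scrape user reviews of specific books from goodreads.com. I want to scrape all of the reviews for a book, so I have to parse every review page. At the bottom of each review page there is a next_page button, but the link to the next page is embedded in the button's onclick attribute, and that is where I'm stuck. The onclick value contains an AJAX request, and I don't know how to handle this case. Thanks in advance for your help.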
Picture of the next_page button
The content of the onclick attribute
The remaining part of the onclick attribute
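I'm also new to posting on Stack Overflow, so I apologize for any mistakes. :)
I'm sharing my scraping code below.
Also, here is an example link to one of the books; the reviews section is near the bottom of the page.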
a book_link

import scrapy
from ..items import GoodreadsItem
from scrapy import Request
from urllib.parse import urljoin
from urllib.parse import urlparse

class CrawlnscrapeSpider(scrapy.Spider):
    name = 'crawlNscrape'
    allowed_domains = ['www.goodreads.com']
    start_urls = ['https://www.goodreads.com/list/show/702.Cozy_Mystery_Series_First_Book_of_a_Series']

    def parse(self, response):

        #collect all book links in this page then make request for 
        #parse_page function
        for href in response.css("a.bookTitle::attr(href)") :
            url=response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page)

        #go to the next page and make request for next page and call parse 
        #function again
        next_page = response.xpath("(//a[@class='next_page'])[1]/@href")
        if next_page:
            url= response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_page(self, response):

        #call goodreads item and create empty dictionary with name book
        book = GoodreadsItem()
        title = response.css("#bookTitle::text").get()
        reviews = response.css(".readable span:nth-child(2)::text").getall()

        #add book and reviews that earned into dictionary
        book['title'] = title
        book['reviews'] = reviews#take all reviews about book in single page

        # i want to extract all of the review pages for any book ,
        # but there is a ajax request in onclick button
        # so i cant scrape link of next page.
        next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url,callback=self.parse_page)

        yield book
flvlnr44 #1

Instead of the following code:

next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url,callback=self.parse_page)

try the following.
First, import search from the re module:

from re import search

Then use the following for pagination:

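# Select the onclick attribute of the "next" link, whose href is just "#".
next_page_html = response.xpath("//a[@class='next_page' and @href='#']/@onclick").get()
if next_page_html is not None:
    # Capture the URL path passed as the first quoted argument of Request(...).
    next_page_href = search(r"Request\(.([^']+)", next_page_html)
    if next_page_href:
        url = response.urljoin(next_page_href.group(1))
        yield scrapy.Request(url, callback=self.parse_page)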
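
To check the regular expression outside of a spider, here is a minimal standalone sketch. The onclick string, the book id 12345, and the helper name extract_ajax_path are all made up for illustration, since the screenshots of the real attribute are not included here. The real value on the page may look different, so the pattern may need adjusting.

```python
import re
from urllib.parse import urljoin

# Hypothetical onclick value, modeled on the shape the regex expects
# (a quoted path as the first argument of a Request(...) call).
SAMPLE_ONCLICK = (
    "new Ajax.Request('/book/reviews/12345?page=2', "
    "{asynchronous:true, evalScripts:true, method:'get'}); return false;"
)


def extract_ajax_path(onclick):
    """Return the first quoted argument of Request(...), or None if absent."""
    if not onclick:
        return None
    # Accept either quote style around the path; stop at the closing quote.
    match = re.search(r"Request\(['\"]([^'\"]+)", onclick)
    return match.group(1) if match else None


path = extract_ajax_path(SAMPLE_ONCLICK)
# urljoin mirrors what response.urljoin does with the page URL as the base.
next_url = urljoin("https://www.goodreads.com/book/show/12345", path)
print(next_url)
```

Keeping the extraction in a small function like this makes it easy to try against the actual attribute value copied from the browser before wiring it into parse_page.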
