Extracting DOIs from multiple pages with Scrapy

Posted by f87krz0w on 2023-05-17 in Other

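I have this web page (https://academic.oup.com/plphys/search-results?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1) and I want to extract information from it, such as the title, names, DOI, and so on. I can do this easily for the first page, but I cannot crawl through the remaining pages. My code is: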

import scrapy

class PhotosynSpiderSpider(scrapy.Spider):
    name = 'photosyn_spider'    
    allowed_domains = ['https://academic.oup.com/plphys']
    start_urls = ['https://academic.oup.com/plphys/search-results?q=photosynthesis&allJournals=1&fl_SiteID=6323']

    def parse(self, response):
        # Step 1: Locate the first page in div class 'pageNumbers al-pageNumbers'
        page_numbers = response.css('div.pageNumbers.al-pageNumbers')
        current_page = page_numbers.css('span.current-page::text').get()
        total_pages = page_numbers.css('span.total-pages::text').get()

        # Step 2: Locate link in a class 'al-citation-list', and extract all the href for doi in the element 'a'
        citation_list = response.css('a.al-citation-list')
        dois = citation_list.css('a::attr(href)').getall()

        for doi in dois:
            yield {'doi': doi}

        # Step 3: Open url for the next page in the element 'a' and class 'sr-nav-next al-nav-next' and repeat step 2
        if current_page != total_pages:
            next_page_url = response.css('a.sr-nav-next.al-nav-next::attr(href)').get()
            yield scrapy.Request(next_page_url, callback=self.parse)

I am trying to dump the results to a JSON file, but the result is empty. Can anyone help me? Thanks.

Answer 1 (ddrv8njm)

If you look at the Next page element, you will see that its href attribute is not an actual URL:

<a role="button" aria-label="Next" href="javascript:;" class="sr-nav-next al-nav-next" data-url="q=photosynthesis&amp;allJournals=1&amp;fl_SiteID=6323&amp;page=2" data-google-interstitial="false">
   Next
</a>

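This is because clicking the Next button does not take you to a new page. Instead, it uses JavaScript to swap the contents of the articles section through an AJAX call.
By taking the URL used in that AJAX call and matching its pattern, we can get all of the results from the subsequent pages.
For example: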

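import scrapy


class PhotosynSpiderSpider(scrapy.Spider):
    name = 'photosyn_spider'

    def start_requests(self):
        # Endpoint the Next button calls via AJAX; only the page number changes.
        ajax_url = 'https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page='
        # Requests pages 1 through 49; adjust the upper bound to the number of result pages.
        for i in range(1, 50):
            yield scrapy.Request(ajax_url + str(i))

    def parse(self, response):
        # Each search result sits in its own article box.
        for row in response.css("div.sr-list.al-article-box.al-normal.clearfix"):
            # The DOI link is the first anchor inside the citation block.
            doi = row.xpath(".//div[@class='al-citation-list']//a/@href").get()
            yield {"doi": doi}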

Output for pages 1-2:

{'doi': 'https://doi.org/10.1093/plphys/kiac484'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa026'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa032'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.120.2.599'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.109.139378'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.085167'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.085886'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.090449'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.119.2.553'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.015479'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.97.1.415'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.2.283'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.2.228'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.6.728'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.1.149'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.29.1.64'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.16.4.721'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa119'}
2023-05-09 23:07:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1> (referer: None)
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.73.4.1002'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.59.5.868'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.75.1.82'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.68.4.894'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.81.4.1115'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.59.5.859'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.93.4.1466'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.95.4.1270'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.48.6.712'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.89.2.409'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.89.4.1231'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.26.3.581'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.100.2.947'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.71.4.855'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.62.1.127'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.72.1.16'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.61.2.150'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.20.00264'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1093/plphys/kiac602'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1093/plphys/kiad183'}
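
The fixed `range(1, 50)` in the spider is a guess at the number of result pages. One possible refinement, sketched here, is to build each following AJAX URL from the Next button's `data-url` value instead. This assumes the AJAX response still contains that button, which the answer does not confirm. `AJAX_ENDPOINT` and `build_next_url` are hypothetical names introduced only for this illustration.

```python
from typing import Optional

# Endpoint copied from the spider in this answer.
AJAX_ENDPOINT = "https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults"


def build_next_url(data_url: Optional[str]) -> Optional[str]:
    """Turn the Next button's data-url value into a full AJAX URL.

    Returns None when there is no value, which is assumed to mean
    that the last page has been reached.
    """
    if not data_url:
        return None
    # data-url already holds the query string, so only the endpoint is prepended.
    return f"{AJAX_ENDPOINT}?{data_url}"


if __name__ == "__main__":
    # Value taken from the Next button shown earlier, with &amp; already decoded.
    print(build_next_url("q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2"))
```

Inside a Scrapy callback, the value could come from `response.css('a.sr-nav-next::attr(data-url)').get()`, and a result other than `None` could be passed to `scrapy.Request` with the same callback.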

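Note: At the time of writing this answer, the site has a CAPTCHA in place. If you try to scrape the site while the CAPTCHA is active, all you need to do is copy the cookies from your browser and insert them into each request in the start_requests method.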
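
As a rough sketch of the cookie step, the helper below turns a `Cookie` header string copied from the browser's developer tools into a dict. The function name `cookies_from_header` and the cookie names in the demo are made up for illustration; the real names and values depend on what the site sets in your browser session.

```python
from typing import Dict


def cookies_from_header(cookie_header: str) -> Dict[str, str]:
    """Parse a 'name=value; name2=value2' Cookie header string into a dict."""
    cookies: Dict[str, str] = {}
    for part in cookie_header.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Hypothetical cookie string; paste the one copied from your browser instead.
    print(cookies_from_header("session_id=abc123; captcha_token=xyz=="))
```

The resulting dict could then be passed through the `cookies` argument of each request, for example `scrapy.Request(ajax_url + str(i), cookies=cookies)` in `start_requests`.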
