Scrapy returns only 10 of about 50 divs when scraping a web page

nle07wnf  posted on 2022-11-09 in Other

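I have been trying to scrape this page with Scrapy to collect editorial data.
The Editorial Board Members section contains 54 editors in 54 div tags.
I tried to scrape the data from all of these div tags, but I only get 10 items.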

len(response.css("#moreGeneralEditors>div"))

returns 10. This is the code snippet I use to get the data:

import scrapy

class MdpjournalSpider(scrapy.Spider):
    name = 'try'
    start_urls = ["https://www.mdpi.com/journal/agrochemicals/editors"]

    def parse(self, response):
        outer_divs = response.css("div.middle-column__main.ul-spaced div.content__container>div")

        for inner_divs in outer_divs:
            if inner_divs.css("#moreGeneralEditors")!=[]:
                divs = inner_divs.css("#moreGeneralEditors>div")

                for inner_div in divs:
                    if inner_div.css("div.editor-div__content.img-exists")!=[]:
                        editor = inner_div.css("div.editor-div__content.img-exists:nth-child(2) b::text").get()
                        role = "editor"

                        yield {"editor":editor,"role":role}

                    elif inner_div.css("div.editor-div__content")!=[]:
                        editor = inner_div.css("div.editor-div__content:nth-child(1) b::text").get()
                        role = "editor"

                        yield {"editor":editor,"role":role}

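Editors with an image and editors without an image use two different CSS classes. I only care about the Editorial Board Members section. The editor data of every journal has this problem; here is the link to the list of all journals: all journals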

uplii1fm #1

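You only get 10 items because the remaining 44 are loaded dynamically from an external source through an API, so they are not in the HTML that Scrapy downloads for the page. You have to request the API URL instead.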
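The question targets the `agrochemicals` journal, while the endpoint used in this answer is for `allergies`. As a rough sketch, assuming other journals expose the same `/editors/ajax` path with the same query parameters (this is not confirmed by the source, and the meaning of `term` and `board` is not stated in the answer), the URL could be built from the journal name. The helper name `build_editors_ajax_url` is hypothetical:

```python
from urllib.parse import urlencode


def build_editors_ajax_url(journal_slug, term=10, board=3):
    """Build the editors AJAX URL for one journal.

    Assumption: every journal uses the same path and parameters as the
    'allergies' URL shown in this answer. The default values of `term`
    and `board` are copied from that URL as-is.
    """
    query = urlencode({"term": term, "board": board})
    return f"https://www.mdpi.com/journal/{journal_slug}/editors/ajax?{query}"


print(build_editors_ajax_url("agrochemicals"))
# https://www.mdpi.com/journal/agrochemicals/editors/ajax?term=10&board=3
```

It is worth confirming the real request in the browser's network tab for each journal before relying on this pattern.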

Example:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # AJAX endpoint that serves the editors missing from the initial HTML
        api_url = 'https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
            # Mark the request as an XMLHttpRequest, as the browser does
            'x-requested-with': 'XMLHttpRequest',
        }
        yield scrapy.Request(url=api_url, method='GET', callback=self.parse, headers=headers)

    def parse(self, response):
        # The class attribute is compared as an exact string, so the trailing
        # space in the first expression is significant.
        members = (
            response.xpath('//*[@class="editor-div__content "][1]/b')
            + response.xpath('//*[@class="editor-div__content img-exists"][1]/b')
        )
        for member in members:
            yield {"editor": member.xpath('.//text()').get()}

Output:

{'editor': ' Dr. Pasquale Comberiati'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Audrey DunnGalvin'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Monica Greco'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Inkyu Hwang'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Inaki Izquierdo'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Gisèle Kanny'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Chang Kim'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Rosario Linacero'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Soheila J. Maleki'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Giuseppe Murdaca'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Kazuyuki Nakagome'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Eleonora Nucera'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Franziska Roth-Walter'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Youn Young Shim'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Carina Gabriela Uasuf'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Joana Costa'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Magdalena Czarnecka-Operacz'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Danilo Di Bona'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Araceli Díaz -Perales'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Maria Gasset'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Elena Gimenez-Arnau'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Houman Goudarzi'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Lars Hellman'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Christiane Hilger'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Russell Hopp'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Mats W. Johansson'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Marat V. Khodoun'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Uday Kishore'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Rebecca Knibb'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Heung-Man Lee'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Isabel Mafra'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Mario Malerba'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Arduino A. Mangoni'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Nobuaki Miyahara'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Linda Monaci'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Tatsuya Moriyama'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Maria Pino-Yanes'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Daniel P. Potaczek'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Antonietta Rossi'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Ann-Marie Malby Schoos'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Gregory Seumois'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Cenk Suphioglu'}
2022-08-23 00:46:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Junji Yodoi'}
2022-08-23 00:46:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Gianvincenzo Zuccotti'}
2022-08-23 00:46:22 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-23 00:46:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 11876,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.539094,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 22, 18, 46, 22, 26301),
 'httpcompression/response_bytes': 59114,
 'httpcompression/response_count': 1,
 'item_scraped_count': 44,
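
The scraped names in the output above start with a space (for example `' Dr. Pasquale Comberiati'`). A minimal sketch of a cleanup step that could be applied before yielding each item; `clean_name` is a hypothetical helper, not part of the original answer:

```python
def clean_name(raw):
    """Trim surrounding whitespace and collapse internal runs of whitespace.

    Returns None unchanged, since Scrapy's .get() returns None when
    nothing matches.
    """
    if raw is None:
        return None
    return " ".join(raw.split())


print(clean_name(" Dr. Pasquale Comberiati"))
# Dr. Pasquale Comberiati
```

Inside `parse`, this would be used as `{"editor": clean_name(member.xpath('.//text()').get())}`.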
