Scrapy results inconsistent and incomplete?

cczfrluj · asked on 2022-12-18

I am trying to scrape all of the XML file links from this domain. When I use the scrapy shell, I get the relative link I expect:

>>> response.xpath('//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href').extract()[1]
'/dhq/vol/16/3/000642.xml'

But when I try to crawl and yield all of the links, the CSV file I end up with is full of incomplete links, or just the root URL.
Sample of the data: https://pastebin.com/JqCKnxV5

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DhqSpider(CrawlSpider):
    name = 'dhq'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
            Rule(LinkExtractor(allow = 'index.html')), 
            Rule(LinkExtractor(allow = 'vol'), callback='parse_xml'),        
        )
    
    def parse_xml(self, response):
        xmllinks = response.xpath('//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href').extract()[1]
        for link in xmllinks:
                yield{
                    'file_urls': [response.urljoin(link)]
                }

What am I missing in my urljoin that is producing these incomplete links and/or root-only links?

falq053o

The CrawlSpider scrapes data from each detail page. Your selector matches two elements, but you only need one, and the way you index into the result is what breaks the loop: extract()[1] returns a single string, so your for loop iterates over the characters of that string rather than over links. You can apply positional indexing inside the XPath expression itself and avoid the unnecessary for loop.

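A small standalone sketch with urllib.parse.urljoin shows how iterating over that one string, character by character, produces exactly the root-only and truncated URLs described in the question (the detail-page URL here is only an assumed example, not taken from the post):

from urllib.parse import urljoin

page = 'http://www.digitalhumanities.org/dhq/vol/16/3/000642/000642.html'  # assumed detail-page URL
link = '/dhq/vol/16/3/000642.xml'  # what extract()[1] actually returns: one string, not a list

for ch in link[:2]:  # iterating over a string yields single characters
    print(urljoin(page, ch))
# '/' -> http://www.digitalhumanities.org/                          (the root-only links)
# 'd' -> http://www.digitalhumanities.org/dhq/vol/16/3/000642/d     (the incomplete links)

The corrected spider instead selects the single href directly inside the XPath expression:
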
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DhqSpider(CrawlSpider):
    name = 'dhq'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
            Rule(LinkExtractor(allow = 'index.html')), 
            Rule(LinkExtractor(allow = 'vol'), callback='parse_xml'),        
        )
    
    def parse_xml(self, response):
        xmllink = response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()
        
        yield {
            'file_urls': response.urljoin(xmllink)
        }

Output:

{'file_urls': 'http://www.digitalhumanities.org/dhq/vol/12/1/000355.xml'}
2022-12-14 20:28:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.digitalhumanities.org/dhq/vol/12/1/000346/000346.html> (referer: http://www.digitalhumanities.org/dhq/vol/12/1/index.html)
2022-12-14 20:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.digitalhumanities.org/dhq/vol/12/1/000346/000346.html>
{'file_urls': 'http://www.digitalhumanities.org/dhq/vol/12/1/000346.xml'}
2022-12-14 20:29:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.digitalhumanities.org/dhq/vol/12/1/000362/000362.html> (referer: http://www.digitalhumanities.org/dhq/vol/12/1/index.html)
2022-12-14 20:29:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.digitalhumanities.org/dhq/vol/12/1/000362/000362.html>
{'file_urls': 'http://www.digitalhumanities.org/dhq/vol/12/1/000362.xml'}
2022-12-14 20:29:03 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-14 20:29:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 242004,
 'downloader/request_count': 754,
 'downloader/request_method_count/GET': 754,
 'downloader/response_bytes': 69368110,
 'downloader/response_count': 754,
 'downloader/response_status_count/200': 754,
 'dupefilter/filtered': 3221,
 'elapsed_time_seconds': 51.448049,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 14, 14, 29, 3, 317586),
 'item_scraped_count': 697,

...and so on.
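
As an aside, if the goal were to collect every .xml link on a detail page rather than only the first one, a variant of the callback (a sketch, not part of the answer above) would drop the positional index and loop over the list returned by getall():

    def parse_xml(self, response):
        # getall() returns a list of href strings, so this yields one item per link
        xmllinks = response.xpath('//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href').getall()
        for link in xmllinks:
            yield {
                'file_urls': [response.urljoin(link)]
            }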

Update:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DhqSpider(CrawlSpider):
    name = 'dhq'

    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    
    rules = (
            Rule(LinkExtractor(allow = 'index.html')), 
            Rule(LinkExtractor(allow = 'vol'), callback='parse_xml'),        
        )
    
    def parse_xml(self, response):
        #xmllink = response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()
        #'file_urls': response.urljoin(xmllink)
        
        yield { 
            'title' : response.css('h1.articleTitle::text').get().strip().replace('\n', ' ').replace('\t',''),
            'author' : response.css('div.author a::text').get().strip(),
            'pubinfo' : response.css('div#pubInfo::text').getall(),
            'xmllink': response.urljoin(response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()),
            #'referrer_url' : response.url
        }

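(The JSON excerpt below is presumably the result of a feed export, e.g. scrapy crawl dhq -o items.json; the exact command used is not shown in the post.)
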
Output:

{
        "title": "Textension: Digitally Augmenting Document Spaces                     in Analog Texts",
        "author": "Adam James Bradley",
        "pubinfo": [
            "2019",
            "Volume 13 Number 3"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/3/000426.xml"
    },
    {
        "title": "Building the",
        "author": "Cait Coker",
        "pubinfo": [
            "2019",
            "Volume 13 Number 3"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/3/000428.xml"
    },
    {
        "title": "Dendrography and Art History: a                     computer-assisted analysis of Cézanne’s",
        "author": "Melinda Weinstein",
        "pubinfo": [
            "2019",
            "Volume 13 Number 3"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/3/000423.xml"
    },
    {
        "title": "The Invisible Work of the Digital Humanities                     Lab: Preparing Graduate Students for Emergent Intellectual and Professional                     Work",
        "author": "Dawn Opel",
        "pubinfo": [
            "2019",
            "Volume 13 Number 2"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/2/000421.xml"
    },
    {
        "title": "Modelling Medieval Hands: Practical OCR for                     Caroline Minuscule",
        "author": "Brandon W. Hawk",
        "pubinfo": [
            "2019",
            "Volume 13 Number 1"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/1/000412.xml"
    },
    {
        "title": "Introduction:                     Questioning",
        "author": "Tarez Samra Graban",
        "pubinfo": [
            "2019",
            "Volume 13 Number 2"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/2/000416.xml"
    },
    {
        "title": "Racism in the Machine: Visualization Ethics in                     Digital Humanities Projects",
        "author": "Katherine Hepworth",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000408.xml"
    },
    {
        "title": "Narrelations — Visualizing Narrative Levels and their Correlations with Temporal Phenomena",
        "author": "Hannah Schwan",
        "pubinfo": [
            "2019",
            "Volume 13 Number 3"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/3/000414.xml"
    },
    {
        "title": "Towards 3D Scholarly Editions: The Battle of                     Mount Street Bridge",
        "author": "Costas Papadopoulos",
        "pubinfo": [
            "2019",
            "Volume 13 Number 1"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/13/1/000415.xml"
    },
    {
        "title": "Visual Communication and the promotion of                     Health: an exploration of how they intersect in Italian education",
        "author": "Viviana De Angelis",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000407.xml"
    },
    {
        "title": "Best Practices: Teaching Typographic Principles                     to Digital Humanities Audiences",
        "author": "Amy Papaelias",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000405.xml"
    },
    {
        "title": "Placing Graphic                     Design at the Intersection of Information Visualization Fields",
        "author": "Yvette Shen",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000406.xml"
    },
    {
        "title": "Making and Breaking: Teaching Information Ethics                     through Curatorial Practice",
        "author": "Christina Boyles",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000404.xml"
    },
    {
        "title": "Critically engaging with data visualization                     through an information literacy framework",
        "author": "Steven Braun",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000402.xml"
    },
    {
        "title": "Renaissance Remix.",
        "author": "Deanna Shemek",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000400.xml"
    },
    {
        "title": "Crowdsourcing Image Extraction and Annotation:                     Software Development and Case Study",
        "author": "Ana Jofre",
        "pubinfo": [
            "2020",
            "Volume 14 Number 2"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/14/2/000469.xml"
    },
    {
        "title": "Defining scholarly practices, methods and tools                     in the Lithuanian digital humanities research community",
        "author": "Ingrida Kelpšienė",
        "pubinfo": [
            "2018",
            "Volume 12 Number 4"
        ],
        "xmllink": "http://www.digitalhumanities.org/dhq/vol/12/4/000401.xml"
    }
]
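
One more note: if the goal is to have Scrapy download the XML files themselves rather than only record their URLs, the built-in FilesPipeline can do that, and it expects the item's file_urls field to be a list of URLs. A minimal settings sketch (the storage path is an assumption, not from the post):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'downloaded_xml'  # assumed local directory for the downloaded files

With this enabled, yielding {'file_urls': [response.urljoin(xmllink)]} makes the pipeline fetch each file and record the result in the item's files field.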
