Scrapy crawler gets a 403 error when crawling University of South Wales courses

t0ybt7op · asked 12 months ago · Other
1 answer · 154 views

I've been banging my head against this for a while and figured I'd hand it over to the mavens of the internet for a bit of help.
I'm trying to use Scrapy to scrape the University of South Wales course listings (all public information, of course), but every time I do I get a 403 that stops me from pulling any data.
Here is my spider code:

import scrapy

class CrawlingSpider(scrapy.Spider):
    name = "southwalescrawler"
    start_urls = ["https://www.southwales.ac.uk/courses/"]
    download_delay = 2

    def parse(self, response):
        pass

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/58.0.3029.110 Safari/537.3',
            'Referer': 'https://www.southwales.ac.uk/'
        }
        cookies = {'cookie_name': 'cookie_value'}
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers, cookies=cookies, callback=self.parse)

As you can see, I'm handling cookies, delaying requests, and setting a User-Agent and Referer. Despite that, this is the result I get:

2023-12-15 11:51:45 [scrapy.core.engine] INFO: Spider opened
2023-12-15 11:51:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-15 11:51:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-15 11:51:45 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.southwales.ac.uk/robots.txt> (referer: None)
2023-12-15 11:51:45 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-12-15 11:51:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.southwales.ac.uk/courses/> (referer: https://www.southwales.ac.uk/)
2023-12-15 11:51:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.southwales.ac.uk/courses/>: HTTP status code is not handled or not allowed
2023-12-15 11:51:48 [scrapy.core.engine] INFO: Closing spider (finished)


ckocjqey1#

I don't know if anyone will come across this later hoping for an answer that's actually about how to do this with Scrapy, but I managed to solve the problem by dropping Scrapy and hand-building a web scraper with Selenium that just grabs the course information I'm after. It has to run non-headless to get past the security checks, but at least it's fun to watch it execute.
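
The answer doesn't share any code, so the following is only a minimal sketch of the non-headless Selenium approach it describes, assuming Selenium 4 with Chrome; the CSS selector for course links is a guess, not the answerer's actual code, and would need adjusting to the page's real markup.

# Minimal sketch of the non-headless Selenium approach described above.
# The selector "a[href*='/courses/']" is an assumption; inspect the page
# and adjust it to match the actual course-listing markup.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
# Deliberately NOT adding a headless flag: a visible browser window is
# what the answer says gets the requests past the site's bot protection.

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.southwales.ac.uk/courses/")
    driver.implicitly_wait(10)  # give the page time to render

    # Collect course titles and links from the listing page.
    for link in driver.find_elements(By.CSS_SELECTOR, "a[href*='/courses/']"):
        title = link.text.strip()
        href = link.get_attribute("href")
        if title:
            print(title, href)
finally:
    driver.quit()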
