I keep getting a 403 error when using Scrapy, even though I have set appropriate headers. The website I am trying to scrape is https://steamdb.info/graph/.
My code:
# Methods of the spider class. Scrapy calls start_requests (plural).
def start_requests(self):
    headers = {
        "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Mobile Safari/537.36",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,en-GB;q=0.8,ar;q=0.7",
        "cache-control": "no-cache",
        "pragma": "no-cache",
        "referer": "https://steamdb.info/graph/",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "x-requested-with": "XMLHttpRequest",
    }
    yield scrapy.Request(url="https://steamdb.info/graph", method="GET", headers=headers, callback=self.parse)

def parse(self, response):
    # stuff to do
    pass
Error:
2022-07-08 20:20:41 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://steamdb.info/graph> (referer: https://steamdb.info/graph/)
2022-07-08 20:20:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://steamdb.info/graph>: HTTP status code is not handled or not allowed
3 Answers
ut6juiuv #1
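The website is protected by Cloudflare, so setting browser-like headers alone is not enough to get past the 403.
It works with cloudscraper, a drop-in equivalent of the requests module that can handle Cloudflare protection.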
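The answer's own snippet did not survive, so here is a minimal sketch of the idea rather than the answerer's code. It assumes `cloudscraper` is installed (`pip install cloudscraper`) and uses its `create_scraper()` call, which returns a session object with the same interface as `requests.Session`. The helper `looks_like_cloudflare_block` is a hypothetical name and only a heuristic: it treats a 403 or 503 carrying a `Server: cloudflare` header as a likely Cloudflare block. Whether cloudscraper gets past the challenge the site currently serves is not guaranteed.

```python
from typing import Mapping


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic (an assumption, not an official check): a 403/503 served by Cloudflare."""
    server = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "server":
            server = value.lower()
    return status in (403, 503) and "cloudflare" in server


def fetch_with_cloudscraper(url: str) -> str:
    """Fetch a page through cloudscraper and return its HTML text."""
    # Imported lazily so the helper above works without the third-party package.
    import cloudscraper

    scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
    response = scraper.get(url)
    response.raise_for_status()  # raises if the site still answers 403
    return response.text


if __name__ == "__main__":
    html = fetch_with_cloudscraper("https://steamdb.info/graph/")
    print(html[:200])
```

The returned HTML can be parsed outside Scrapy, or wrapped in a `scrapy.http.HtmlResponse` if you want to keep using Scrapy selectors.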
tzcvj98z #2
This is because the page does not exist: https://steamdb.info/graphs/ returns a 404.
Thanks
n9vozmp4 #3
I solved this problem. If a website uses Cloudflare, you can use undetected-chromedriver and plug it in as a Scrapy middleware. Add this to Middleware.py:
Settings.py:
my_scraper.py:
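The three snippets from this answer were lost, so the following is only a sketch of how such a downloader middleware might look, not the answerer's code. It assumes `scrapy` and `undetected_chromedriver` are installed and that Chrome is available. The class name `UndetectedChromeMiddleware`, the `driver_factory` parameter, the module path `myproject.middlewares`, and the priority `543` are all hypothetical choices for illustration.

```python
class UndetectedChromeMiddleware:
    """Downloader middleware sketch: fetch pages with a real Chrome instance."""

    def __init__(self, driver_factory=None):
        # driver_factory is a hypothetical hook so a stub driver can be injected.
        self._driver_factory = driver_factory
        self._driver = None

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals

        middleware = cls()
        # Quit the browser when the spider finishes.
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def _get_driver(self):
        # Start the browser lazily, on the first request only.
        if self._driver is None:
            if self._driver_factory is None:
                import undetected_chromedriver as uc

                self._driver = uc.Chrome()
            else:
                self._driver = self._driver_factory()
        return self._driver

    def process_request(self, request, spider):
        from scrapy.http import HtmlResponse

        driver = self._get_driver()
        driver.get(request.url)
        # Returning a Response here makes Scrapy skip its own download.
        return HtmlResponse(
            url=driver.current_url,
            body=driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        if self._driver is not None:
            self._driver.quit()
            self._driver = None


# settings.py (assumed project name "myproject"):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.UndetectedChromeMiddleware": 543,
# }
```

With the middleware enabled, the spider itself needs no special code: an ordinary `scrapy.Request` is routed through the browser and `parse` receives the rendered HTML. Note that `driver.get` blocks, so requests are fetched one at a time in this sketch.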