HTTP status code is not handled or not allowed in Scrapy when trying to scrape a website's products with Python

x33g5p2x  posted on 2023-03-12  in Python

I'm getting this error:

[scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.bigbasket.com> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.bigbasket.com>: HTTP status code is not handled or not allowed
To solve it I searched the internet and found a few suggested fixes, but none of them worked for me. For example, I tried scrapy-user-agents and pasted this code into settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

I also tried pip install scrapy-random-useragent.
I want to scrape all the products from this website. Can someone please help me fix this? Here is my code.

from urllib.parse import urljoin
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from datetime import datetime
import pandas as pd

class GrocerySpider(scrapy.Spider):
    name = "bigbasket"
    
    def start_requests(self):
        url = "https://www.bigbasket.com"
        
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        url = response.url
        print("wow")
        
if __name__ == '__main__':
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(GrocerySpider)
    
    process.start()

htrmnn0y1#

The error message you are seeing indicates that the server returned an HTTP 403 response code, which means the server understood the request but refuses to authorize it. This usually happens because the site's anti-bot protection rejected the request, for example when the scraper is not configured to look like a regular browser.
To resolve this issue, you can try the following solutions:

Change the user agent: Some websites block requests from certain user agents. You can try changing the user agent in your scraper to a common browser user agent by adding the following lines to your spider:

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
headers = {'User-Agent': user_agent}
yield scrapy.Request(url, headers=headers, callback=self.parse_dir_contents)
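
If you would rather apply one user agent to every request the spider makes, you can also set it once on the spider itself. A minimal sketch, reusing the GrocerySpider from the question (the user-agent string is just an example):

class GrocerySpider(scrapy.Spider):
    name = "bigbasket"
    # USER_AGENT here overrides the project-wide setting for this spider only
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/58.0.3029.110 Safari/537.36',
    }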
Use a proxy: Some websites block requests from certain IP addresses. You can try sending your requests through a proxy server by adding the following lines to your spider:
PROXY_POOL_ENABLED = True  # in settings.py, used by the scrapy-proxy-pool package

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse_dir_contents, meta={'proxy': 'http://proxyserver:port'})

def parse_dir_contents(self, response):
    # parsing logic here
Replace "proxyserver" and "port" with the IP address and port number of the proxy server you want to use.
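
If you have several proxies available, a common pattern is to rotate between them per request; Scrapy's built-in HttpProxyMiddleware picks up the 'proxy' meta key automatically. A minimal sketch, where the addresses in PROXIES are placeholders you would replace with real proxies:

import random

PROXIES = [
    'http://proxyserver1:port',  # placeholder, replace with a real proxy
    'http://proxyserver2:port',  # placeholder, replace with a real proxy
]

def start_requests(self):
    for url in self.start_urls:
        # choose a different proxy for each request
        yield scrapy.Request(url, callback=self.parse_dir_contents,
                             meta={'proxy': random.choice(PROXIES)})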

Slow down your requests: Some websites block requests that are sent too frequently. You can try adding a delay between requests by putting the following line in your settings.py (or the spider's custom_settings):

DOWNLOAD_DELAY = 1
This adds a one-second delay between consecutive requests.
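
A couple of related Scrapy settings can make the throttling less predictable and more polite; the AutoThrottle extension adapts the delay to how quickly the server responds. A minimal sketch for settings.py:

# settings.py
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5x and 1.5x DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # adjust the delay based on server response times
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10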
Note that using proxy servers is not always legal or ethical, so make sure to review the website's terms of service before using one.
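
Separately, the "HTTP status code is not handled or not allowed" part of your log just means Scrapy's HttpErrorMiddleware dropped the 403 response before it reached your callback. While debugging, you can let the response through and inspect it by allowing the status code explicitly. A minimal sketch based on the spider from the question:

class GrocerySpider(scrapy.Spider):
    name = "bigbasket"
    # let 403 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [403]

    def parse_dir_contents(self, response):
        if response.status == 403:
            self.logger.warning("Blocked by the server: %s", response.url)
            return
        # normal parsing logic here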
