Scrapy DEBUG: Crawled (403) when web scraping in Python 3

dz6r00yl  posted on 2022-11-09  in  Python

I am trying to scrape a website for practice, but I keep getting an HTTP 403 error. How can I get the request accepted?
Here is my code:

from typing import List

import scrapy

class ResearchSpider(scrapy.Spider):
    name = 'pesquisa'
    # List[str] uses the typing import above and also works on Python versions before 3.9.
    start_urls: List[str] = ['https://www.imovelweb.com.br/imoveis-aluguel-paraiba.html?iv_=__iv_p_1_a_17808488596_g_139189246037_w_dsa-1687663569069_h_20089_ii_20098_d_c_v__n_g_c_611609016411_k__m__l__t__e__r__vi__']

    def parse(self, response):
        # Each matched listing card yields one item.
        for pesquisa in response.css('.js-listing-labels-link'):
            # .get() returns None when nothing matches, so guard before slicing.
            price = pesquisa.css('p::text').get()
            yield {
                'address': pesquisa.css('.property-card__address::text').get(),
                'area': pesquisa.css('.js-property-card-detail-area::text').get(),
                'rooms': pesquisa.css('.js-property-detail-rooms .js-property-card-value::text').get(),
                'bathroom': pesquisa.css('.js-property-detail-bathroom .js-property-card-value::text').get(),
                'garages': pesquisa.css('.js-property-detail-garages .js-property-card-value::text').get(),
                'prices': price[5:-1] if price else None,
            }

Commands I run in the terminal:

scrapy shell
fetch('https://www.imovelweb.com.br/imoveis-paraiba.html')

The error I get is:

2022-09-16 14:11:14 [filelock] DEBUG: Attempting to release lock 1292395054608 on C:\Users\Familia\anaconda3\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-16 14:11:14 [filelock] DEBUG: Lock 1292395054608 released on C:\Users\Familia\anaconda3\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-16 14:11:16 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.imovelweb.com.br/imoveis-paraiba.html> (referer: None)
uwopmtnx1#

The website is protected by Cloudflare.

https://www.imovelweb.com.br/imoveis-aluguel-paraiba.html?iv_=__iv_p_1_a_17808488596_g_139189246037_w_dsa-1687663569069_h_20089_ii_20098_d_c_v__n_g_c_611609016411_k__m__l__t__e__r__vi__ is using Cloudflare CDN/Proxy!

https://www.imovelweb.com.br/imoveis-aluguel-paraiba.html?iv_=__iv_p_1_a_17808488596_g_139189246037_w_dsa-1687663569069_h_20089_ii_20098_d_c_v__n_g_c_611609016411_k__m__l__t__e__r__vi__ is using Cloudflare SSL!
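
The check above can also be approximated from the response itself. The following is a minimal sketch, not the tool that produced the output above: it relies on the fact that responses passing through Cloudflare normally carry a `Server: cloudflare` header and a `CF-RAY` header. The function name `looks_like_cloudflare_block` is hypothetical, and the choice of 403 and 503 as "blocked" status codes is an assumption.

```python
from typing import Mapping


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: True when a refused response carries Cloudflare markers.

    The function name and the set of "blocked" status codes are
    assumptions made for this sketch.
    """
    # Header names are case-insensitive, so normalise them before comparing.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    via_cloudflare = lowered.get("server", "") == "cloudflare" or "cf-ray" in lowered
    return status in (403, 503) and via_cloudflare


if __name__ == "__main__":
    # Made-up header values, only to demonstrate the call.
    sample = {"Server": "cloudflare", "CF-RAY": "example-ray-id"}
    print(looks_like_cloudflare_block(403, sample))
```

In the Scrapy shell, after `fetch(...)`, the same check could probably be run as `looks_like_cloudflare_block(response.status, response.headers.to_unicode_dict())`, since Scrapy stores header names and values as bytes.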

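
As for the request itself, one thing that is sometimes worth trying is sending browser-like headers and slowing the crawl down. The sketch below only builds a settings dictionary from standard Scrapy setting names (`USER_AGENT`, `DEFAULT_REQUEST_HEADERS`, `DOWNLOAD_DELAY`, `AUTOTHROTTLE_ENABLED`, `ROBOTSTXT_OBEY`). The helper name `build_custom_settings`, the User-Agent string, and the header values are illustrative assumptions. Cloudflare can also rely on JavaScript challenges and other checks, so this may well not be enough to avoid the 403.

```python
from typing import Any, Dict

# Placeholder value: substitute the User-Agent of a real, current browser.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
)


def build_custom_settings(
    user_agent: str = EXAMPLE_USER_AGENT,
    delay_seconds: float = 2.0,
) -> Dict[str, Any]:
    """Return a dict suitable for a spider's `custom_settings` attribute.

    Hypothetical helper: the keys are real Scrapy settings, the values
    are illustrative defaults rather than recommendations.
    """
    return {
        # Replace Scrapy's default User-Agent, which identifies the crawler.
        "USER_AGENT": user_agent,
        # Headers a browser would typically send with a page request.
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
        },
        # Wait between requests and let AutoThrottle adapt the pace.
        "DOWNLOAD_DELAY": delay_seconds,
        "AUTOTHROTTLE_ENABLED": True,
        # Keep honouring robots.txt.
        "ROBOTSTXT_OBEY": True,
    }
```

Inside the spider class this would be used as `custom_settings = build_custom_settings()`. If the site still answers 403, the block is likely based on more than headers, and it is worth checking whether the site's terms allow scraping at all.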
