I am following the Scrapy tutorial here, and I am trying to link it up with my own project.
I first create a project by running:
scrapy startproject idealistaScraper
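This generates the standard project skeleton (so the spiders live under idealistaScraper/idealistaScraper/spiders/):

idealistaScraper/
    scrapy.cfg
    idealistaScraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/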
Next, I go to the spiders folder and create a new Python file with the following code:
import scrapy

print("\n", "-"*145, "\n", "-"*60, "Starting the Scrapy bot", "-"*60, "\n", "-"*145, "\n")

class QuotesSpider(scrapy.Spider):
    name = "idealistaCollector"

    def start_requests(self):
        urls = [
            'https://www.idealista.com/inmueble/97010777/'
            #'https://www.idealista.com/inmueble/97010777/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save the raw HTML of each response to a local file
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
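For the URL above, response.url.split("/")[-2] evaluates to 97010777, so a successful crawl would save the page as quotes-97010777.html.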
I save this as connection_spider.py.
Finally, I run:
scrapy crawl idealistaCollector
where idealistaCollector is the name I gave the spider in the connection_spider.py file.
The output I get is the following:
-------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------ Starting the Scrapy bot ------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------
2022-04-08 18:42:51 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: idealistaScraper)
2022-04-08 18:42:51 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59) - [GCC 10.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n 15 Mar 2022), cryptography 36.0.2, Platform Linux-5.3.18-150300.59.49-default-x86_64-with-glibc2.31
2022-04-08 18:42:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'idealistaScraper',
'NEWSPIDER_MODULE': 'idealistaScraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['idealistaScraper.spiders']}
2022-04-08 18:42:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-04-08 18:42:51 [scrapy.extensions.telnet] INFO: Telnet Password: 3ca0ebf8976d6291
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-08 18:42:51 [scrapy.core.engine] INFO: Spider opened
2022-04-08 18:42:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-08 18:42:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to acquire lock 140245041823504 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245041823504 acquired on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to acquire lock 140245032917552 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245032917552 acquired on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to release lock 140245032917552 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245032917552 released on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to release lock 140245041823504 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245041823504 released on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/robots.txt> (referer: None)
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/inmueble/97010777/> (referer: None)
2022-04-08 18:42:52 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.idealista.com/inmueble/97010777/>: HTTP status code is not handled or not allowed
2022-04-08 18:42:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-08 18:42:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 617,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2461,
'downloader/response_count': 2,
'downloader/response_status_count/403': 2,
'elapsed_time_seconds': 0.353292,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 8, 16, 42, 52, 274906),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 11,
'log_count/INFO': 11,
'memusage/max': 68902912,
'memusage/startup': 68902912,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/403': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 8, 16, 42, 51, 921614)}
2022-04-08 18:42:52 [scrapy.core.engine] INFO: Spider closed (finished)
So my question is: how can I get around the 403 errors I am getting?
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/robots.txt> (referer: None)
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/inmueble/97010777/> (referer: None)
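Note that the first 403 is for robots.txt itself: the generated project sets ROBOTSTXT_OBEY: True (visible in the overridden settings above), so Scrapy requests robots.txt before anything else. A minimal sketch of the settings one could change in idealistaScraper/settings.py, assuming the site is blocking Scrapy's default user agent (a guess, not a confirmed fix):

# idealistaScraper/settings.py
ROBOTSTXT_OBEY = False  # skip the robots.txt request (consider the site's terms before doing this)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'  # browser-like UA instead of Scrapy's default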
I have also tried adding the following custom headers in the connection_spider.py file, but still without any luck.
#### own defined functions ###
from random import sample

desktop_agents = {"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                  "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14",
                  "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
                  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
                  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
                  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
                  "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                  "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
                  }

userAGENT = sample(desktop_agents, 1)[0]  # sample() returns a list; take the single string

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'es-ES,es;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'referer': 'https://www.idealista.com/en/',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': userAGENT
}

print("-"*30, "Using User Agent ", userAGENT)

custom_settings = {
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # throttle: one concurrent request per domain
    'DOWNLOAD_DELAY': 10  # 10 second delay between downloads
}
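For these to take effect, they have to be wired into the spider itself: Scrapy only honours custom_settings as a class attribute, and per-request headers have to be passed to scrapy.Request. A minimal sketch of how that would look, reusing the headers dict defined above (whether this is enough to get past the block is exactly what I am unsure about):

class QuotesSpider(scrapy.Spider):
    name = "idealistaCollector"
    # Scrapy reads custom_settings from the class, not from module level
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 10
    }

    def start_requests(self):
        urls = ['https://www.idealista.com/inmueble/97010777/']
        for url in urls:
            # attach the browser-like headers to every outgoing request
            yield scrapy.Request(url=url, headers=headers, callback=self.parse)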
Edit:
Additionally, when I run:
wget https://www.idealista.com/buscar/venta-oficinas/240/
--2022-04-08 17:30:01-- https://www.idealista.com/buscar/venta-oficinas/240/
Resolviendo www.idealista.com (www.idealista.com)... 151.101.18.137
Conectando con www.idealista.com (www.idealista.com)[151.101.18.137]:443... conectado.
Petición HTTP enviada, esperando respuesta... 403 Forbidden
2022-04-08 17:30:01 ERROR 403: Forbidden.
1 Answer
I also got a 403 when using Scrapy on both URLs (here and here), but when I use the Python requests module it works, meaning the response status is 200. Below is an example you can test:
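A minimal sketch of that approach, reusing the question's URL and a browser-like user agent (the exact header set may matter):

import requests

url = 'https://www.idealista.com/inmueble/97010777/'
# without a browser-like user agent the site may still return 403
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.status_code)  # the answer reports 200 here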
Output: