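I am using Scrapy to scrape a website whose server has a broken SSL configuration (I have no control over the server configuration). This causes Scrapy (or Twisted?) to fail the SSL handshake on every connection attempt, even with custom_settings that use the same parameters that work with the OpenSSL CLI and with a basic proof of concept using Python and ssl (see below).
What am I doing wrong? Scrapy's STDOUT shows that the settings override is taking effect, yet every handshake fails.
Details on the root cause of the server's SSL problem are here. In short, it only accepts TLS 1.2 and requires the client to offer SHA-1 as a signature algorithm, so SECLEVEL=0 is needed in the client context.
Scrapy output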
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# scrapy crawl badsslconfig
2023-08-06 05:40:22 [scrapy.utils.log] INFO: Scrapy 2.10.0 started (bot: ssl_test)
2023-08-06 05:40:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (main, Jul 28 2023, 05:02:22) [GCC 12.2.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Linux-5.15.49-linuxkit-x86_64-with-glibc2.36
2023-08-06 05:40:22 [scrapy.addons] INFO: Enabled addons:
[]
2023-08-06 05:40:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ssl_test',
'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'ssl_test.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['ssl_test.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-08-06 05:40:22 [asyncio] DEBUG: Using selector: EpollSelector
2023-08-06 05:40:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-08-06 05:40:22 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-08-06 05:40:22 [scrapy.extensions.telnet] INFO: Telnet Password: 52cbcfbfdbe0e1e7
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Spider opened
2023-08-06 05:40:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-06 05:40:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au/robots.txt> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au/robots.txt> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.legislation.gov.au/robots.txt> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.legislation.gov.au/robots.txt>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
Traceback (most recent call last):
File "/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.legislation.gov.au> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.legislation.gov.au>
Traceback (most recent call last):
File "/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-06 05:40:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
'downloader/request_bytes': 1368,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 0.612751,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 8, 6, 5, 40, 23, 702263),
'log_count/DEBUG': 7,
'log_count/ERROR': 4,
'log_count/INFO': 10,
'memusage/max': 64425984,
'memusage/startup': 64425984,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
"robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2023, 8, 6, 5, 40, 23, 89512)}
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Spider closed (finished)
Version information:
root@348980730ce9:/ssl_test/ssl_test/spiders# scrapy version -v
Scrapy : 2.10.0
lxml : 4.9.3.0
libxml2 : 2.10.3
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.2
Twisted : 22.10.0
Python : 3.11.4 (main, Jul 28 2023, 05:02:22) [GCC 12.2.0]
pyOpenSSL : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform : Linux-5.15.49-linuxkit-x86_64-with-glibc2.36
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# which openssl
/usr/bin/openssl
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# openssl version -v
Successful OpenSSL handshake
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# openssl s_client -connect 54.66.220.183:443 -cipher 'DEFAULT:@SECLEVEL=0'
CONNECTED(00000003)
Can't use SSL_get_servername
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
verify return:1
depth=0 CN = *.legislation.gov.au
verify return:1
---
Certificate chain
0 s:CN = *.legislation.gov.au
i:C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
v:NotBefore: Jan 30 00:00:00 2023 GMT; NotAfter: Feb 11 23:59:59 2024 GMT
1 s:C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
v:NotBefore: May 4 00:00:00 2022 GMT; NotAfter: Nov 9 23:59:59 2031 GMT
2 s:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
v:NotBefore: Nov 10 00:00:00 2006 GMT; NotAfter: Nov 10 00:00:00 2031 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIHqDCCBZCgAwIBAgIQCD51iO5LTch3btyrzg48bzANBgkqhkiG9w0BAQsFADBc
MQswCQYDVQQGEwJVUzEXMBUGA1UEChMORGlnaUNlcnQsIEluYy4xNDAyBgNVBAMT
K1JhcGlkU1NMIEdsb2JhbCBUTFMgUlNBNDA5NiBTSEEyNTYgMjAyMiBDQTEwHhcN
MjMwMTMwMDAwMDAwWhcNMjQwMjExMjM1OTU5WjAfMR0wGwYDVQQDDBQqLmxlZ2lz
bGF0aW9uLmdvdi5hdTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBANrx
FvQbBE9bnuXZiHrdR7mB1tkiWLTHhoAq00uAffKkS6bkM1Gs7OuO5XKBP0LlBPll
bgn/DJ5pXlZKX3nqhjV3x/nJRRqAf3EdvrDMTRbj4zyxQ+4zQ0V8sOVcU5HJddcu
yNQek1LLhXf5tpWpd+RsP5V7CZlIHLl3PyrCuCsugv4SKnGh1Xm0QrHB/NrpNz8w
J1hTQTP6NlO7KiVs92BQ6ZXTl1ZD5mmgg5muDo0kpNN2inzv2BJvdH4KCEw5bTAq
EmcWXM+vHoQA0acFEMwwxr8iT/1keaKAwRabg9PiWqDdA13egKNQAqUIDK1dF/eM
pf8X75arHZxkk2+CMjMCAwEAAaOCA6EwggOdMB8GA1UdIwQYMBaAFPCchf2in32P
yWi71dSJTR2+05D/MB0GA1UdDgQWBBRcWwBAEE3RJ6flW6Mf5kraJAjWRDAzBgNV
HREELDAqghQqLmxlZ2lzbGF0aW9uLmdvdi5hdYISbGVnaXNsYXRpb24uZ292LmF1
MA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIw
gZ8GA1UdHwSBlzCBlDBIoEagRIZCaHR0cDovL2NybDMuZGlnaWNlcnQuY29tL1Jh
cGlkU1NMR2xvYmFsVExTUlNBNDA5NlNIQTI1NjIwMjJDQTEuY3JsMEigRqBEhkJo
dHRwOi8vY3JsNC5kaWdpY2VydC5jb20vUmFwaWRTU0xHbG9iYWxUTFNSU0E0MDk2
U0hBMjU2MjAyMkNBMS5jcmwwPgYDVR0gBDcwNTAzBgZngQwBAgEwKTAnBggrBgEF
BQcCARYbaHR0cDovL3d3dy5kaWdpY2VydC5jb20vQ1BTMIGHBggrBgEFBQcBAQR7
MHkwJAYIKwYBBQUHMAGGGGh0dHA6Ly9vY3NwLmRpZ2ljZXJ0LmNvbTBRBggrBgEF
BQcwAoZFaHR0cDovL2NhY2VydHMuZGlnaWNlcnQuY29tL1JhcGlkU1NMR2xvYmFs
VExTUlNBNDA5NlNIQTI1NjIwMjJDQTEuY3J0MAkGA1UdEwQCMAAwggF+BgorBgEE
AdZ5AgQCBIIBbgSCAWoBaAB2AO7N0GTV2xrOxVy3nbTNE6Iyh0Z8vOzew1FIWUZx
H7WbAAABhgUN3XoAAAQDAEcwRQIhAIuzKlDiXLZitacpPcnjPr+ivxEwoh3PVaSm
6cSs0ufWAiAeCWS3fTLXwi9X1BFpZqGlyUVwo+GGsBVf48TtfRTrcgB2AHPZnokb
TJZ4oCB9R53mssYc0FFecRkqjGuAEHrBd3K1AAABhgUN3Z8AAAQDAEcwRQIhAKOm
Ht0FHIjxWfNvxQ5hsAxAhnMD+E6vN+VtOItO+JMIAiBKKW5bNxkrTVH8UJmo688w
Nzq6mifm0HpqA7zcX3W8MAB2AEiw42vapkc0D+VqAvqdMOscUgHLVt0sgdm7v6s5
2IRzAAABhgUN3WUAAAQDAEcwRQIhAPt7qx6WI7D2Ohuiw12Y6Wdak9SyfP47tDXF
ygquEtgeAiA7DSooWXRKaVjCWX75kDCt70PoA6MJd2xb6qZyTfV0DDANBgkqhkiG
9w0BAQsFAAOCAgEAHdZISuK409QEVnClR0w3Hwkeca/uoRADtvNUg69Ei6oHhEZw
tb1FvXPxhdXEU6409a9mNdjcmLDg+5Cfo9zVWpneL2vg+qcbbsq7W31WjA7DWoHV
HjRSzoYzd9SGsGGOMmqXlOFtLVhkBJTdxb7DyVMTZxZoKIzL5EXqj9VykYB+nAm2
Xv8+xcTBzoaF5OhvVQ78K2I1X5rjDwIsrbpCBpB6MUAiLsmBDY5F+mXnFIG+8Jxk
OLmJ88pQWblLRub59xBC5i2+qXSNqyAJKcIY3HUGpA+f/KT5f7K5DMMlecxPpBJW
eLzlXzOXE8vYezKtazhMdi8eO2zEVedAY8BmvGcoHFMFIcfZ9Bbno5qSiGb5WIfw
oxupuQtvtTg6oBtN7vHanBtc4+EVaQrKmQ2VnTRug4PTGTUcRaFmWY0d5+pfiSbo
v7zW5tVOl6Whu9+alcAAl5L1kZwrGPwWYXazDf4Q6lh2mLToA/b4AFQRmKDCpa1X
HIXNpAHbBKBNXGUfK1Ky9ZEtJpOAi0fPRwVGRwR2mzAdE+rzz6ARSWn5+xaStqtm
ImflxSVn2YI041tBguWayCw4du+iOFVBpdPzEiMOyJ95L+XngAZCwc296hnkljiL
8wRteqCkwMMXpVfHSTDopMKPndZ3k99Hv/XSHAqQ0xXYspoLNlhjtNf0ELA=
-----END CERTIFICATE-----
subject=CN = *.legislation.gov.au
issuer=C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
---
No client certificate CA names sent
---
SSL handshake has read 4565 bytes and written 621 bytes
Verification: OK
---
New, TLSv1.2, Cipher is AES128-GCM-SHA256
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : AES128-GCM-SHA256
Session-ID: 35C8A1175ABF47501236C0C9B171BCD21F973C8C745E5D2377851B53DE62ED60
Session-ID-ctx:
Master-Key: DB0F145BB6A858F762CE4ED39E19F77C531B91A41CDED14E8A96377F688A9BA1A5B3386FE83017A83F4B99CEDBFEDDCD
PSK identity: None
PSK identity hint: None
SRP username: None
Start Time: 1691299616
Timeout : 7200 (sec)
Verify return code: 0 (ok)
Extended master secret: no
---
closed
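Note: I deliberately used the IP address rather than the hostname, because there are some IPv6 servers sharing the same name, and those appear to be configured correctly.
Steps to reproduce:
1. Deploy a fresh default Python container from Docker Hub
- pip install scrapy
- scrapy startproject ssl_test
- scrapy genspider badsslconfig https://www.legislation.gov.au
2. Add custom_settings to the spider class definition in the .py file: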
custom_settings = {
    'DOWNLOADER_CLIENT_TLS_METHOD': 'TLSv1.2',
    'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
}
Other troubleshooting steps attempted:
- Downgrading to OpenSSL 1.1.1
- A proof of concept using Python and ssl (i.e. bypassing Scrapy's dependencies):
import socket
import ssl

hostname = 'legislation.gov.au'
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
context.set_ciphers('DEFAULT:@SECLEVEL=0')
# It's not important to authenticate the server for the moment.
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = context.wrap_socket(s, server_hostname=hostname)
ssl_sock.connect((hostname, 443))
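This works as expected, which suggests the problem lies somewhere in Scrapy's implementation or in one of its dependencies.
- Testing on another platform (macOS): same error
Scrapy spider definition (all other files are the defaults):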
import scrapy


class BadsslconfigSpider(scrapy.Spider):
    name = "badsslconfig"
    allowed_domains = ["www.legislation.gov.au"]
    start_urls = ["https://www.legislation.gov.au"]
    # Same cipher string that succeeds with the OpenSSL CLI.
    custom_settings = {
        'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
    }

    def parse(self, response):
        pass
1 Answer
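TL;DR: It looks like the SECLEVEL information is dropped by Twisted, the library Scrapy uses to handle I/O, including TLS.
Details:
Based on some debugging in the code, it looks like Twisted expands the cipher string before setting the ciphers: it sets the cipher string on an SSL context with set_cipher_list and then reads the ciphers back from that context with get_cipher_list. Since SECLEVEL is not an actual cipher, it gets dropped along the way. The SECLEVEL information is still contained in the SSL context that was used, but unfortunately that context is only a temporary one for obtaining the expanded cipher list and is not actually used when making the connection. See _expandCipherString for more information.
This result can also be observed by taking a packet capture and analysing the signature_algorithms extension. With SECLEVEL=0, SHA-1 should be listed there, which is also what the (broken) server expects in order to work. But SHA-1 is not listed, i.e. SECLEVEL is ignored.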
Other than digging into Twisted itself, I see no workaround. A quick and dirty approach is to add @SECLEVEL=0 when calling set_cipher_list. Thus, in _sslverify.py, do this:
Filed as a bug with Twisted: https://github.com/twisted/twisted/issues/11903