Why does a Scrapy spider's SSL handshake fail with the same custom_settings that work for OpenSSL?

06odsfpq  posted on 2023-10-19 in Other

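I am using Scrapy to scrape a website whose server has a misconfigured SSL setup. (I do not control the server configuration.) As a result, Scrapy (or Twisted?) produces an SSL handshake failure on every connection attempt, even though custom_settings uses the same parameters that work with the OpenSSL CLI and with a basic proof of concept using Python's ssl module (see below).
What am I doing wrong? Scrapy's STDOUT shows that the settings overrides are taking effect, but every handshake fails.
The root cause of the server's SSL problem is documented separately. In summary, the server only accepts TLS 1.2 and requires the client to offer SHA-1 as a signature algorithm, so SECLEVEL=0 is needed in the client context.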

Scrapy output

(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# scrapy crawl badsslconfig
2023-08-06 05:40:22 [scrapy.utils.log] INFO: Scrapy 2.10.0 started (bot: ssl_test)
2023-08-06 05:40:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (main, Jul 28 2023, 05:02:22) [GCC 12.2.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Linux-5.15.49-linuxkit-x86_64-with-glibc2.36
2023-08-06 05:40:22 [scrapy.addons] INFO: Enabled addons:
[]
2023-08-06 05:40:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ssl_test',
 'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'ssl_test.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['ssl_test.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-08-06 05:40:22 [asyncio] DEBUG: Using selector: EpollSelector
2023-08-06 05:40:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-08-06 05:40:22 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-08-06 05:40:22 [scrapy.extensions.telnet] INFO: Telnet Password: 52cbcfbfdbe0e1e7
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-08-06 05:40:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Spider opened
2023-08-06 05:40:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-06 05:40:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au/robots.txt> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au/robots.txt> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.legislation.gov.au/robots.txt> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.legislation.gov.au/robots.txt>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
Traceback (most recent call last):
  File "/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.legislation.gov.au> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.legislation.gov.au> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.legislation.gov.au>
Traceback (most recent call last):
  File "/.venv/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', '', 'sslv3 alert handshake failure')]>]
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Closing spider (finished)
2023-08-06 05:40:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
 'downloader/request_bytes': 1368,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'elapsed_time_seconds': 0.612751,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 6, 5, 40, 23, 702263),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 4,
 'log_count/INFO': 10,
 'memusage/max': 64425984,
 'memusage/startup': 64425984,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
 "robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
 'robotstxt/request_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2023, 8, 6, 5, 40, 23, 89512)}
2023-08-06 05:40:23 [scrapy.core.engine] INFO: Spider closed (finished)

Version information

root@348980730ce9:/ssl_test/ssl_test/spiders# scrapy version -v
Scrapy       : 2.10.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.11.4 (main, Jul 28 2023, 05:02:22) [GCC 12.2.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.49-linuxkit-x86_64-with-glibc2.36
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# which openssl
/usr/bin/openssl
(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# openssl version -v

Successful OpenSSL handshake

(.venv) root@348980730ce9:/ssl_test/ssl_test/spiders# openssl s_client -connect 54.66.220.183:443 -cipher 'DEFAULT:@SECLEVEL=0'
CONNECTED(00000003)
Can't use SSL_get_servername
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
verify return:1
depth=0 CN = *.legislation.gov.au
verify return:1
---
Certificate chain
 0 s:CN = *.legislation.gov.au
   i:C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jan 30 00:00:00 2023 GMT; NotAfter: Feb 11 23:59:59 2024 GMT
 1 s:C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
   i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
   v:NotBefore: May  4 00:00:00 2022 GMT; NotAfter: Nov  9 23:59:59 2031 GMT
 2 s:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
   i:C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA1
   v:NotBefore: Nov 10 00:00:00 2006 GMT; NotAfter: Nov 10 00:00:00 2031 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIHqDCCBZCgAwIBAgIQCD51iO5LTch3btyrzg48bzANBgkqhkiG9w0BAQsFADBc
MQswCQYDVQQGEwJVUzEXMBUGA1UEChMORGlnaUNlcnQsIEluYy4xNDAyBgNVBAMT
K1JhcGlkU1NMIEdsb2JhbCBUTFMgUlNBNDA5NiBTSEEyNTYgMjAyMiBDQTEwHhcN
MjMwMTMwMDAwMDAwWhcNMjQwMjExMjM1OTU5WjAfMR0wGwYDVQQDDBQqLmxlZ2lz
bGF0aW9uLmdvdi5hdTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBANrx
FvQbBE9bnuXZiHrdR7mB1tkiWLTHhoAq00uAffKkS6bkM1Gs7OuO5XKBP0LlBPll
bgn/DJ5pXlZKX3nqhjV3x/nJRRqAf3EdvrDMTRbj4zyxQ+4zQ0V8sOVcU5HJddcu
yNQek1LLhXf5tpWpd+RsP5V7CZlIHLl3PyrCuCsugv4SKnGh1Xm0QrHB/NrpNz8w
J1hTQTP6NlO7KiVs92BQ6ZXTl1ZD5mmgg5muDo0kpNN2inzv2BJvdH4KCEw5bTAq
EmcWXM+vHoQA0acFEMwwxr8iT/1keaKAwRabg9PiWqDdA13egKNQAqUIDK1dF/eM
pf8X75arHZxkk2+CMjMCAwEAAaOCA6EwggOdMB8GA1UdIwQYMBaAFPCchf2in32P
yWi71dSJTR2+05D/MB0GA1UdDgQWBBRcWwBAEE3RJ6flW6Mf5kraJAjWRDAzBgNV
HREELDAqghQqLmxlZ2lzbGF0aW9uLmdvdi5hdYISbGVnaXNsYXRpb24uZ292LmF1
MA4GA1UdDwEB/wQEAwIFoDAdBgNVHSUEFjAUBggrBgEFBQcDAQYIKwYBBQUHAwIw
gZ8GA1UdHwSBlzCBlDBIoEagRIZCaHR0cDovL2NybDMuZGlnaWNlcnQuY29tL1Jh
cGlkU1NMR2xvYmFsVExTUlNBNDA5NlNIQTI1NjIwMjJDQTEuY3JsMEigRqBEhkJo
dHRwOi8vY3JsNC5kaWdpY2VydC5jb20vUmFwaWRTU0xHbG9iYWxUTFNSU0E0MDk2
U0hBMjU2MjAyMkNBMS5jcmwwPgYDVR0gBDcwNTAzBgZngQwBAgEwKTAnBggrBgEF
BQcCARYbaHR0cDovL3d3dy5kaWdpY2VydC5jb20vQ1BTMIGHBggrBgEFBQcBAQR7
MHkwJAYIKwYBBQUHMAGGGGh0dHA6Ly9vY3NwLmRpZ2ljZXJ0LmNvbTBRBggrBgEF
BQcwAoZFaHR0cDovL2NhY2VydHMuZGlnaWNlcnQuY29tL1JhcGlkU1NMR2xvYmFs
VExTUlNBNDA5NlNIQTI1NjIwMjJDQTEuY3J0MAkGA1UdEwQCMAAwggF+BgorBgEE
AdZ5AgQCBIIBbgSCAWoBaAB2AO7N0GTV2xrOxVy3nbTNE6Iyh0Z8vOzew1FIWUZx
H7WbAAABhgUN3XoAAAQDAEcwRQIhAIuzKlDiXLZitacpPcnjPr+ivxEwoh3PVaSm
6cSs0ufWAiAeCWS3fTLXwi9X1BFpZqGlyUVwo+GGsBVf48TtfRTrcgB2AHPZnokb
TJZ4oCB9R53mssYc0FFecRkqjGuAEHrBd3K1AAABhgUN3Z8AAAQDAEcwRQIhAKOm
Ht0FHIjxWfNvxQ5hsAxAhnMD+E6vN+VtOItO+JMIAiBKKW5bNxkrTVH8UJmo688w
Nzq6mifm0HpqA7zcX3W8MAB2AEiw42vapkc0D+VqAvqdMOscUgHLVt0sgdm7v6s5
2IRzAAABhgUN3WUAAAQDAEcwRQIhAPt7qx6WI7D2Ohuiw12Y6Wdak9SyfP47tDXF
ygquEtgeAiA7DSooWXRKaVjCWX75kDCt70PoA6MJd2xb6qZyTfV0DDANBgkqhkiG
9w0BAQsFAAOCAgEAHdZISuK409QEVnClR0w3Hwkeca/uoRADtvNUg69Ei6oHhEZw
tb1FvXPxhdXEU6409a9mNdjcmLDg+5Cfo9zVWpneL2vg+qcbbsq7W31WjA7DWoHV
HjRSzoYzd9SGsGGOMmqXlOFtLVhkBJTdxb7DyVMTZxZoKIzL5EXqj9VykYB+nAm2
Xv8+xcTBzoaF5OhvVQ78K2I1X5rjDwIsrbpCBpB6MUAiLsmBDY5F+mXnFIG+8Jxk
OLmJ88pQWblLRub59xBC5i2+qXSNqyAJKcIY3HUGpA+f/KT5f7K5DMMlecxPpBJW
eLzlXzOXE8vYezKtazhMdi8eO2zEVedAY8BmvGcoHFMFIcfZ9Bbno5qSiGb5WIfw
oxupuQtvtTg6oBtN7vHanBtc4+EVaQrKmQ2VnTRug4PTGTUcRaFmWY0d5+pfiSbo
v7zW5tVOl6Whu9+alcAAl5L1kZwrGPwWYXazDf4Q6lh2mLToA/b4AFQRmKDCpa1X
HIXNpAHbBKBNXGUfK1Ky9ZEtJpOAi0fPRwVGRwR2mzAdE+rzz6ARSWn5+xaStqtm
ImflxSVn2YI041tBguWayCw4du+iOFVBpdPzEiMOyJ95L+XngAZCwc296hnkljiL
8wRteqCkwMMXpVfHSTDopMKPndZ3k99Hv/XSHAqQ0xXYspoLNlhjtNf0ELA=
-----END CERTIFICATE-----
subject=CN = *.legislation.gov.au
issuer=C = US, O = "DigiCert, Inc.", CN = RapidSSL Global TLS RSA4096 SHA256 2022 CA1
---
No client certificate CA names sent
---
SSL handshake has read 4565 bytes and written 621 bytes
Verification: OK
---
New, TLSv1.2, Cipher is AES128-GCM-SHA256
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : AES128-GCM-SHA256
    Session-ID: 35C8A1175ABF47501236C0C9B171BCD21F973C8C745E5D2377851B53DE62ED60
    Session-ID-ctx: 
    Master-Key: DB0F145BB6A858F762CE4ED39E19F77C531B91A41CDED14E8A96377F688A9BA1A5B3386FE83017A83F4B99CEDBFEDDCD
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1691299616
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
---
closed

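Note: I deliberately used the IP address rather than the hostname, because some IPv6 servers share the same name and those appear to be configured correctly.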

Steps to reproduce:

1. Deploy a new default Python container from Docker Hub

  1. pip install scrapy
  2. scrapy startproject ssl_test
  3. scrapy genspider badsslconfig https://www.legislation.gov.au
    1. Add custom_settings to the spider class definition in the generated .py file:
custom_settings = {
        'DOWNLOADER_CLIENT_TLS_METHOD' : 'TLSv1.2',
        'DOWNLOADER_CLIENT_TLS_CIPHERS' : 'DEFAULT:@SECLEVEL=0'}

Other troubleshooting steps attempted:

  • Downgrading to OpenSSL 1.1.1
  • A proof of concept using Python and ssl (i.e. bypassing the Scrapy dependencies):
import socket
import ssl

hostname = 'legislation.gov.au'

# Force TLS 1.2 and lower OpenSSL's security level to 0, matching the
# openssl s_client invocation above.
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
context.set_ciphers('DEFAULT:@SECLEVEL=0')

# It's not important to authenticate the server for the moment.
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ssl_sock = context.wrap_socket(s, server_hostname=hostname)
ssl_sock.connect((hostname, 443))  # the TLS handshake happens here

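This works as expected, which suggests the problem lies somewhere in Scrapy's implementation or in one of its dependencies.

  • Testing on another platform (macOS): same error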

Scrapy spider definition (all other files are the defaults):

import scrapy


class BadsslconfigSpider(scrapy.Spider):
    name = "badsslconfig"
    allowed_domains = ["www.legislation.gov.au"]
    start_urls = ["https://www.legislation.gov.au"]

    # Lower OpenSSL's security level for this spider only.
    custom_settings = {
        'DOWNLOADER_CLIENT_TLS_CIPHERS': 'DEFAULT:@SECLEVEL=0',
    }

    def parse(self, response):
        pass

6ojccjat  #1

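TL;DR: It looks like the SECLEVEL information is lost inside Twisted, the library Scrapy uses to handle I/O, including TLS.
Details:
Based on some debugging in the code, Twisted appears to expand the cipher string before applying it: it sets the string on an SSL context with set_cipher_list and then reads the resulting ciphers back with get_cipher_list. Since SECLEVEL is not an actual cipher, it gets lost in this round trip. The SECLEVEL information is still held by the SSL context that was used for the expansion, but unfortunately that context is only a temporary one for obtaining the expanded cipher list and is not the one used when making the connection. See _expandCipherString for more.
The same result can be observed by taking a packet capture and inspecting the signature_algorithms extension. With SECLEVEL=0, SHA-1 should be listed there, and that is what the (broken) server expects in order to work. The capture shows that SHA-1 is missing, i.e. SECLEVEL is ignored.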
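The round trip described above can be illustrated with a minimal sketch. It is not Twisted's code: it uses the standard-library ssl module as a stand-in for the pyOpenSSL set_cipher_list/get_cipher_list calls, the helper name expand_cipher_string is hypothetical, and it assumes Python is linked against an OpenSSL build that understands @SECLEVEL. The security_level attribute is only available on Python 3.10 and later, so the sketch falls back to None on older versions.

```python
import ssl


def expand_cipher_string(cipher_string):
    """Return (cipher names, security level) for an OpenSSL cipher string.

    Hypothetical helper: mimics the "set the string, read the ciphers back"
    round trip with the stdlib ssl module instead of pyOpenSSL.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)
    # get_ciphers() returns one dict per enabled cipher suite.
    names = [cipher["name"] for cipher in ctx.get_ciphers()]
    # security_level exists on Python 3.10+; None means "not available".
    level = getattr(ctx, "security_level", None)
    return names, level


names, level = expand_cipher_string("DEFAULT:@SECLEVEL=0")

# "@SECLEVEL=0" is a directive, not a cipher suite, so no entry in the
# expanded list carries it.
print(any("SECLEVEL" in name for name in names))

# The level lives on the temporary context only; it is not part of the names.
print(level)
```

If only the expanded names are carried over to a fresh context, that context keeps its default security level, which matches the behaviour described in this answer.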
I don't see a real workaround other than digging into Twisted itself. A quick and dirty fix is to append @SECLEVEL=0 where set_cipher_list is called. So, in _sslverify.py, instead of

ctx.set_cipher_list(self._cipherString.encode("ascii"))

do this:

ctx.set_cipher_list(self._cipherString.encode("ascii") + b":@SECLEVEL=0")

Filed as a bug with Twisted: https://github.com/twisted/twisted/issues/11903
