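I found this code, and it seems reliable and efficient to me, but unfortunately it targets Python 2 and uses urllib2, while everyone says requests is faster. What is the Python 3 equivalent of the code below (or something more efficient or more reliable)?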
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import sys
import urllib2

# This script uses HEAD requests (with fallback in case of 405)
# to follow the redirect path up to the real URL
# (c) 2012 Filippo Valsorda - FiloSottile
# Released under the GPL license

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

class HEADRedirectHandler(urllib2.HTTPRedirectHandler):
    """
    Subclass the HTTPRedirectHandler to make it use our
    HeadRequest also on the redirected URL
    """
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if code in (301, 302, 303, 307):
            newurl = newurl.replace(' ', '%20')
            newheaders = dict((k,v) for k,v in req.headers.items()
                              if k.lower() not in ("content-length", "content-type"))
            return HeadRequest(newurl,
                               headers=newheaders,
                               origin_req_host=req.get_origin_req_host(),
                               unverifiable=True)
        else:
            raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)

class HTTPMethodFallback(urllib2.BaseHandler):
    """
    Fallback to GET if HEAD is not allowed (405 HTTP error)
    """
    def http_error_405(self, req, fp, code, msg, headers):
        fp.read()
        fp.close()
        newheaders = dict((k,v) for k,v in req.headers.items()
                          if k.lower() not in ("content-length", "content-type"))
        return self.parent.open(urllib2.Request(req.get_full_url(),
                                                headers=newheaders,
                                                origin_req_host=req.get_origin_req_host(),
                                                unverifiable=True))

# Build our opener
opener = urllib2.OpenerDirector()
for handler in [urllib2.HTTPHandler, urllib2.HTTPDefaultErrorHandler,
                HTTPMethodFallback, HEADRedirectHandler,
                urllib2.HTTPErrorProcessor, urllib2.HTTPSHandler]:
    opener.add_handler(handler())

response = opener.open(HeadRequest(sys.argv[1]))
print(response.geturl())
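By the way, a HEAD request is not really what I need. I only want to know whether a link is broken (on some sites, if you give them a bad code, they redirect you back to the site's home page, and I want my code to detect that as well). HEAD requests were the most efficient solution that came to mind, so if you know a better approach, I would appreciate that too.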
1 Answer
Take a look at Requests: http://docs.python-requests.org/en/master/
To perform a HEAD request, call requests.head() with the URL.
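You can then inspect the returned response object to get what you need, for example the status code via its status_code attribute.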
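The original snippets for this answer did not survive on this page, so here is a minimal sketch of the HEAD request and status check described above. The function name `head_status` and the 10-second timeout are illustrative assumptions, not part of the original answer. Note that `requests.head()` does not follow redirects unless asked to, so the sketch enables that explicitly to mirror the behavior of the script in the question.

```python
import sys

import requests


def head_status(url, timeout=10.0):
    """Send a HEAD request, following redirects.

    Returns a (status_code, final_url) tuple. The name and the default
    timeout are hypothetical choices made for this sketch.
    """
    # requests.head() leaves allow_redirects off by default, so turn it on
    # to reach the final URL the way the urllib2 script does.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.status_code, response.url


if __name__ == "__main__" and len(sys.argv) > 1:
    status, final_url = head_status(sys.argv[1])
    print(status, final_url)
```

Run as a script with a URL as the first argument, it prints the final status code and the URL reached after any redirects.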
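Update: if you want to check whether a page is actually alive, you need to perform a GET request. I have seen cases where a HEAD request returned a 200 response while a GET request to the same URL returned 500.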
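A GET-based check along those lines might look like the sketch below. It also tries to catch the case from the question where a bad link is redirected to the site's home page. The helper names `redirected_to_home` and `link_is_alive`, the timeout, and the rule "a deeper path that ends up on the root path counts as broken" are all assumptions made for illustration; real sites may need a different heuristic.

```python
from urllib.parse import urlsplit

import requests


def redirected_to_home(original_url, final_url):
    """Return True if a request for a deeper path ended up on the root page.

    Hypothetical heuristic: compares only the URL paths, ignoring
    trailing slashes, query strings, and host changes.
    """
    original_path = urlsplit(original_url).path.rstrip("/")
    final_path = urlsplit(final_url).path.rstrip("/")
    return original_path != "" and final_path == ""


def link_is_alive(url, timeout=10.0):
    """Check a link with GET; treat errors and home-page bounces as dead."""
    try:
        # stream=True defers downloading the body; the with-block
        # releases the connection once the headers have been inspected.
        with requests.get(
            url, stream=True, allow_redirects=True, timeout=timeout
        ) as response:
            if not response.ok:  # status code of 400 or above
                return False
            return not redirected_to_home(url, response.url)
    except requests.RequestException:
        # Connection failures, timeouts, too many redirects, and so on.
        return False
```

Using `stream=True` keeps the cost close to that of a HEAD request, since the response body is not downloaded unless it is read, while still exercising the server's real GET handling.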