Weird bugs on Heroku production deployment with requests and proxies

Posted by zqdjd7g9 on 2023-10-19 in Other


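I built an application that uses proxies and Python's requests module to check for websites that are not indexed. I scrape the Google results page, www.google.com/search?site:{url}&num=3, and look for a specific phrase that appears when Google cannot find that site.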

# checking logic
            response = self.proxy_request(INDEXING_SEARCH_STRING.format(current_url))
            if response.status_code != 200:
                return current_url, False, "failed"
            soup = bs4.BeautifulSoup(response.text, "html.parser")
            not_indexed_regex = re.compile("did not match any documents")
            if soup(text=not_indexed_regex):
                return current_url, False, "checked"
            else:
                print(response.text)
                return current_url, True, "checked"

# proxy requests
    def proxy_request(self, url, **kwargs):
        fail_count = 0
        max_failures = 3  # Adjust this threshold as needed
        print("Evaluating: ", self.url_manager.current_url_index, "URL: ", url)
        while fail_count < max_failures:
            current_proxy = self.proxy_manager.get_proxy_for_request()

            if current_proxy is None:
                ProgressManager.update_progress("All given proxy failed")
                return requests.get(url, **kwargs)
            
            try:
                response = requests.get(url, proxies=current_proxy, timeout=20)
                if response.status_code == 200:
                    print("Success!")
                    self.proxy_manager.update_proxy()
                    return response
                else:
                    print("Failed!",response.status_code)
                    ProgressManager.update_progress("Proxy failing with status code: " + str(response.status_code))
                    time.sleep(0.5)
                    self.proxy_manager.update_proxy()
            except Exception as e:
                print("Failed!", e)
                fail_count += 1
                self.proxy_manager.update_proxy()
                ProgressManager.update_progress(f"Request failed! {e.__class__.__name__}. ")
                break
        time.sleep(5)
        return requests.get(url,timeout=20)

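On my local machine it works fine with or without proxies. But when I deploy it on Heroku, it marks some non-indexed websites as True, "checked", even though the same application running on my own device handles them correctly.
However, it works correctly when no proxies are given; the errors appear only when proxies are supplied.
Also, if there is a simpler way to get around the H12 timeout error for long-running processes without running any additional server, please let me know.
It works on localhost, so I cannot debug the deployment effectively. Sometimes the proxy also fails with the error HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=site:{{URL}}/&num=1 (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 401 Auth Failed ip_blacklisted: 3.85.57.0/24'))). How can I solve this?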

heroku

Source: https://stackoverflow.com/questions/76943525/weird-bugs-on-heroku-production-deployment-with-requests-and-proxies

1 Answer


smtd7mpg1#

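I found that the results were being returned in different languages, so the expected pattern did not match any documents may or may not appear in the page.
A simple fix is to use a modified Google query: in www.google.com/search?site:{url}&num=3&hl=en, the hl=en part forces Google to return the page in English.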
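The fix above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the helper names build_search_url and looks_unindexed are hypothetical, and the q= parameter is assumed from the URL shown in the error message in the question. It uses only the standard library, so the resulting URL can be passed to the existing proxy_request method unchanged.

```python
from urllib.parse import urlencode

# English "no results" phrase used by the check in the question.
NOT_INDEXED_PHRASE = "did not match any documents"


def build_search_url(site_url: str, num: int = 3) -> str:
    """Build a Google search URL with the interface language pinned to English.

    Hypothetical helper: hl=en is the parameter suggested in the answer;
    q=site:... is assumed from the error URL quoted in the question.
    """
    params = {"q": f"site:{site_url}", "num": num, "hl": "en"}
    return "https://www.google.com/search?" + urlencode(params)


def looks_unindexed(page_text: str) -> bool:
    """Return True when the English 'no results' phrase is present."""
    return NOT_INDEXED_PHRASE in page_text


if __name__ == "__main__":
    print(build_search_url("example.com"))
```

Building the query with urlencode rather than string formatting keeps the site: value correctly escaped and ensures hl=en is always present, whichever proxy the request leaves through.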

2023-10-19
