python selenium headless: How to bypass Cloudflare detection with Selenium

hs1ihplo  posted on 2023-05-16  in  Python

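Hoping the mavens here can help me solve a Selenium/Cloudflare mystery. I can get a website to load in normal (non-headless) Selenium, but no matter what I try, I cannot get it to load in headless mode.
I have followed the suggestions from StackOverflow posts such as "Is there a version of Selenium WebDriver that is not detectable?". I have also gone through every property of the window and window.navigator objects and fixed all the differences between headless and non-headless, but somehow headless is still being detected. At this point I am extremely curious how Cloudflare is able to tell the difference. Thanks for your time!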
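One way to make the headless versus non-headless comparison systematic is to dump the primitive-valued navigator properties in each mode and diff the two results. The sketch below is only illustrative: NAVIGATOR_DUMP_JS and diff_fingerprints are hypothetical names, and the script is assumed to be passed to Selenium's driver.execute_script once per mode.

```python
# JavaScript intended for driver.execute_script(); it returns the
# string/number/boolean properties of window.navigator as a plain object.
NAVIGATOR_DUMP_JS = """
const out = {};
for (const key in navigator) {
    const value = navigator[key];
    if (['string', 'number', 'boolean'].includes(typeof value)) {
        out[key] = value;
    }
}
return out;
"""


def diff_fingerprints(normal: dict, headless: dict) -> dict:
    """Return {property: (normal_value, headless_value)} for every mismatch.

    A property that is absent in one mode is reported as None on that side.
    """
    diff = {}
    for key in sorted(set(normal) | set(headless)):
        a = normal.get(key)
        b = headless.get(key)
        if a != b:
            diff[key] = (a, b)
    return diff
```

With two drivers running, something like diff_fingerprints(normal_driver.execute_script(NAVIGATOR_DUMP_JS), headless_driver.execute_script(NAVIGATOR_DUMP_JS)) would list the remaining differences. It only covers properties that are visible from JavaScript, so an empty diff does not prove the two modes are indistinguishable.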

List of things I have tried:

  • Setting the user agent
  • Replacing cdc_ with another string in chromedriver
  • options.add_experimental_option("excludeSwitches", ["enable-automation"])
  • options.add_experimental_option('useAutomationExtension', False)
  • options.add_argument('--disable-blink-features=AutomationControlled') (this was necessary to get the website to load in non-headless mode)
  • Setting navigator.webdriver = undefined
  • Setting navigator.plugins, navigator.languages, and navigator.mimeTypes
  • Setting window.screenY, window.screenTop, window.outerWidth, and window.outerHeight to nonzero values
  • Setting window.chrome and window.navigator.chrome
  • Setting the width and height of images to nonzero values
  • Setting the WebGL parameters
  • Fixing Modernizr
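The cdc_ replacement from the list above amounts to a byte-level substitution in the chromedriver binary. A minimal sketch, assuming the marker appears literally as the bytes cdc_ and that the replacement should have the same length so the size of the binary does not change; replace_cdc_marker and the default replacement xyz_ are hypothetical.

```python
def replace_cdc_marker(binary: bytes, replacement: bytes = b"xyz_") -> bytes:
    """Return a copy of the binary with every b"cdc_" swapped for replacement.

    The replacement is required to be 4 bytes long so that the patched
    binary has exactly the same size as the original.
    """
    marker = b"cdc_"
    if len(replacement) != len(marker):
        raise ValueError("replacement must be exactly 4 bytes long")
    return binary.replace(marker, replacement)
```

In practice the bytes would be read from a copy of the chromedriver executable and written back to that copy, keeping the original file as a backup.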

Reproducing the experiment

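To get the website to load in normal (non-headless) Selenium, you have to follow a _blank link from another page, so that the target website opens in a new tab. To reproduce the experiment, first create an html file with the content <a href="https://poocoin.app" target="_blank">link</a>, then paste the path to that html file into the code below.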
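The one-link page can also be generated from Python, which avoids typing the file:// path by hand. This is a small sketch rather than part of the original experiment: write_link_page and its default file name url_page.html are assumptions chosen to match the path used in the question.

```python
from pathlib import Path

# The exact markup described in the question.
LINK_HTML = '<a href="https://poocoin.app" target="_blank">link</a>'


def write_link_page(directory: str, filename: str = "url_page.html") -> str:
    """Write the one-link page into directory and return its file:// URL."""
    path = Path(directory).resolve() / filename
    path.write_text(LINK_HTML, encoding="utf-8")
    # as_uri() needs an absolute path, which resolve() guarantees.
    return path.as_uri()
```

The returned string can be assigned to FULL_PATH_TO_HTML_FILE.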
The version below (non-headless) works fine and loads the website, but if you set options.headless = True, it gets stuck on the Cloudflare page.

from selenium import webdriver
import time

# Replace this with the path to your html file
FULL_PATH_TO_HTML_FILE = 'file:///Users/simplepineapple/html/url_page.html'

def visit_website(browser):
    browser.get(FULL_PATH_TO_HTML_FILE)
    time.sleep(3)

    # find_elements_by_xpath was removed in Selenium 4.3; this form works in Selenium 3 and 4
    links = browser.find_elements("xpath", "//a[@href]")
    links[0].click()
    time.sleep(10)

    # Switch webdriver focus to new tab so that we can extract html
    tab_names = browser.window_handles
    if len(tab_names) > 1:
        browser.switch_to.window(tab_names[1])

    time.sleep(1)
    html = browser.page_source
    print(html)
    print()
    print()

    if 'Charts' in html:
        print('Success')
    else:
        print('Fail')

    time.sleep(10)

options = webdriver.ChromeOptions()
# If options.headless = True, the website will not load
options.headless = False
options.add_argument("--window-size=1920,1080")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36')

browser = webdriver.Chrome(options = options)

browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    "source": '''
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    });
    Object.defineProperty(navigator, 'plugins', {
            get: function() { return {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}}; }
    });
    Object.defineProperty(navigator, 'languages', {
        get: () => ["en-US", "en"]
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: function() { return {"0":{},"1":{},"2":{},"3":{}}; }
    });

    window.screenY=23;
    window.screenTop=23;
    window.outerWidth=1337;
    window.outerHeight=825;
    window.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    window.navigator.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    ['height', 'width'].forEach(property => {
        const imageDescriptor = Object.getOwnPropertyDescriptor(HTMLImageElement.prototype, property);

        // redefine the property with a patched descriptor
        Object.defineProperty(HTMLImageElement.prototype, property, {
            ...imageDescriptor,
            get: function() {
                // return an arbitrary non-zero dimension if the image failed to load
                if (this.complete && this.naturalHeight == 0) {
                    return 20;
                }
                return imageDescriptor.get.apply(this);
            },
        });
    });

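    // Keep a reference to the original method, which lives on the prototype
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
        // 37445 = UNMASKED_VENDOR_WEBGL
        if (parameter === 37445) {
            return 'Intel Open Source Technology Center';
        }
        // 37446 = UNMASKED_RENDERER_WEBGL
        if (parameter === 37446) {
            return 'Mesa DRI Intel(R) Ivybridge Mobile ';
        }

        // Call the original with the right receiver
        return getParameter.call(this, parameter);
    };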

    const elementDescriptor = Object.getOwnPropertyDescriptor(HTMLElement.prototype, 'offsetHeight');

    Object.defineProperty(HTMLDivElement.prototype, 'offsetHeight', {
        ...elementDescriptor,
        get: function() {
            if (this.id === 'modernizr') {
                return 1;
            }
            return elementDescriptor.get.apply(this);
        },
    });
    '''
})

visit_website(browser)

browser.quit()

ki1q1bka1#

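@undetected Selenium's answer works perfectly with https://github.com/diprajpatra/selenium-stealth
If you are using a recent version of Selenium, you need to change the executable_path argument, because it has been deprecated. Example code: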

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
)

driver.get("https://bot.sannysoft.com/")

print(driver.find_element(By.XPATH, "/html/body").text)

driver.close()

lo8azlld2#

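The only thing I can suggest is to improve your navigator plugins and MIME types, because detection scripts sometimes check their types, along the lines of typeof(navigator.plugins, 'PluginsArray'):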

Object.defineProperty(navigator, 'plugins', {
    get: () => {
        var ChromiumPDFPlugin = {};
        var plugin = {
            ChromiumPDFPlugin,
            description: 'Portable Document Format',
            filename: 'internal-pdf-viewer',
            length: 1,
            name: 'Chromium PDF Plugin',

        };
        plugin.__proto__ = Plugin.prototype;

        var plugins = {
            0: plugin,
            length: 1
        };
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        var mimeType = {
            type: 'application/pdf',
            suffixes: 'pdf',
            description: 'Portable Document Format',
            enabledPlugin: Plugin

        };
        mimeType.__proto__ = MimeType.prototype;

        var mimeTypes = {
            0: mimeType,
            length: 1
        };
        mimeTypes.__proto__ = MimeTypeArray.prototype;
        return mimeTypes;
    },
});

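A good site for checking what goes wrong in headless mode is https://bot.sannysoft.com/
You can run in headless mode and take a snapshot of the page to check whether everything passes.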
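The snapshot check could be scripted roughly as follows. This is a sketch under a few assumptions: snapshot_page is a hypothetical helper, driver is any Selenium WebDriver instance that has already been created (headless or not), and the fixed delay is only a crude way to let the page's tests finish rendering.

```python
import time


def snapshot_page(driver, url: str, png_path: str, settle_seconds: float = 5.0) -> bool:
    """Load url, wait briefly, and save a PNG screenshot to png_path.

    Returns whatever driver.save_screenshot returns, which Selenium
    documents as True on success and False on an I/O error.
    """
    driver.get(url)
    time.sleep(settle_seconds)  # crude wait for the checks to render
    return driver.save_screenshot(png_path)


# Example call, assuming `driver` was created elsewhere:
# snapshot_page(driver, "https://bot.sannysoft.com/", "sannysoft.png")
```

Comparing the screenshot from a headless run with one from a normal run shows which rows of the test table differ.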
Also, sometimes navigator still contains the webdriver property even when navigator.webdriver is set to undefined. You can simply remove it with the code below:

const newProto = navigator.__proto__;
delete newProto.webdriver;
navigator.__proto__ = newProto;

bpsygsoo3#

pip install undetected-chromedriver
You can use this module.


ep6jt1vc4#

If you retrieve the user agent with the latest Google Chrome v96.0, first in normal mode and then in headless mode, you get:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/96.0.4664.110 Safari/537.36

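In most cases, the presence of the additional Headless string/parameter/attribute is what gets the client flagged as a bot, and Cloudflare blocks access to the website.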
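The difference between the two user-agent strings above is easy to check and normalise in code. A minimal sketch, where is_headless_user_agent and strip_headless_token are hypothetical helper names; it only addresses the user-agent string and says nothing about other detection signals.

```python
HEADLESS_TOKEN = "HeadlessChrome"


def is_headless_user_agent(user_agent: str) -> bool:
    """True if the user agent carries the HeadlessChrome product token."""
    return HEADLESS_TOKEN in user_agent


def strip_headless_token(user_agent: str) -> str:
    """Replace the HeadlessChrome product token with plain Chrome."""
    return user_agent.replace(HEADLESS_TOKEN, "Chrome")
```

The cleaned string could then be passed to Chrome with options.add_argument(f"user-agent={cleaned}"), the same mechanism the question's code uses.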

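Solution

There are different approaches to evade Cloudflare detection even when using Chrome in headless mode. Some effective ones are as follows:

  • An effective solution is to use undetected-chromedriver to initialize the Chrome browsing context. undetected-chromedriver is an optimized Selenium Chromedriver patch that does not trigger anti-bot services such as Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.
  • Code block: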
import undetected_chromedriver as uc
from selenium import webdriver

options = webdriver.ChromeOptions() 
options.headless = True
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = uc.Chrome(options=options)
driver.get('https://bet365.com')

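You can find a couple of relevant detailed discussions in:

  • Selenium app redirects to the Cloudflare page when hosted on Heroku
  • Is there any possible way to bypass Cloudflare security checks?
  • The most effective solution is to use Selenium Stealth to initialize the Chrome browsing context. selenium-stealth is a Python package for preventing detection. The project tries to make Python Selenium more stealthy.
  • Code block: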
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

driver.get("https://bot.sannysoft.com/")

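You can find a couple of relevant detailed discussions in:

  • Can a website detect when you are using Selenium with chromedriver?
  • How to automate login to a site which is detecting my attempts to log in using selenium-stealth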

ogq8wdun5#

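Cloudflare's IUAM (I'm Under Attack Mode) protection is mainly meant to fend off DDoS attacks, so it also protects websites from being exploited by automated bots. Whatever you use on the client side, Cloudflare's servers will fingerprint you. After that, they send the client a cf_clearance cookie, which allows you to connect for the next 15 minutes.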
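Whether a session currently holds a clearance cookie can be checked from the cookie list. The sketch below assumes cookies in the shape returned by Selenium's driver.get_cookies() (a list of dicts with a name key and an optional expiry in seconds since the epoch); has_valid_cf_clearance is a hypothetical helper, and the cookie's own expiry may differ from the 15-minute window described above.

```python
import time


def has_valid_cf_clearance(cookies, now=None) -> bool:
    """True if the cookie list contains a cf_clearance cookie that has not expired."""
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") != "cf_clearance":
            continue
        expiry = cookie.get("expiry")  # absent for session cookies
        if expiry is None or expiry > now:
            return True
    return False


# Example call, assuming `driver` was created elsewhere:
# has_valid_cf_clearance(driver.get_cookies())
```

A False result after the challenge page suggests the fingerprinting step did not pass, regardless of which client-side patches were applied.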
