Selenium: How do I inject/execute JavaScript in a page before any of the page's other scripts load/execute?

f45qwnt8 posted on 2023-11-15 in Java

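I'm using the Selenium Python WebDriver to browse some pages. I want to inject JavaScript code into a page before any other JavaScript code is loaded and executed. In other words, I need my JS code to run as the very first JS code on that page. Is there a way to do this with Selenium?
I searched Google for several hours but couldn't find a proper answer!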

mzillmmw #1

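Selenium now supports the Chrome DevTools Protocol (CDP) API on Chromium-based drivers, so executing a script on every page load is straightforward. Here is an example: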

driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': 'alert("Hooray! I did it!")'})

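It runs that script on every page load, before any of the page's own scripts. More details are in the Chrome DevTools Protocol documentation for Page.addScriptToEvaluateOnNewDocument.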
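A slightly fuller sketch of the same call, assuming a Chromium-based driver (Chrome or Edge), since execute_cdp_cmd is not offered by the Firefox driver. The helper names build_init_script and install_init_script and the __injected_config global are made up for illustration; only execute_cdp_cmd and the CDP method name come from the answer above.

```python
import json


def build_init_script(global_name: str, value) -> str:
    """Return JS that sets window[global_name]; json.dumps keeps the quoting safe."""
    return f"window[{json.dumps(global_name)}] = {json.dumps(value)};"


def install_init_script(driver, source: str) -> str:
    """Register `source` to run in every new document, before the page's own scripts.

    `driver` is assumed to be a Chromium-based Selenium driver.
    Returns the identifier that CDP assigns to the registered script.
    """
    result = driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": source}
    )
    return result["identifier"]
```

The registration has to happen before driver.get(...) is called, for example install_init_script(driver, build_init_script("__injected_config", {"debug": True})). It applies to documents created afterwards, not to the page that is already loaded.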

vbopmzt1 #2

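Since version 1.0.9, selenium-wire has been able to modify responses to requests. Below is an example of this feature that injects a script into a page before it reaches the web browser.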

from seleniumwire import webdriver
from gzip import compress, decompress

from lxml import html
from lxml.etree import ParserError
from lxml.html import builder

script_elem_to_inject = builder.SCRIPT('alert("injected")')

def inject(req, req_body, res, res_body):
    # various checks to make sure we're only injecting the script on appropriate responses
    # we check that the content type is HTML, that the status code is 200, and that the encoding is gzip
    if res.headers.get_content_subtype() != 'html' or res.status != 200 or res.getheader('Content-Encoding') != 'gzip':
        return None
    try:
        parsed_html = html.fromstring(decompress(res_body))
    except ParserError:
        return None
    try:
        parsed_html.head.insert(0, script_elem_to_inject)
    except IndexError: # no head element
        return None
    return compress(html.tostring(parsed_html))

drv = webdriver.Firefox(seleniumwire_options={'custom_response_handler': inject})
drv.header_overrides = {'Accept-Encoding': 'gzip'} # ensure we only get gzip encoded responses

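Another way to control a browser remotely and inject a script before the page content loads is to use a library built on a separate protocol altogether, such as the Chrome DevTools Protocol.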

hgc7kmma #3

If you want to inject something into a page's HTML before the browser parses and executes it, I suggest using a proxy such as mitmproxy.
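A minimal sketch of what such a proxy script might look like, written as a mitmproxy addon module. The names inject_first_script, HEAD_OPEN_RE and SCRIPT_BODY_TO_INJECT are illustrative, and the regex-based insertion is a simplification that only handles pages with an explicit head tag.

```python
import re

SCRIPT_BODY_TO_INJECT = 'alert("injected")'
# Matches an opening <head> tag (with or without attributes), but not <header>.
HEAD_OPEN_RE = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_first_script(html_text: str, js: str) -> str:
    """Insert an inline <script> right after the opening <head> tag.

    Returns the text unchanged when there is no <head> tag.
    """
    match = HEAD_OPEN_RE.search(html_text)
    if match is None:
        return html_text
    end = match.end()
    return html_text[:end] + f"<script>{js}</script>" + html_text[end:]


def response(flow) -> None:
    """mitmproxy event hook, called once per completed HTTP response."""
    content_type = flow.response.headers.get("content-type", "")
    if flow.response.status_code != 200 or not content_type.startswith("text/html"):
        return
    # .text is the body decoded from its content-encoding and charset;
    # assigning to it re-encodes the body.
    body = flow.response.text
    if body:
        flow.response.text = inject_first_script(body, SCRIPT_BODY_TO_INJECT)
```

Saved as inject.py (a hypothetical filename), this could be loaded with mitmdump -s inject.py, with the Selenium-driven browser configured to use the proxy. For HTTPS pages the browser also has to trust mitmproxy's CA certificate.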

xmjla07d #4

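If you cannot modify the page content, you can use a proxy, or a content script in an extension installed in the browser. Doing it within Selenium, you would write some code that injects the script as a child of an existing element, but you cannot make it run before the page has loaded (that is, before the driver's get() call returns).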

String name = (String) ((JavascriptExecutor) driver).executeScript(
    "(function () { ... })();" ...

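The documentation does not specify when the code starts executing. You would want it to start before the DOM begins loading, and that guarantee can probably only be met by the proxy or extension content script route.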
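For the extension route, a rough sketch of generating a minimal unpacked extension from Python. The function name, the inject.js filename and the extension name are made up for illustration. The "world": "MAIN" key is an assumption about a recent Chrome version; without it the content script runs in an isolated world and does not share JS globals with the page.

```python
import json
from pathlib import Path


def write_content_script_extension(target_dir: str, js: str) -> Path:
    """Write a minimal unpacked extension that runs `js` at document_start."""
    ext_dir = Path(target_dir)
    ext_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "manifest_version": 3,
        "name": "Early script injector (illustrative)",
        "version": "1.0",
        "content_scripts": [
            {
                "matches": ["<all_urls>"],
                "js": ["inject.js"],
                # document_start: run before the DOM is built and before
                # any of the page's own scripts.
                "run_at": "document_start",
                # Assumption: supported by the target browser version.
                "world": "MAIN",
            }
        ],
    }
    (ext_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    (ext_dir / "inject.js").write_text(js, encoding="utf-8")
    return ext_dir
```

How the resulting directory is loaded into the Selenium-driven browser depends on the browser and driver version, so that step is left out here.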
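If you can instrument your page with a minimal harness, you can detect the presence of a special URL query parameter and load additional content, but you need to do so with an inline script. Pseudo-code: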

<html>
    <head>
       <script type="text/javascript">
       (function () {
       if (location && location.href && location.href.indexOf("SELENIUM_TEST") >= 0) {
          var injectScript = document.createElement("script");
          injectScript.setAttribute("type", "text/javascript");

          //another option is to perform a synchronous XHR and inject via innerText.
          injectScript.setAttribute("src", URL_OF_EXTRA_SCRIPT);
          document.documentElement.appendChild(injectScript);

          //optional. cleaner to remove. it has already been loaded at this point.
          document.documentElement.removeChild(injectScript);
       }
       })();
       </script>
    ...
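On the Selenium side, the test would then open the page with the marker in the URL. A small sketch of a helper for that, assuming the marker name SELENIUM_TEST from the pseudo-code above; the function name add_test_marker is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_test_marker(url: str, marker: str = "SELENIUM_TEST") -> str:
    """Append `marker=1` to the URL's query string, keeping existing parameters."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((marker, "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A test would then call driver.get(add_test_marker("http://example.com/a?x=1")), and the inline script's location.href check would match.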

n7taea2i #5

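I know this is a few years old, but I've found a way to do this without modifying the page's content and without using a proxy! I'm using the Node.js version, but presumably the API is consistent across the other languages too. What you want to do is the following: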

const {Builder, Capabilities} = require('selenium-webdriver');

(async () => {
  const capabilities = new Capabilities();
  capabilities.setPageLoadStrategy('eager'); // Options are 'eager', 'none', 'normal'
  const driver = await new Builder().forBrowser('firefox').withCapabilities(capabilities).build();
  try {
    // With 'eager', get() returns once the DOM is ready, without waiting for subresources.
    await driver.get('http://example.com');
    await driver.executeScript(`
      console.log('hello');
    `);
  } finally {
    await driver.quit();
  }
})();

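The 'eager' option works for me. You may need to use the 'none' option instead. Documentation: https://seleniumhq.github.io/selenium/docs/api/javascript/module/selenium-webdriver/lib/capabilities_exports_PageLoadStrategy.html
EDIT: Note that the 'eager' option has not been implemented in Chrome yet.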
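Since the question is about the Python bindings, here is a rough Python counterpart. It assumes Selenium 4, where browser options objects expose a page_load_strategy attribute; the helper name with_page_load_strategy is made up for illustration.

```python
VALID_STRATEGIES = ("normal", "eager", "none")


def with_page_load_strategy(options, strategy: str = "eager"):
    """Set the page load strategy on a Selenium options object and return it.

    `options` is assumed to be something like webdriver.FirefoxOptions().
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown page load strategy: {strategy!r}")
    options.page_load_strategy = strategy
    return options
```

The options would then be passed to the driver, for example webdriver.Firefox(options=with_page_load_strategy(webdriver.FirefoxOptions())). This only makes get() return earlier; scripts in the page's head may already have run by the time execute_script is called.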

nfeuvbwi #6

An updated version of @Mattwmaster58's answer that works with the latest selenium-wire (5.1.0 at the time of writing). It also adds support for nonce attributes on inline script tags.

from lxml import html
from lxml.etree import ParserError
from lxml.html import builder
from seleniumwire import webdriver
from seleniumwire.request import Request, Response
from seleniumwire.thirdparty.mitmproxy.net.http import encoding as decoder

SCRIPT_BODY_TO_INJECT = 'alert("injected")'

def has_mime_type(header: str, expected_type: str) -> bool:
    return header == expected_type or header.startswith(expected_type + ";")

def response_interceptor(request: Request, response: Response) -> None:
    content_type = response.headers.get("Content-Type")
    if (
        response.status_code != 200
        or not content_type
        or not has_mime_type(content_type, "text/html")
    ):
        return

    encoding = response.headers.get("Content-Encoding", "identity")
    try:
        parsed_html = html.fromstring(decoder.decode(response.body, encoding))
    except ParserError:
        return

    # Preserve nonce attribute to allow inline script.
    # https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/nonce
    attrs = {}
    if (nonce_script := parsed_html.find(".//script[@nonce]")) is not None:
        attrs["nonce"] = nonce_script.get("nonce")
    try:
        injected_script = builder.SCRIPT(SCRIPT_BODY_TO_INJECT, **attrs)
        parsed_html.head.insert(0, injected_script)
    except IndexError:  # No head element.
        return

    response.body = decoder.encode(
        html.tostring(parsed_html.getroottree()), encoding
    )
    del response.headers["Content-Length"]  # Avoid duplicate header.
    response.headers["Content-Length"] = str(len(response.body))

def main():
    with webdriver.Firefox() as session:
        session.response_interceptor = response_interceptor
        session.get("https://example.com")

if __name__ == "__main__":
    main()

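As an alternative to generating the output with lxml (which may change the structure of the HTML), you can also insert the tag with a regular expression and preserve the existing formatting: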

from lxml import html
from lxml.etree import ParserError
from lxml.html import builder
from mimeparse import parse_mime_type
from seleniumwire import webdriver
from seleniumwire.request import Request, Response
from seleniumwire.thirdparty.mitmproxy.net.http import encoding as decoder
import re

SCRIPT_BODY_TO_INJECT = 'alert("injected")'
HEAD_TAG_RE = re.compile(r"<head\s*>()", re.IGNORECASE)
INLINE_SCRIPT_TAG_RE = re.compile(
    r"()<script\b(?:(?!\bsrc\b\s*=\s*['\"]).)*?>", re.IGNORECASE
)

def response_interceptor(request: Request, response: Response) -> None:
    content_type = response.headers.get("content-type")
    if not content_type:
        return

    mime_type, mime_subtype, mime_params = parse_mime_type(content_type)
    if (
        response.status_code != 200
        or mime_type != "text"
        or mime_subtype != "html"
    ):
        return

    encoding = response.headers.get("content-encoding", "identity")
    charset = mime_params.get("charset", "iso-8859-1")
    try:
        decoded_body = decoder.decode(response.body, encoding).decode(charset)
        parsed_html = html.fromstring(decoded_body)
    except ParserError:
        return

    # Preserve nonce attribute to allow inline script.
    # https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/nonce
    attrs = {}
    if (nonce_script := parsed_html.find(".//script[@nonce]")) is not None:
        attrs["nonce"] = nonce_script.get("nonce")

    # Some sites inject scripts before the DOCTYPE, which isn't valid markup
    # but still runs.
    # default=None avoids a ValueError when neither pattern matches.
    if m := min((x for regex in (INLINE_SCRIPT_TAG_RE, HEAD_TAG_RE)
                 if (x := regex.search(decoded_body))),
                key=lambda x: x.start(), default=None):
        injected_script_text = html.tostring(
            builder.SCRIPT(SCRIPT_BODY_TO_INJECT, **attrs), encoding="unicode"
        )
        replacement = (
            m.string[m.start(): m.start(1)]
            + injected_script_text
            + m.string[m.start(1): m.end()]
        )
        modified_body = m.string[:m.start()] + replacement + m.string[m.end():]

        response.body = decoder.encode(modified_body.encode(charset), encoding)
        del response.headers["Content-Length"]  # Avoid duplicate header.
        response.headers["Content-Length"] = str(len(response.body))

def main():
    with webdriver.Firefox() as session:
        session.response_interceptor = response_interceptor
        session.get("https://example.com")

if __name__ == "__main__":
    main()
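The nonce handling can also be looked at in isolation. When a page's Content-Security-Policy only allows inline scripts carrying a specific nonce, an injected inline script needs that same nonce or the browser will refuse to run it. Below is a standard-library-only sketch of that step; find_nonce and build_script_tag are illustrative names, and the regex is a simplification compared with the lxml lookup used above.

```python
import html
import re

# First <script ...> tag carrying a nonce attribute; group 2 is the nonce value.
NONCE_RE = re.compile(
    r"""<script\b[^>]*?\snonce\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL
)


def find_nonce(html_text: str):
    """Return the nonce of the first script tag that has one, or None."""
    match = NONCE_RE.search(html_text)
    return match.group(2) if match else None


def build_script_tag(js: str, nonce=None) -> str:
    """Build an inline script tag, reusing the page's nonce when one is given."""
    attr = f' nonce="{html.escape(nonce, quote=True)}"' if nonce else ""
    return f"<script{attr}>{js}</script>"
```

The result of build_script_tag(SCRIPT_BODY_TO_INJECT, find_nonce(decoded_body)) could then be inserted at the position chosen by the interceptor.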
