regex: Replace invalid HTML tags, turning < and > into &lt; and &gt;

Posted by avwztpqn on 2023-08-08 in Other


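When a bot sends a message to Telegram using HTML formatting, the request fails if the text contains angle brackets that are not part of a valid tag. Those brackets have to be replaced with &lt; and &gt;, but writing them by hand throughout a long text is inconvenient. I would like a regular expression that replaces them automatically without touching valid HTML tags. For example:
A valid tag that must not be replaced: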

<a href="tg://user?id=1">Bot</a>

An invalid tag that needs to be replaced:

<argument>

Here is the code I tried to write, but it does not work properly:

import re

def replace_invalid_tags(html_string):
    invalid_tag_pattern = r'<[^a-zA-Z/!?](.*?)>'
    fixed_html = re.sub(invalid_tag_pattern, r'&lt;\1&gt;', html_string)
    return fixed_html

html_string = '<a href="#">Link</a> <argument1> <argument2>'
fixed_html = replace_invalid_tags(html_string)
print(fixed_html)

Source: https://stackoverflow.com/questions/76830215/replace-invalid-html-tags-with-and-to-lt-and-gt

1 Answer

#1 xzlaal3s

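Personally, I would recommend using a Python HTML parsing or sanitizing library for this job. Regular expressions are great and I love them, but in some situations I prefer a well-tested library that was built specifically to solve the problem.
I'm not a Python programmer but mainly a PHP programmer. In good CMS projects you can add a sanitizing library such as HTMLPurifier and define rules.
In your case, some tags should be converted to HTML entities so that they display as plain text, while other tags must be kept as they are. And of course some attributes and specific tags should be removed as well (for example <img onload="alert('xss attack')"> or <script>alert('bad')</script>). This is where a parser or sanitizing library will do a much safer job.
Let's say these tags are allowed: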

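<a>, with the href attribute. Other attributes probably should not be allowed. Typically, I would strip style="font-size: 100px".
<strong> and <em>, without attributes. What about the old <b> and <i> tags? I would convert them to <strong> and <em> respectively, because they can help readability but are not allowed in Telegram. (Telegram's Bot API HTML mode does in fact accept <b> and <i>, so for that target the conversion is optional.)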

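All other tags should be converted to <var> (if allowed), with their content converted to HTML entities (< to &lt; and > to &gt;). It may be safer to handle other conversions as well if needed.
In Python, I see that you could use the html-sanitizer library.
I see that it lets you define preprocessor functions, typically to transform some tags when needed. This means you could write a function that converts every unauthorized tag into a <var> or <pre> tag and fills its content with the escaped HTML equivalent of the tag that was found. Some prebuilt preprocessor functions already exist, such as bold_span_to_strong(), so there are examples to start from.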
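To illustrate the parser-based idea without relying on a third-party API, here is a minimal sketch that uses Python's standard `html.parser` module instead of html-sanitizer. The names `ALLOWED_TAGS`, `TagEscaper` and `escape_invalid_markup` are hypothetical, and the whitelist is simply the one assumed in this answer (`<a>` with `href`, `<strong>`, `<em>`). Treat it as a starting point rather than a complete sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tag name -> attribute names that may be kept.
ALLOWED_TAGS = {"a": {"href"}, "strong": set(), "em": set()}


class TagEscaper(HTMLParser):
    """Rebuild the input, keeping allowed tags and escaping everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            kept = ""
            for name, value in attrs:
                if name in ALLOWED_TAGS[tag]:
                    kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
            self.parts.append("<" + tag + kept + ">")
        else:
            # Not allowed: emit the raw tag text as escaped plain text.
            self.parts.append(escape(self.get_starttag_text(), quote=False))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit them once, not twice.
        self.handle_starttag(tag, attrs)
        if tag in ALLOWED_TAGS:
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        closing = "</" + tag + ">"  # the parser reports tag names in lowercase
        if tag in ALLOWED_TAGS:
            self.parts.append(closing)
        else:
            self.parts.append(escape(closing, quote=False))

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def escape_invalid_markup(text):
    parser = TagEscaper()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(escape_invalid_markup('<a href="tg://user?id=1">Bot</a> <argument>'))
```

Comments, doctypes and processing instructions are dropped by this sketch because the corresponding handlers are not overridden. The `href` value is not validated here, so a `javascript:` URL would pass through unchanged.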
A pure regex solution to find the invalid tags could look like this:

<\s*/?\s*(?!(?:a|strong|em)\b)\w+[^>]*>

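Example: https://regex101.com/r/xiZl1n/1
I accept optional whitespace and the slash of a closing tag, then use a negative lookahead to avoid matching the tags you want to accept. I added a word boundary \b after the valid tag names so that the lookahead does not treat <argument> as valid just because it starts with the letter "a". I only want to match whole words.
Then you can decide what to do with all your matches. If you want to replace < with &lt; directly, you can do it like this:
https://regex101.com/r/xiZl1n/3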
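Since the question uses Python, the pattern above could be applied with `re.sub` roughly as follows. This is a minimal sketch: `INVALID_TAG` and `escape_invalid_tags` are hypothetical names, and the case-insensitive flag is an assumption so that `<A>` or `<EM>` also count as valid.

```python
import html
import re

# Same pattern as above: any tag whose name is not a, strong or em.
INVALID_TAG = re.compile(r'<\s*/?\s*(?!(?:a|strong|em)\b)\w+[^>]*>', re.IGNORECASE)


def escape_invalid_tags(text):
    """Escape < and > only inside the tags matched as invalid."""
    return INVALID_TAG.sub(lambda m: html.escape(m.group(0), quote=False), text)


print(escape_invalid_tags('<a href="#">Link</a> <argument1> <argument2>'))
```

Only complete tag-like matches are escaped, so stray characters such as `->` or `>.<` are left untouched by this version.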

Edit: handling `->`, `>.<`, `=>`, etc.

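I still believe a parser is the best option. But you asked whether the regex could be modified to handle more cases. Personally, I don't think a single regex can do it, and it would certainly be unsafe.
But as I commented, you could try doing it in several steps:
1. If you think it's worthwhile, convert <b> tags to <strong> and <i> tags to <em>.
2. Find all the valid tags, such as <a>, <strong> and <em>, and replace them with [a], [strong] and [em] respectively. This can be done with the following pattern: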

<\s*(/?)\s*(?=(?:a|strong|em)\b)(\w+[^>]*)>

and replacing with [\1\2]. In action: https://regex101.com/r/xiZl1n/4

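3. Now you can replace < with &lt; and > with &gt;:
< becomes &lt;: https://regex101.com/r/xiZl1n/5
> becomes &gt;: https://regex101.com/r/xiZl1n/6
4. Convert the valid tags back to correct HTML. This is the same regular expression as the one that put the valid tags in brackets, but it matches square brackets instead of angle brackets: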

\[\s*(/?)\s*((?:a|strong|em)\b)([^\]]*)\]

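and replacing with <\1\2\3>. In action: https://regex101.com/r/xiZl1n/8
At this step, capture group 3 contains all the tag attributes. This is where you can filter them to accept only specific ones such as href, id or title, and drop all the others (for example class, style, onclick).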
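The steps above can be sketched in Python with the same patterns. The function name `escape_except_allowed` is hypothetical, attributes are passed through unfiltered, and the `re.IGNORECASE` flag is an assumption.

```python
import re

FLAGS = re.IGNORECASE


def escape_except_allowed(text):
    # Step 1: convert <b> to <strong> and <i> to <em>, dropping their attributes.
    text = re.sub(r'<\s*(/?)\s*b\b[^>]*>', r'<\1strong>', text, flags=FLAGS)
    text = re.sub(r'<\s*(/?)\s*i\b[^>]*>', r'<\1em>', text, flags=FLAGS)
    # Step 2: protect the accepted tags by turning them into [bracket] tags.
    text = re.sub(r'<\s*(/?)\s*(?=(?:a|strong|em)\b)(\w+[^>]*)>', r'[\1\2]',
                  text, flags=FLAGS)
    # Step 3: escape every remaining angle bracket.
    text = text.replace('<', '&lt;').replace('>', '&gt;')
    # Step 4: turn the bracket tags back into real HTML tags.
    text = re.sub(r'\[\s*(/?)\s*((?:a|strong|em)\b)([^\]]*)\]', r'<\1\2\3>',
                  text, flags=FLAGS)
    return text


print(escape_except_allowed('<a href="#">Link</a> <argument1> -> >.<'))
```

One limitation of the bracket placeholder: literal text such as `[em]` that is already in the input is also turned into a tag by step 4.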
It may be important to make the patterns case-insensitive with the i flag. This is what it looks like in JavaScript:

const input = `<a href="https://www.example.com/page">Example page</a> is a <strong>valid</strong> tag.
< EM class="something"> is also allowed</em>
But <param> or <argument> should be escaped.
This also: <br/> <
br /> <img onload="alert('xss')" src="http://images.com/nice-cat.jpg" alt="Nice cat" />
<script type="text/javascript">alert('Bad JS code')</` + `script>
<a href="javascript:alert('XSS attack')" onclick="alert('xss')">A bad link</a>
<a href = http://test.com title="This is just a test">test.com</a>

Turn left <- or turn right ->
Also, accept this => or a smiley >.<

Accept <B>bold</B> and <i style="color:green">italic without style</i> converted to new tags.
Also strip <b href="https://www.google.com">wrong attributes</b>`;

// Attributes to drop are all design changes done with classes or style
// and all attributes such as onload, onclick, etc.
const regexAttributeToDrop = /^(?:style|class|on.*)$/i;
// The attributes which can have a URL.
const regexAttributeWithURL = /^(?:href|xlink:href|src)$/i;
// Only accept absolute URLs and not bad stuff like javascript:alert('xss')
const regexValidURL = /^(https?|ftp|mailto|tel):/i;

/**
 * Filter tag attributes, based on the tag name, if provided.
 *
 * @param string attributes All the attributes of the tag.
 * @param string tagName Optional tag name (in lowercase).
 * @return string The filtered string of attributes.
 */
function filterAttributes(attributes, tagName = '') {
  // Regex for attribute: $1 = attribute name, $2 = value, $3 = simple/double quote or nothing.
  const regexAttribute = /\s*([\w-]+)\s*=\s*((?:(["']).*?\3|[^"'=<>\s]+))/g;
  
  attributes = attributes.replace(regexAttribute, function (attrMatch, attrName, attrValue, quote) {
    // Don't keep attributes that can change the rendering or run JavaScript.
    if (attrName.match(regexAttributeToDrop)) {
      return '';
    }
    // Not an attribute to drop.
    else {
      // If the attribute is "href" or "xlink:href" then only accept full URLs
      // with the correct protocols and only for <a> tags.
      if (attrName.match(/^(?:xlink:)?href$/i)) {
        // If it's not a link then drop the attribute.
        if (tagName !== 'a') {
          return '';
        }
        // The attribute value can be quoted or not so we'll remove them.
        const url = attrValue.replace(/^['"]|['"]$/g, '');
        // If the URL is valid.
        if (url.match(regexValidURL)) {
          return ` ${attrName}="${url}"`;
        }
        // Invalid URL: drop href and notify visually.
        else {
          return ' class="invalid-url" title="Invalid URL"';
        }
      }
      // All other attributes: nothing special to do.
      else {
        return ` ${attrName}=${attrValue}`;
      }
    }
  });

  // Clean up: trim spaces around. If it's not empty then just add a space before.
  attributes = attributes.trim();
  if (attributes.length) {
    attributes = ' ' + attributes;
  }
  
  return attributes;
}

const steps = [
  {
    // Replace <b> by <strong>.
    search: /<\s*(\/?)\s*(b\b)([^>]*)>/gi,
    replace: '<$1strong>' // or '<$1strong$3>' to keep the attributes.
  },
  {
    // Replace <i> by <em>.
    search: /<\s*(\/?)\s*(i\b)([^>]*)>/gi,
    replace: '<$1em>' // or '<$1em$3>' to keep the attributes.
  },
  {
    // Transform accepted HTML tags into bracket tags.
    search: /<\s*(\/?)\s*(?=(?:a|strong|em)\b)(\w+[^>]*)>/gi,
    replace: '[$1$2]'
  },
  {
    // Replace "<" by "&lt;".
    search: /</g,
    replace: '&lt;'
  },
  {
    // Replace ">" by "&gt;".
    search: />/g,
    replace: '&gt;'
  },
  {
    // Transform the accepted tags back to correct HTML.
    search: /\[\s*(\/?)\s*((?:a|strong|em)\b)([^\]]*)\]/gi,
    // For the replacement, we'll provide a callback function
    // so that we can alter the attributes if needed.
    replace: function (fullMatch, slash, tagName, attributes) {
      // Convert the tag name to lowercase.
      tagName = tagName.toLowerCase();
      // If the slash group is empty then it's the opening tag.
      if (slash === '') {
        attributes = filterAttributes(attributes, tagName);
        return '<' + tagName + attributes + '>';
      }
      // The closing tag.
      else {
        return '</' + tagName + '>';
      }
    }
  },
  {
    // Optional: inject <br /> tags at each new lines to preserve user input.
    search: /(\r?\n)/gm,
    replace: '<br />$1'
  }
];

let output = input;

steps.forEach((step, i) => {
  output = output.replace(step.search, step.replace);
});

document.getElementById('input').innerText = input;
document.getElementById('output').innerText = output;
document.getElementById('rendered-output').innerHTML = output;

body {
  font-family: Georgia, "Times New Roman", serif;
  margin: 0;
  padding: 0 1em 1em 1em;
}

p:first-child { margin-top: 0; }

pre {
  background: #f8f8f8;
  border: 1px solid gray;
  padding: .5em;
  overflow-x: scroll;
}

.box {
  background: white;
  box-shadow: 0 0 .5em rgba(0, 0, 0, 0.1);
  padding: 1em;
}

.invalid-url {
  color: darkred;
}

<h1>Quick &amp; dirty HTML filtering with regular expressions</h1>

<p>Input:</p>
<pre id="input"></pre>

<p>Output:</p>
<pre id="output"></pre>

<p>Rendered output:</p>
<div id="rendered-output" class="box"></div>
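For Python users, the attribute filtering done by filterAttributes in the snippet above could be approximated as follows. This is a rough sketch, not a faithful port: `filter_attributes` and the pattern names are hypothetical. Adding the `tg` scheme is an assumption, made because the question links to `tg://user?id=1`, which `regexValidURL` above would reject. Unlike the JavaScript version, it rebuilds the attribute list from the matches and silently drops invalid URLs instead of adding a class.

```python
import re

# Group 1 = attribute name, group 2 = value, group 3 = quote character (if any).
ATTRIBUTE = re.compile(r'''\s*([\w:-]+)\s*=\s*((["']).*?\3|[^"'=<>\s]+)''')
# Attributes that change rendering or can run JavaScript.
DROP = re.compile(r'^(?:style|class|on.*)$', re.IGNORECASE)
# Accepted URL schemes; "tg" is an assumption for Telegram deep links.
VALID_URL = re.compile(r'^(?:https?|ftp|mailto|tel|tg):', re.IGNORECASE)


def filter_attributes(attributes, tag_name=""):
    """Return the filtered attribute string, with a leading space if non-empty."""
    kept = []
    for match in ATTRIBUTE.finditer(attributes):
        name, value, quote = match.groups()
        if DROP.match(name):
            continue
        if name.lower() in ("href", "xlink:href"):
            url = value[1:-1] if quote else value
            # Links are only kept on <a> tags and only with an accepted scheme.
            if tag_name == "a" and VALID_URL.match(url):
                kept.append('{}="{}"'.format(name, url))
            continue
        kept.append("{}={}".format(name, value))
    return "".join(" " + item for item in kept)


print(filter_attributes(' href="https://example.com" onclick="x()" style="a"', "a"))
```

It could be called from the last step of the pipeline, on the captured attribute group, before the tag is rebuilt.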
