Regex to remove all empty HTML tags

Posted by yebdmbv4 on 2023-06-25 in Other


This is my PHP function for removing all empty HTML tags from an input string:

/**
 * Remove the nested HTML empty tags from the string.
 *
 * @param $string String to remove tags
 * @param null $replaceTo Replace empty string with
 * @return mixed Cleaned string
 */
function crl_remove_empty_tags($string, $replaceTo = null)
{
    // Return if string not given or empty
    if (!is_string($string) || trim($string) == '') return $string;

    // Recursive empty HTML tags
    return preg_replace(
        '/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm',
        !is_string($replaceTo) ? '' : $replaceTo,
        $string
    );
}

My regex: /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm
I tested it with http://gskinner.com/RegExr/ and http://regexpal.com/ and it works fine there. But when I try to run it, the server always returns this error:

Warning: preg_replace(): Unknown modifier '\'

I have no idea what is going wrong. Can someone help me?

regex

Source: https://stackoverflow.com/questions/21051428/regex-to-remove-all-empty-html-tags

7 Answers


r6vfmomb1#

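In PHP regexes, the delimiter must be escaped if it appears inside the expression.
In this case there are two unescaped / characters; just replace them with \/. You also don't need the string of modifiers: g is not a valid modifier in PHP (preg_replace() already replaces every match by default), and the pattern contains no literal letters for i to affect.
Before: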

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

After:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/
//                                                                    ^       ^
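
A related option that the answer above does not mention: pick a delimiter that never appears in the pattern, so no slash needs escaping at all. The following is only a minimal sketch. The function name crl_remove_empty_tags_fixed is hypothetical, ~ is an arbitrary choice of delimiter, and the second quoted-attribute alternative is assumed to have been meant for single quotes.

```php
<?php
/**
 * Sketch of the question's function with '~' as the delimiter.
 * Assumption: the duplicated "[^"]*" alternative was meant to be '[^']*'.
 */
function crl_remove_empty_tags_fixed($string, $replaceTo = '')
{
    // Return the input unchanged if it is not a non-empty string
    if (!is_string($string) || trim($string) === '') {
        return $string;
    }

    // No modifiers: preg_replace() replaces all matches by default
    $pattern = '~<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>~';

    return preg_replace($pattern, is_string($replaceTo) ? $replaceTo : '', $string);
}

echo crl_remove_empty_tags_fixed('<p>Hi</p><span class="x"></span>');
```

Note that this removes only one level of empty tags per call; nested empty tags need repeated passes.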


nqwrtyyt2#

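This pattern can remove "empty tags" (that is, non-self-closing tags that contain nothing, or only whitespace, HTML comments, or other "empty tags"), even when those tags are nested like <span><span></span></span>. Tags inside HTML comments are not taken into account: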

$pattern = <<<'EOD'
~
<
(?:
    !--[^-]*(?:-(?!->)[^-]*)*-->[^<]*(*SKIP)(*F) # skip comments
  |
    ( # group 1
        (\w++)     # tag name in group 2
        [^"'>]* #'"# all that is not a quote or a closing angle bracket
        (?: # quoted attributes
            "[^\\"]*(?:\\.[^\\"]*)*+" [^"'>]* #'"# double quote
          |
            '[^\\']*(?:\\.[^\\']*)*+' [^"'>]* #'"# single quote
        )*+
        >
        \s*
        (?:
            <!--[^-]*(?:-(?!->)[^-]*)*+--> \s* # html comments
          |
            <(?1) \s*                          # recursion with the group 1
        )*+
        </\2> # closing tag
    ) # end of the group 1
)
~sxi
EOD;

$html = preg_replace($pattern, '', $html);

Limitations:

This approach will remove script tags that link to external JavaScript files:

<script src="myscript.js"></script>

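The pattern may also remove part of embedded JavaScript code if it finds something like this:

var myvar="<span></span>";

or like this:

var myvar1="<span></span>";

These limitations exist because a basic text approach cannot tell HTML apart from JavaScript code. It is possible to work around this by adding script tags to the pattern's skip list (the same way as HTML comments), but then you essentially need to describe JavaScript content (strings, comments, regex literals, and everything that is not one of those three), which is not a trivial task, though it is possible.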
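
To illustrate the skip-list idea, here is a rough sketch that is not the full solution described above. It assumes a script element always ends at the first </script>, so it does not describe JavaScript strings or comments at all, and it uses a simplified, non-recursive empty-tag pattern in place of the one in this answer. The variable names are made up for the example.

```php
<?php
// Sketch: skip whole <script> elements, then remove simple empty tags.
// Assumption: a script element ends at the first "</script>".
$pattern = '~
    <script\b[^>]*>.*?</script>(*SKIP)(*F)   # match a script element, then discard the match
  |
    <(\w+)\b[^>]*>\s*</\1>                   # a simple empty element
~six';

$html  = '<script src="a.js"></script><p></p><script>var s="<span></span>";</script>';
$clean = preg_replace($pattern, '', $html);

echo $clean;
```

With this input, both script elements should survive and only the empty <p></p> should be removed. A script whose string literals contain </script> would still break this sketch.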


1l5u6lss3#

Remove the empty elements... and then the elements that become empty as a result.
For example:

<p>Hello!
   <div class="foo"><p id="nobody">
   </p>
      </div>
 </p>

The result will be:

<p>Hello!</p>

PHP code:

/* $html store the html content */
do {
    $tmp = $html;
    $html = preg_replace( '#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#', '', $html );
} while ( $html !== $tmp );


s8vozzvw4#

I'm not sure whether this is what you need, but I found this today. You need PHP 5.4+!

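// loadHTML() is an instance method; calling it statically throws an Error on PHP 8+
$oDOMHTML = new DOMDocument();
$oDOMHTML->loadHTML(
    $sYourHTMLString,
    LIBXML_HTML_NOIMPLIED |
    LIBXML_HTML_NODEFDTD |
    LIBXML_NOBLANKS |
    LIBXML_NOEMPTYTAG
);
// Caveat: the PHP manual documents LIBXML_NOEMPTYTAG as a save-time option that
// expands <br/> to <br></br>, so check the output before relying on it.
$sYourHTMLStringWithoutEmptyTags = $oDOMHTML->saveXML();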

Maybe this works for you.
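
If the DOM route is attractive, one way to actually delete empty elements is to walk the parsed tree with DOMXPath. This is a sketch under assumptions, not the method from the answer above: remove_empty_elements() and $keepTags are hypothetical names, the keep list is illustrative rather than complete, and the input is assumed to be an ASCII HTML fragment (loadHTML() needs extra care with other encodings).

```php
<?php
// Sketch: remove empty elements via DOMDocument + DOMXPath instead of a regex.
function remove_empty_elements(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    // Wrap the fragment in one <div> so the document has a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root  = $doc->documentElement;
    $xpath = new DOMXPath($doc);
    // Elements that are legitimately empty and should be kept (assumed list).
    $keepTags = ['br', 'hr', 'img', 'input', 'link', 'meta', 'script'];

    do {
        $removed = 0;
        // Elements with no child elements and no non-whitespace text.
        $nodes = $xpath->query('//*[not(*) and not(normalize-space())]');
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->isSameNode($root) || in_array($node->nodeName, $keepTags, true)) {
                continue;
            }
            $node->parentNode->removeChild($node);
            $removed++;
        }
    } while ($removed > 0); // repeat, because a parent may have just become empty

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_empty_elements('<p>Hello!<span></span></p><div><p> </p></div>');
```

Because the parser understands attributes and nesting, this avoids the quoting pitfalls of the regex approaches. The trade-off is that the output is re-serialized, so it may not match the input byte for byte.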


emeijp435#

You can also solve this with recursion. Keep passing the HTML blob back into the function until no empty tags remain.

public static function removeHTMLTagsWithNoContent($htmlBlob) {
    $pattern = "/<[^\/>][^>]*><\/[^>]+>/";

    if (preg_match($pattern, $htmlBlob) == 1) {
        $htmlBlob = preg_replace($pattern, '', $htmlBlob);
        return self::removeHTMLTagsWithNoContent($htmlBlob);
    } else {
        return $htmlBlob;
    }
}

This checks for empty HTML tags and replaces them until the regex pattern no longer matches.
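
The same repeat-until-stable idea can also be written as a loop that uses the optional fifth argument of preg_replace(), which reports how many replacements were made. This is a sketch only: remove_empty_tags_loop() is a hypothetical name, and the pattern is a simplified one that assumes attribute values never contain a > character.

```php
<?php
// Sketch: loop until preg_replace() reports zero replacements.
function remove_empty_tags_loop(string $html): string
{
    // Simplified pattern; assumes no ">" inside attribute values.
    $pattern = '~<([a-z][a-z0-9]*)\b[^>]*>\s*</\1\s*>~i';

    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            break; // regex engine error: keep the last good string
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}

echo remove_empty_tags_loop('<div><p><span></span></p></div><b>x</b>');
```

Compared with the recursive version, this runs the regex once per pass instead of twice (preg_match() plus preg_replace()) and does not grow the call stack on deeply nested input.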


8ftvxx2r6#

Here is another way to remove all empty tags. (It also removes the surrounding tags if they become empty because their children were empty.)

/**
 * Remove empty tags.
 * This one will also remove <p><a href="/foo/bar.baz"><span></span></a></p> (empty paragraph with empty link)
 * But it will not alter <p><a href="/foo/bar.baz"><span>[CONTENT HERE]</span></a></p> (since the span has content)
 *
 * Be aware: <img ../> will be treated as an empty tag!
 */
do
{
    $len1 = mb_strlen($string);
    $string = preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $string);
    $len2 = mb_strlen($string);

} while ($len1 > 0 && $len2 > 0 && $len1 != $len2);

I have been using this to clean up HTML from an external CMS, with good results.


zzlelutf7#

$string = '<p>Some <b>HTML</b> <strong>text. </strong> <hr></p>';
// Note: this pattern strips every tag, not only the empty ones
$clean_string = preg_replace('#<[^>]+>#', '', $string);
echo $clean_string; // Some HTML text.

