regex: Capturing quoted and unquoted strings into the same regex capture group

q7solyqu, posted on 2023-04-13 in Other

Regex flavor: PCRE2 (PHP >= 7.3)

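I have a multi-line string that contains several <img> tags.
Using a regex, I want to capture:
a.) the entire img tag that contains a src attribute, and
b.) the content of that src attribute.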

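  • The src attribute may be enclosed in "" or '', or not be quoted at all.
  • If quoted, the attribute should not contain " or '
  • If unquoted, the attribute should not contain \s or >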

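It took me a whole day to get this working, but I need help improving it. The problem is that quoted src attributes are captured in $matches[2], while unquoted ones end up in $matches[3]. Since I need all paths in a single match group, I copy $matches[2] into $matches[3]. I would rather have the captured data go directly into the same capture group.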

$code = <<<EOD
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;

$regex='/(<IMG(?=\s).*\sSRC\s*=\s*
    (?(?=["\'])
        .(.+?) ["\'] 
     |         
        (.+?) [\s>] 
    )
    (?(?<!>).*?>)
)/ix';

preg_match_all($regex, $code, $matches);
echo PHP_EOL . "Matches:";
// print all groups:
print_r($matches);

// copy matches captures in $matches[2] to $matches[3]
foreach($matches[2] as $a=>$b) 
    if ($b != "")
        $matches[3][$a] = $b;

// print the whole captured img tags:
print_r($matches[1]);
// print just the captured paths:
print_r($matches[3]);

Output:

Matches:Array
(
    [0] => Array
        (
            [0] => <img width src = one height>
            [1] => <img src=two height>
            [2] => <img width src="three">
            [3] => <img src = 'four' />
            [4] => <img src =five>
        )

    [1] => Array
        (
            [0] => <img width src = one height>
            [1] => <img src=two height>
            [2] => <img width src="three">
            [3] => <img src = 'four' />
            [4] => <img src =five>
        )

    [2] => Array
        (
            [0] =>
            [1] =>
            [2] => three
            [3] => four
            [4] =>
        )

    [3] => Array
        (
            [0] => one
            [1] => two
            [2] =>
            [3] =>
            [4] => five
        )

)
Array
(
    [0] => <img width src = one height>
    [1] => <img src=two height>
    [2] => <img width src="three">
    [3] => <img src = 'four' />
    [4] => <img src =five>
)
Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => four
    [4] => five
)

(And yes, I know: regex should not be used to scrape HTML at all, since that is not recommended.)

8tntrjer #1

Ideally, you would use a parser for this. Your regex could be updated to something like this:

<IMG(?=\s).*\sSRC\s*=\s*(['"])?(.+?)(?:\1|>|\s)

This should be closer to what you are trying to achieve. It uses one capture group instead of two for the attribute content.
https://regex101.com/r/qYB9B7/1
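
To make the group numbering concrete, here is a minimal sketch that runs the pattern above against the sample input from the question. The `/` delimiters and the `i` flag are assumptions carried over from the question's own code. Group 1 holds the optional opening quote and group 2 holds the attribute value, whether or not it was quoted.

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$code = <<<'EOD'
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;

// The pattern from this answer; delimiters and the i flag are assumed.
$regex = '/<IMG(?=\s).*\sSRC\s*=\s*([\'"])?(.+?)(?:\1|>|\s)/i';

preg_match_all($regex, $code, $matches);

// $matches[1]: the opening quote, or an empty string for unquoted values.
// $matches[2]: the src value for every matched tag.
print_r($matches[2]);
```

For this input, `$matches[2]` contains one, two, three, four and five. Two caveats: the overall match in `$matches[0]` ends right after the delimiter that follows the value, so it does not necessarily cover the whole tag, and a quoted value that contains whitespace would be cut at the first whitespace character because `\s` is one of the allowed terminators.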

5jvtdoz2 #2

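As you already mentioned, you know that a parser would be the better option; here is another regex option.
If you want to match either a value between quotes, or a run of non-whitespace characters without quotes or angle brackets, you can also use a named capture group together with the J flag, which allows duplicate subpattern names.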

<IMG(?=\s)[^<>]*\sSRC\s*=\s*(?:(['"])(?<att>[^'"]+)\1|(?<att>[^\s'"<>]+))[^<>]*>

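Explanation

  • <IMG(?=\s) Match <IMG and assert a whitespace character to the right
  • [^<>]* Match optional characters other than < or >
  • \sSRC\s*=\s* Match a whitespace character, SRC, and an equals sign between optional whitespace characters
  • (?: Non-capture group for the 2 alternatives
  • (['"])(?<att>[^'"]+)\1 Capture either ' or " in group 1, then match 1+ characters other than quotes in the named group att, followed by the same closing quote (using the backreference \1)
  • | Or
  • (?<att>[^\s'"<>]+) Match 1+ non-whitespace characters other than ' " < > in the named group att
  • ) Close the non-capture group
  • [^<>]*> Match optional characters other than < or >, then match >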


$re = '/<IMG(?=\s)[^<>]*\sSRC\s*=\s*(?:([\'"])(?<att>[^\'"]+)\1|(?<att>[^\s\'"<>]+))[^<>]*>/iJ';
$str = '<img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = \'four\' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);

Output

Array
(
    [0] => Array
        (
            [0] => <img width src = one height>
            [1] => 
            [att] => one
            [2] => 
            [3] => one
        )

    [1] => Array
        (
            [0] => <img src=two height>
            [1] => 
            [att] => two
            [2] => 
            [3] => two
        )

    [2] => Array
        (
            [0] => <img width src="three">
            [1] => "
            [att] => three
            [2] => three
        )

    [3] => Array
        (
            [0] => <img src = 'four' />
            [1] => '
            [att] => four
            [2] => four
        )

    [4] => Array
        (
            [0] => <img src =five>
            [1] => 
            [att] => five
            [2] => 
            [3] => five
        )

)

You can then loop over $matches and read the value of the att key.

foreach ($matches as $m) {
    echo $m["att"] . PHP_EOL;
}

Output

one
two
three
four
five
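
If numbered groups are preferred over named ones, PCRE's branch reset group `(?|...)` is another way to get a single capture group. This is an alternative technique, not the pattern used above: inside a branch reset, group numbering restarts in every alternative, so each branch writes its value to group 1. The sketch below assumes the same sample input and adapts the character classes to the question's rules (no quotes inside a quoted value, no whitespace or angle brackets in an unquoted one).

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$code = <<<'EOD'
1. <img width src = one height>
2. <notAtStart><img src=two height>
3. NotAtAll
4. <img width src="three"><notAtEnd>
5. <notAtStart><img src = 'four' /><notAtEnd>
6. <img src =five><test>
7. <img WithoutSrc>
EOD;

// (?| ... ) is a branch reset group: every alternative captures into group 1.
// Alternatives: double-quoted value, single-quoted value, unquoted value.
$regex = '/<img(?=\s)[^<>]*\ssrc\s*=\s*(?|"([^"\']+)"|\'([^"\']+)\'|([^\s"\'<>]+))[^<>]*>/i';

preg_match_all($regex, $code, $matches);

print_r($matches[0]); // the whole img tags
print_r($matches[1]); // just the src values
```

With this pattern, `$matches` only has indexes 0 and 1, so no copying between groups is needed and `$matches[0]` still holds the complete tag.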
