Perl: split on spaces selectively

Posted by krcsximq on 2022-12-19 in Perl


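I'm trying to split a string in Perl on the whitespace between elements. However, each element may itself contain whitespace (when it is enclosed in double quotes or in parentheses).
For example, given a string containing: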

for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE

I'd like to end up with an array like (hydrogen, helium, "carbon 14", "$(some stuff "here")", FILE).
I can deal with the `for element in` bit myself and get the rest as a single string. This is what I tried:

@elements = split /(?<=\"[^\"]*\")\s+(?=\"[^\"]*\")/, $list

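The regex matches only the whitespace between quotes (checked on regexr.com), but the Perl program gives me `Lookbehind longer than 255 not implemented in regex`.
Is there a better way to use split on whitespace that takes this into account? And where did my regex go wrong?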

perl

Source: https://stackoverflow.com/questions/63202473/perl-split-on-spaces-selectively

2 Answers


5w9g7ksd1#

Match a quoted or parenthesized expression first, and only *then* fall back to a sequence of non-whitespace characters:

my @elems = $string =~ / ( "[^"]+" | \S*\( [^)]+ \)\S* | \S+ ) /gx;

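Tested with your string and a few simple variations of it.
This assumes that neither kind of delimiter is nested: whatever sits between two consecutive quotes is taken as one element (even if it contains a parenthesized subexpression), and the same goes for whatever sits inside parentheses (even if it contains quotes). I inferred this from the question.
I allow the parentheses to be preceded and followed by a run of non-whitespace characters, to accommodate the leading `$`. If it really can *only* be a single leading dollar sign, adjust the pattern accordingly.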
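As a quick check of the pattern above, here is a minimal sketch that runs it on the string from the question. The `s///` that strips the leading `for element in` is my own assumption about how that part would be handled (the question only says it is easy to deal with), and `$list` is just an illustrative name.

use strict;
use warnings;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# Assumption: drop the leading "for <name> in" before tokenizing.
(my $list = $string) =~ s/^\s*for\s+\S+\s+in\s+//;

# Quoted or parenthesized alternatives come first, bare words last.
my @elems = $list =~ / ( "[^"]+" | \S*\( [^)]+ \)\S* | \S+ ) /gx;

# Print each element in brackets so embedded whitespace is visible.
print "[$_]\n" for @elems;

The order of the alternatives matters: if `\S+` came first, it would grab `"carbon` before the quoted alternative ever got a chance.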


s3fp2yjn2#

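In cases like this I tend toward a parsing approach, so that you don't need a single regex to do several different things. That matters because the complexity of the strings tends to change over time. Although this looks like more code, it is basic Perl, you can tuck it away in a subroutine, and I can easily add another token type without disturbing the mechanics of the code or rewriting the pattern. I also use this trick for getting an unknown number of captures from a pattern: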

use v5.10;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# The types of things you can match, going from most specific
# to least specific. Now you only need to describe what each
# individual thing looks like. Each pattern is responsible for
# the capture group $1, which is the thing we'll save.
my @patterns = (
    qr/ ( \$\( .+? \) ) /x,
    qr/ ( " .+? " )     /x,
    qr/ ( \S+ )         /x,
    );

my @tokens;
# The magic is global matching in scalar context,
# using /g. The \G anchor starts matching at the
# last position you matched in the prior match of
# the same string (that's in pos()). Normally that
# position is reset when a match fails, but /c
# prevents that so you can try other patterns. Once
# you match a pattern, save what you matched and
# move on.
#
# The pattern here also takes care of trailing whitespace.
while( pos($string) < length($string) ) {
    foreach my $pattern ( @patterns ) {
        next unless $string =~ m/ \G $pattern \s*/gcx;
        push @tokens, $1;
        last;
        }
    }

use Data::Dumper;
say Dumper( \@tokens );
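The answer mentions that this loop can live in a subroutine. Below is one possible sketch of that. The name `tokenize`, the leading-whitespace handling, and the `die` guard are my additions, not part of the original answer; the guard exists because the loop would otherwise spin forever if no pattern matched at the current position (for example, when the input starts with whitespace).

use strict;
use warnings;

# tokenize() is a hypothetical helper name, not from the original answer.
sub tokenize {
    my ($string) = @_;

    # Most specific pattern first; each one captures its token in $1.
    my @patterns = (
        qr/ ( \$\( .+? \) ) /x,   # $( ... )
        qr/ ( " .+? " )     /x,   # "..."
        qr/ ( \S+ )         /x,   # bare word
    );

    # Start after any leading whitespace so \G begins on a token.
    pos($string) = $string =~ /^(\s+)/ ? length $1 : 0;

    my @tokens;
    TOKEN: while ( pos($string) < length($string) ) {
        foreach my $pattern (@patterns) {
            # /c keeps pos() intact when a pattern fails to match.
            next unless $string =~ m/ \G $pattern \s* /gcx;
            push @tokens, $1;
            next TOKEN;
        }
        # Nothing matched here: stop instead of looping forever.
        die "Cannot tokenize at offset " . pos($string) . "\n";
    }
    return @tokens;
}

my @tokens = tokenize('  hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE');
print "[$_]\n" for @tokens;

Adding a new token type then means adding one `qr//` entry to `@patterns` at the right level of specificity, leaving the loop untouched.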

You can do much the same thing with the branch reset operator, `(?|...)`, where the capture in every alternative is numbered $1:

use v5.10;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

my @tokens = $string =~ m/
    (?|
        (?: ( \$ \( .+? \) ) ) |
        (?: ( " .+? "      ) ) |
        (?: ( \S+          ) )
    )
    /gx;

use Data::Dumper;
say Dumper( \@tokens );

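This is a bit more complicated than zdim's answer, but it is more flexible. Say you decide you don't want the quotes around `carbon 14`. That is a very easy fix because the structure of the regex doesn't change; you only change the subpattern that handles that token: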

(?|
        (?:   ( \$ \( .+? \) )   ) |
        (?: " ( .+?          ) " ) |
        (?:   ( \S+          )   )
    )
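For context, here is a sketch of that fragment dropped into a complete match, run on the element list from the question (with the `for element in` prefix already removed, which is my simplification). Since the quotes now sit outside the capture group, the quoted token comes back without them.

use strict;
use warnings;

my $string = 'hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# Branch reset: every alternative captures into $1, so a list-context
# /g match returns exactly one value per token.
my @tokens = $string =~ m/
    (?|
        (?:   ( \$ \( .+? \) )   ) |
        (?: " ( .+?          ) " ) |
        (?:   ( \S+          )   )
    )
    /gx;

# Brackets make the internal double space in "carbon  14" visible.
print "[$_]\n" for @tokens;

The `$(some stuff "here")` token keeps its inner quotes because the first alternative consumes the whole token before the quote alternative is tried.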

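You might not need that extra flexibility. I usually find that I run into additional odd cases in this sort of task, so I start with the flexible solution. Once you have done it a few times, it isn't a big deal.
As for your error, you got:
Lookbehind longer than 255 not implemented in regex
Before v5.30, you couldn't have a variable-width lookbehind at all. Now it is an experimental feature, but the pattern has to know ahead of time that the lookbehind won't be longer than 255 characters. Your pattern has `(?<=\"[^\"]*\")`, and `*` means zero or more. That "more" can be larger than 255, so the pattern is illegal.
regexr.com uses PCRE, whose name once meant "Perl Compatible", but the two have diverged quite a bit. Some things that look like they work fine in other languages don't work in Perl. That usually isn't a problem, but lookbehinds are one of the differences.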
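To see the lookbehind restriction for yourself, the following sketch compiles two patterns at run time and traps any compilation error with `eval`. The bounded `{0,20}` variant is my own illustration, not from the question. Whether it compiles depends on your Perl version (v5.30 or later is needed, and some versions emit an "experimental" warning), so no expected output is shown for it.

use strict;
use warnings;

# Patterns are built from strings so they compile at run time,
# where a block eval can trap the compilation error.
my $unbounded = '(?<="[^"]*")\s+';        # '*' has no upper bound
my $bounded   = '(?<="[^"]{0,20}")\s+';   # at most 22 characters wide

my %error;
for my $source ( $unbounded, $bounded ) {
    my $re = eval { qr/$source/ };
    $error{$source} = $re ? '' : $@;
    print "$source => ", ( $re ? "compiles\n" : "fails: $@" );
}

The unbounded pattern fails on every Perl version; only the wording of the error differs between versions before and after v5.30.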

