Perl: split one sentence with ｢ and square brackets into multiple sentences

Posted by q8l4jmvw on 2023-02-13 in Perl


I have sentences like this in my dictionary text data:
I have ｢an absolute [a deadly] abhorrence of ｢laziness [greasy food].
Is there a way (using Perl) to split it into the 4 sentences below, so that it is easier to search in the dictionary?
I have an absolute abhorrence of laziness.
I have an absolute abhorrence of greasy food.
I have a deadly abhorrence of laziness.
I have a deadly abhorrence of greasy food.

perl

Source: https://stackoverflow.com/questions/75403775/split-one-sentence-with-and-square-brackets-into-multiple

2 Answers


jfgube3f #1

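An interesting problem. Here is one solution.
For now, replace the left corner bracket ｢ with the ASCII character < and adjust the sentence accordingly. Take this example: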

word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end

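1. Split the string into tokens: substrings that hold an alternative, <...[...], and substrings that hold the groups of words between them. Then split each alternative substring into its two choices and store them in an arrayref. This gives an array like: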

('word', ['a A1', 'b b1'], 'and more', ['a A2', 'b b2'],
    'but', ['a A3', 'b b3'], 'end')

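2. Determine the indices of the alternatives (here 1, 3, 5).
3. Build all combinations of these indices as sets, that is, find all subsets (the power set). For an index in the subset, take the first choice when composing the sentence; for an index not in the subset, take the second (or the other way around).
4. Iterate over the token array and print, choosing between the alternatives as described above.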
I use Algorithm::Combinatorics for the combinations, but there are of course other libraries.
Here is the program for the test sentence above (ASCII characters only):

use warnings;
use strict;
use feature 'say';

use List::Util qw(any none);    
use Algorithm::Combinatorics qw(subsets);

my $str = q(word <a A1[b b1] and more <a A2[b b2] but <a A3[b b3] end);
say $str;

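# Split on the <...[...] groups (kept via the capturing parentheses),
# then turn each group into an arrayref of its two choices
my @tokens = 
    map { /^</ ? [ /<([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(<[^\[]+ \[[^\]]+\])/x, $str;
#say "@tokens";

# Indices of the tokens that hold alternatives
my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;
#say "@idx";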

my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;    # indices where the first choice is taken
    for my $iw (0..$#tokens) {
        # plain text token, first choice, or second choice
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

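This is a considerable simplification, given the variety of sentence structures and details found in natural language. The code has plenty of room for improvement and needs some cleanup (such as the extra spaces), but it does print every combination of the alternative phrases.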
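The extra spaces can be removed with a small post-processing step. The following is only a sketch: tidy_sentence is a hypothetical helper that is not part of the answer, and it assumes each generated sentence is first collected into a string rather than printed piece by piece.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original answer): normalize the spacing
# of one generated sentence.
sub tidy_sentence {
    my ($s) = @_;
    $s =~ s/\s+/ /g;             # collapse runs of whitespace into one space
    $s =~ s/^ | $//g;            # drop a leading or trailing space
    $s =~ s/ ([.,;:!?])/$1/g;    # no space before closing punctuation
    return $s;
}

print tidy_sentence("  I have   an absolute   abhorrence of   laziness  . "), "\n";
```

The punctuation rule is deliberately narrow; real dictionary text may need more cases (quotes, parentheses and so on).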
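The library can also generate one item at a time: called in scalar context, its functions return an iterator, and ->next on that iterator gives the next item. This matters for very large sets of items.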
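To illustrate the idea of producing one subset at a time without depending on the module, here is a minimal core-Perl sketch. It is a stand-in, not the library's implementation: subset_iterator is a hypothetical helper that walks a bitmask counter, so it assumes the number of items fits in a native integer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one subset (as an array
# reference) per call, and undef once all 2**n subsets have been produced.
sub subset_iterator {
    my @items = @_;
    my $mask  = 0;
    my $limit = 2 ** scalar @items;
    return sub {
        return if $mask >= $limit;
        # bit $_ of the counter decides whether item $_ is in this subset
        my @subset = map  { $items[$_] }
                     grep { $mask & ( 1 << $_ ) } 0 .. $#items;
        $mask++;
        return \@subset;
    };
}

my $next = subset_iterator( 1, 3, 5 );    # indices of the alternatives
while ( defined( my $subset = $next->() ) ) {
    print "[@$subset]\n";
}
```

Only one subset is held in memory at any moment, which is the property that matters when the number of alternatives grows.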
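Below is the program for the sentence given in the question. (The solution above uses the ASCII character < in place of ｢ because some systems still have trouble with Unicode. Apart from that, the programs are the same.)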

use warnings;
use strict;
use feature 'say';

use List::Util qw(any none);    
use Algorithm::Combinatorics qw(subsets);

use utf8;
use open qw(:std :encoding(UTF-8));

my $str = q(I have ｢an absolute [a deadly] abhorrence of ｢laziness [greasy food].);
say $str;

my @tokens = 
    map { /^｢/ ? [ /｢([^\[]+) \[([^\]]+)\]/x ] : $_ }
    split /(｢[^\[]+ \[[^\]]+\])/x, $str;

my @idx = grep { ref $tokens[$_] eq 'ARRAY' } 0..$#tokens;

my @subsets = subsets( \@idx );

for my $ss (@subsets) {
    my @take_0 = @$ss;
    for my $iw (0..$#tokens) {
        if    (none { $iw == $_ } @idx)    { print " $tokens[$iw] " }
        elsif (any  { $iw == $_ } @take_0) { print " $tokens[$iw]->[0] " }
        else                               { print " $tokens[$iw]->[1] " }
    }
    say '';
}

2023-02-13

klr1opcd #2

First, parse the sentence into

my @def = (
   [ "I have " ],
   [ "an absolute", "a deadly" ],
   [ " abhorrence of " ],
   [ "laziness", "greasy food" ],
   [ "." ],
);

This can be done with the following validating parser:

my @def;
for ( $str ) {
   / \G ( [^｢]+ ) /xgc
      and push @def, [ $1 ];

   if ( / \G ｢ /xgc ) {
      / \G ( [^｢\[\]]+ ) [ ] \[ ( [^｢\[\]]+ ) \] /xgc
         or die( "Bad sequence at offset ".( pos() - 1 )."\n" );

      push @def, [ $1, $2 ];
      redo;
   }

   /\G \z /xgc
      and last;

   die( "Should not happen" );
}

Then compute the Cartesian product. This can be done as follows:

use Algorithm::Loops qw( NestedLoops );

my $iter = NestedLoops( \@def );
while ( my @parts = $iter->() ) {
   say join "", @parts;
}

or

use Algorithm::Loops qw( NestedLoops );

NestedLoops( \@def, sub { say join "", @_; } );
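If Algorithm::Loops is not installed, the same product can be sketched in core Perl. This is an illustrative alternative rather than what the answer uses: cartesian_product is a hypothetical helper, and it builds all combinations in memory, so it suits sentences with a modest number of alternatives.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every way of picking one element from each
# group, with the rightmost group varying fastest.
sub cartesian_product {
    my @groups  = @_;
    my @results = ( [] );    # start with a single empty combination
    for my $group (@groups) {
        # extend every combination so far with each element of this group
        @results = map { my $prefix = $_; map { [ @$prefix, $_ ] } @$group } @results;
    }
    return @results;
}

my @def = (
    [ "I have " ],
    [ "an absolute", "a deadly" ],
    [ " abhorrence of " ],
    [ "laziness", "greasy food" ],
    [ "." ],
);

print join( "", @$_ ), "\n" for cartesian_product(@def);
```

With the @def shown above this prints the four sentences in the order listed in the question.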

2023-02-13
