Perl: tagging words in a corpus

Posted by jmp7cifd on 2022-11-15 in Perl


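I'm a linguist by trade and use Perl to help me organise my data, so I'm not at all proficient and my solutions are not very streamlined :) My problem today is that I want to tag every occurrence of the items in a wordlist within a corpus of example sentences.
I wrote a (sloppy) script for another experiment, but for each sentence it only identifies the first word from the list that it finds and then moves on to the next one. I need it to find all of the words in all of the sentences and, on top of that, assign each one the tag from the second column of the wordlist. I just can't get my head around how to do it. Help!
Say my wordlist looks like this (a tab-separated .txt file with one column for the word and one column for the tag to assign, one word and tag per line):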

boy (\t) ID01
sleep (\t) ID02
dog (\t) ID03
hungry (\t) ID04
home (\t) ID05

The raw corpus is just a tokenised .txt file with one sentence per line, e.g.:

The boy and his dog met a hungry lion .
They needed to catch up on sleep after the dog and the boy came home .

Ideally, I would like the output to look like this:
(1) The tagged corpus, in the format:

The boy [ID01] and his dog [ID03] met a hungry [ID04] lion .

They needed to catch up on sleep [ID02] after the dog [ID03] and the boy [ID01] came home [ID05].

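(2) A list of the words, with their tags, that were not found in the corpus at all. Previously I just printed these words to a .txt file, which worked fine.
I hope this makes sense! Below is what I used before to find sentences containing these words. The code is based on a much simpler wordlist without ID tags; I was only looking for any match at all, to see whether my corpus contained at least some examples. How can I best adapt it? It took me a long time to write, but I'm learning!
Thank you, thank you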

use strict;
use warnings;

my %words;
my %print;

open (IN2, "<LemmasFromTagset.MG.2022-10-17.txt"); #the wordlist

while (my $s = <IN2>)
{
    chomp $s;
    my $lcs = lc $s;
    $words{$lcs} = 1;
}
close(IN2);
open (OUT, ">TaggedSentences.txt"); #the final output with tagged sentences
open (OUT2, ">NotFound.txt"); #words for which there are no sentences

foreach my $word (sort keys (%words))
{
    open (IN,"<all-sentences_cleaned.tagged.desentensised.txt"); #the corpus
    
    print $word."\n";
    
    my $count = 0;
    
    while(my $s = <IN>)
    {
        chomp $s;
        my $lcs = lc $s;
        if ($lcs =~ /^(.*)(\W+)($word)(\W+)(.*)$/)
        {
        print OUT $word."\t".$s."\n";
        $count ++;
        }
    elsif ($lcs =~ /^($word)(\W+)(.*)$/)
    {
       print OUT $word."\t".$s."\n";
       $count ++;
    }
    }
    
    if ($count == 0)
    {
    print OUT2 $word."\n";
    }
    close(IN);
}
close(OUT);
close (OUT2);


Source: https://stackoverflow.com/questions/74390729/tagging-words-in-a-corpus

2 answers


Answer 1, by hfsqlsce

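I'm not sure I entirely understand the logic of your code, but the substitution itself is quite simple.
1. Process the key file, making the words the lowercase keys and the tags the values of the %words hash, e.g. $words{$key} = $value. We do this quickly and simply with a do block, in which we process the file with a map statement.
2. Build a regex that searches for the keywords, using the alternation operator |.
3. Read the input file, find and capture the keywords with parentheses (), keep the matched word with \K, and substitute in (that is, add) a space, an opening bracket, the tag looked up from the hash using the match lowercased with the \L escape sequence, and a closing bracket.
4. Print.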

use strict;
use warnings;

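my %words = do {
    open my $fh, "<", "words.tsv" or die "Cannot open words.tsv: $!";
    # lowercase the keys so that the "\L$1" lookup below always finds them
    map { chomp; my ($word, $tag) = split /\t/; (lc $word, $tag) } <$fh>;
};
# quotemeta guards against regex metacharacters in the word list
my $find = join '|', map { quotemeta($_) } keys %words;
while (<DATA>) {
    # \b restricts matches to whole words, so "boy" does not match "boyhood";
    # square brackets follow the output format requested in the question
    s/\b($find)\b\K/ [$words{ "\L$1" }]/ig;
    print;
}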

__DATA__
The boy and his dog met a hungry lion .
They needed to catch up on sleep after the dog and the boy came home .

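If you want the program to be more flexible, you can replace <DATA> with <>, pass the name of the file to process as an argument, and redirect the output to a file, e.g.: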

$ perl words.pl corpus.txt > output.txt
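
The substitution approach can also keep track of which words were ever matched, which gives the "not found" list asked for in part (2) of the question. The following is only a minimal sketch under stated assumptions: the helpers tag_for and tag_line are hypothetical names, the word list is hard-coded instead of being read from a file, and the entry cat / ID06 is invented to show a word that never occurs as a whole word (it is only a substring of "catch").

```perl
use strict;
use warnings;

# Hypothetical in-memory word list; "cat" => "ID06" is an invented extra entry
# that never occurs as a whole word in the sample sentences.
my %words = (
    boy    => 'ID01', sleep => 'ID02', dog => 'ID03',
    hungry => 'ID04', home  => 'ID05', cat => 'ID06',
);

# One alternation over all words; quotemeta escapes regex metacharacters.
my $find = join '|', map { quotemeta($_) } sort keys %words;

my %seen;    # lowercased words that matched at least once

# Hypothetical helper: record the hit and build the text to insert.
sub tag_for {
    my $word = lc shift;
    $seen{$word}++;
    return " [$words{$word}]";
}

# Hypothetical helper: append " [ID]" after every whole-word match.
sub tag_line {
    my ($line) = @_;
    $line =~ s{\b($find)\b\K}{ tag_for($1) }gie;
    return $line;
}

my @tagged = map { tag_line($_) } (
    'The boy and his dog met a hungry lion .',
    'They needed to catch up on sleep after the dog and the boy came home .',
);
print "$_\n" for @tagged;

# Words from the list that never matched in any sentence.
my @not_found = grep { !$seen{$_} } sort keys %words;
print "Not found: $_\t$words{$_}\n" for @not_found;
```

The /e modifier evaluates the replacement as Perl code, so the hit can be counted in the same pass that inserts the tag; the corpus is still read only once.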

2022-11-15

Answer 2, by bvn4nwqk

If I understand correctly, you would probably benefit from two hashes: one for the IDs and one for the frequencies.

my %words2ids;  # will be { "sleep" => "ID02", "boy" => "ID01", ...}

open(my $lemmas, "...") or die;
while (my $line = <$lemmas>) {
  chomp($line);
  my ($word, $id) = split "\t", $line;
  $words2ids{ lc($word) } = $id;   # note: lc($word)
}

Next, walk through the tokens of the raw corpus, counting and tagging the ones you are interested in:

my %freq;
open (my $output, "...") or die;
....
while (my $line = <$corpus>) {
  chomp($line);
  my @tokens = split ' ', $line;
  foreach my $token (@tokens) {
    my $lct = lc $token;
    if (my $id = $words2ids{ $lct }) { # false if no entry
      $freq{$lct}++;     # add one to the frequency count
      $token .= " [$id]";  # cheat a bit and append the bracketed $id to the aliased token
    }
   }
  # now reconstruct the input line for our output
  say { $output } "@tokens"; # use feature qw(say)
}

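In the example above, we run through the corpus once, rather than once for every word in the tag list.
Your "NotFound" entries are now the keys of %words2ids that do not appear in %freq: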

open (my $notfound, "...") or die;
foreach my $word (sort keys(%words2ids)) {
  next if exists $freq{$word};  # skip if we've seen it
  say { $notfound } "$word $words2ids{$word}";
}

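This isn't very idiomatic Perl, nor is it particularly slick Perl (and it certainly hasn't been tested!), but I think it captures your problem and the approach you want to take fairly well.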
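
For reference, the three fragments above can be assembled into one runnable script. This is a sketch rather than the answerer's tested code: the in-memory strings $wordlist and $corpus_text are hypothetical stand-ins for the elided file names, the tags are wrapped in square brackets to follow the format in the question, and the results are collected in the arrays @tagged and @not_found instead of being written to output files.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical sample data standing in for the two input files.
my $wordlist    = "boy\tID01\nsleep\tID02\ndog\tID03\nhungry\tID04\nhome\tID05\n";
my $corpus_text = "The boy and his dog met a hungry lion .\n";

# Pass 1: read the word list into a hash of lowercased word => ID.
my %words2ids;
open(my $lemmas, '<', \$wordlist) or die "Cannot read word list: $!";
while (my $line = <$lemmas>) {
    chomp $line;
    my ($word, $id) = split /\t/, $line;
    $words2ids{ lc $word } = $id;
}
close $lemmas;

# Pass 2: read the corpus once, tagging and counting every listed token.
my (%freq, @tagged);
open(my $corpus, '<', \$corpus_text) or die "Cannot read corpus: $!";
while (my $line = <$corpus>) {
    chomp $line;
    my @tokens = split ' ', $line;
    foreach my $token (@tokens) {
        my $lct = lc $token;
        my $id  = $words2ids{$lct};
        next unless defined $id;    # token is not in the word list
        $freq{$lct}++;
        $token .= " [$id]";         # $token aliases the element of @tokens
    }
    push @tagged, "@tokens";        # rebuild the line from the tagged tokens
}
close $corpus;

# Words in the list that never occurred in the corpus.
my @not_found = grep { !exists $freq{$_} } sort keys %words2ids;

say for @tagged;
say "$_\t$words2ids{$_}" for @not_found;
```

Because the corpus is already tokenised, splitting on whitespace is enough to isolate each word; an untokenised corpus would need punctuation stripped from the tokens before the hash lookup.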

2022-11-15
