如何检索数据,直到一个关键字在基因库与Perl?

insrf1ej  于 2023-06-06  发布在  Perl
关注(0)|答案(1)|浏览(173)

我在写一个脚本从基因库中检索数据。我只需要注解的COMMENT部分之前的信息。
这是我的输入:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
            Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
            http://www.ensemblgenomes.org/ for more information.

当前脚本:

$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;

$end= $#data;
for ($line= 0; $line<= $end; $line++){
    if ($data[$line] =~ /LOCUS/){
        @annotation = (@annotation, $data[$line]);
        until ($data[$line] =~ /COMMENT/){
            $line++;
            @annotation = (@annotation, $data[$line]);
}}}
print @annotation;

其输出:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the

正如你所看到的,这个方法有一个问题。
如何修改代码,使其检索数据,但在COMMENT处停止,而不检索整行?
所有GenBank文件的第一行都以LOCUS开头,我想这可以用来编写更好的代码(因此可以在没有正则表达式匹配的情况下完成)。我不知道该怎么做。我真的很感激你的输入!!

qyswt5oh

qyswt5oh1#

这看起来比它需要的要复杂得多。我会用这样的话:

use strict;
use warnings;

my $print = 0;
open(IN, '<', 'genbank.txt');
open(OUT, ">genbank-out-without-comment.txt") or die "opsala $!";
while(<IN>){
  if(/^LOCUS\s/){
    $print = 1;
  }
  if(/^COMMENT\s/i){
    print "\n"; # preserve new line between entries
    $print = 0;
  }
  print OUT if $print;
}
close(OUT);

相关问题