Trimming FASTA sequences in R before and after specific patterns

Posted by bejyjqdl on 2023-02-17 in Other

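I have a FASTA file containing 4 sequences. They are all similar protein sequences that differ by only a few mutations. The file is attached here: https://drive.google.com/file/d/11OKZs47wOqYRw11Akwb4zj2RRzdqSQsC/view?usp=sharing
I want to trim every sequence in the file around a specific pattern, keeping only the region that runs from "QCVN" to "RRAR". I cannot filter on the exact string, because there may be mutations between "QCVN" and "RRAR", so an exact match may not find the region. The parts without mutations are the start, "QCVN", and the end, "RRAR", which are the same in all 4 sequences. So, is it possible to trim the sequence before "QCVN" and after "RRAR"?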

>UPH85748.1 |surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSS"**QCVNLXTRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVT
WFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNA
TNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQ
GNFKNLREFVFKNIDGYFKIYSKHTPINLGRDLPQGFSALEPLVDLPIGINITRFQTLLA
LHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTL
KSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFDEVFNATRFASVYAWNRKRISNCVA
DYSVLYNLAPFFTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVXQXAPGQTGNIADYNY
KLPDDFTGCVIAWNSNKLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGNKPCNGV
AGFNCYFPLRSYGFRPTYGVGHQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNF
NGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTN
TSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECD
IPIGAGICASYQTQTKSHRRAR**"SVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVT
TEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFA
QVKQIYKTPPIKYFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGD
IAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMA
YRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNHNAQALNTLV
KQLSSKFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASAN
LAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICH
DGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQP
ELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQEL
GKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEP
VLKGVKLHYT

Below is the code I used to read the FASTA file:

# Load the Biostrings package
library(Biostrings)

# Read the FASTA file
fasta <- readAAStringSet("sequences-2.fasta")
print(fasta)

It gives this result:

AAStringSet object of length 4:
    width seq                                                                                              names               
[1]  1270 MFVFLVLLPLVSSQCVNLXTRTQSYTNSFTRGVYYPDKVFRSSVLHS...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UPH85748.1 |surfa...
[2]  1270 MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHS...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UUT03046.1 |surfa...
[3]  1270 MFVFLVLLPLVSSQCVNLRTRTQSYTNSFTRGVYYPDKVFRSSVLHS...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UYE44393.1 |surfa...
[4]  1271 MFVFLVLLPLVSSQCVNFRTRTQLPPAYTNSFTRGVYYPDKVFRSSV...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UYW62681.1 |surfa...

Then I tried to trim the part of each sequence before "QCVN" with this code:

# Loop through the sequences and trim the pattern
for (i in 1:length(fasta)) {
  sequence <- as.character(fasta[[i]]) # convert the sequence to a character string
  sequence <- gsub(paste("^.*", stop_pattern), stop_pattern, sequence) # remove everything that comes before the pattern
  fasta[[i]] <- AAString(sequence) # convert the trimmed sequence back to a AAString object
}

# Write the trimmed sequences to a new FASTA file
writeXStringSet(fasta, "file.fasta")

But it does not work: nothing is trimmed from the sequences. Is there a way to trim the sequence before "QCVN" and after "RRAR"?

Answer #1 (py49o6xq)

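You can handle both ends with a single gsub. Put the middle part in a capture group, as in .*(pattern).*, and use \\1 as the replacement: the whole match is replaced by the captured group, so only the middle part is kept.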

example <- "somethingbadQCVNsomethingelsebutgoodRAARsomethingbad"

gsub(".*(QCVN.*RAAR).*", "\\1", example)

Result:

[1] "QCVNsomethingelsebutgoodRAAR"
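
To apply the same idea to every record at once, here is a minimal base-R sketch. It assumes the sequences have first been converted to a plain character vector (for an AAStringSet, `as.character(fasta)`), and it uses "RRAR" as the end motif because that is what the posted sequence contains, unlike the toy string above. `trim_between` is a hypothetical helper name, not part of Biostrings. The lazy quantifier `.*?` (which needs `perl = TRUE`) keeps the span from the first "QCVN" to the first "RRAR" after it, whereas the greedy form extends to the last "RRAR". Sequences without a match come back as NA rather than unchanged, so a missing motif is easy to spot.

```r
# Hypothetical helper: keep only the region from `start` through `end`.
# `seqs` is a character vector; its names (if any) are carried over.
trim_between <- function(seqs, start = "QCVN", end = "RRAR") {
  pattern <- paste0(start, ".*?", end)        # lazy: first start to first end
  hit <- regexpr(pattern, seqs, perl = TRUE)  # -1 where there is no match
  out <- rep(NA_character_, length(seqs))
  out[hit > 0] <- regmatches(seqs, hit)       # matched substrings only
  names(out) <- names(seqs)
  out
}

# Toy input: record "a" has both motifs, record "b" lacks the end motif.
toy <- c(a = "MFVQCVNLXTRRARSVASRRARTT", b = "MFVQCVNLITRTQ")
trimmed <- trim_between(toy)
print(trimmed)

# With Biostrings (assumed to be installed), the round trip might look like:
#   seqs <- as.character(fasta)
#   keep <- trim_between(seqs)
#   writeXStringSet(AAStringSet(keep[!is.na(keep)]), "file.fasta")
```

Dropping the NA entries before writing is one possible design choice; keeping the untrimmed sequence for those records would be another.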

For more details, see this question.
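
As for why the loop in the question trimmed nothing, one likely cause is sketched below. It assumes `stop_pattern` was set to "QCVN" (its definition is not shown in the question). `paste()` joins its arguments with a space by default, so the regex becomes `^.* QCVN`, which never matches a protein sequence; `paste0()` builds the pattern without the space.

```r
stop_pattern <- "QCVN"                      # assumed value; not shown in the question
sequence <- "MFVFLVLLPLVSSQCVNLXTRTQ"       # toy prefix of the posted sequence

bad_regex  <- paste("^.*", stop_pattern)    # "^.* QCVN" (note the space)
good_regex <- paste0("^.*?", stop_pattern)  # "^.*?QCVN", lazy up to the first QCVN

unchanged <- sub(bad_regex, stop_pattern, sequence)
trimmed   <- sub(good_regex, stop_pattern, sequence, perl = TRUE)

print(unchanged)  # identical to the input: the pattern with the space never matched
print(trimmed)    # begins at QCVN
```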
