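I have a FASTA file containing 4 sequences. They are all similar protein sequences with very few mutations between them. The file is attached here: https://drive.google.com/file/d/11OKZs47wOqYRw11Akwb4zj2RRzdqSQsC/view?usp=sharing
I want to trim all of the FASTA sequences using a specific pattern, keeping only the part of each sequence of the form "QCVN...RRAR". However, I can't use a filter function directly, because there may be mutations between "QCVN" and "RRAR", so an exact match on the whole region won't work. The parts without mutations are the start "QCVN" and the end "RRAR", which are identical in all 4 collected sequences. So, is it possible to trim everything before **"QCVN"** and after **"RRAR"**?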
>UPH85748.1 |surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSS"**QCVNLXTRTQSYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVT
WFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNA
TNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQ
GNFKNLREFVFKNIDGYFKIYSKHTPINLGRDLPQGFSALEPLVDLPIGINITRFQTLLA
LHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTL
KSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFDEVFNATRFASVYAWNRKRISNCVA
DYSVLYNLAPFFTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVXQXAPGQTGNIADYNY
KLPDDFTGCVIAWNSNKLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGNKPCNGV
AGFNCYFPLRSYGFRPTYGVGHQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNF
NGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTN
TSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECD
IPIGAGICASYQTQTKSHRRAR**"SVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVT
TEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFA
QVKQIYKTPPIKYFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGD
IAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMA
YRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNHNAQALNTLV
KQLSSKFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASAN
LAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICH
DGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQP
ELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQEL
GKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEP
VLKGVKLHYT
Below is the code I used to read the FASTA file:
# Load the Biostrings package
library(Biostrings)
# Read the FASTA file
fasta <- readAAStringSet("sequences-2.fasta")
print(fasta)
It gives this result:
AAStringSet object of length 4:
width seq names
[1] 1270 MFVFLVLLPLVSSQCVNLXTRTQSYTNSFTRGVYYPDKVFRSSVLHS...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UPH85748.1 |surfa...
[2] 1270 MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHS...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UUT03046.1 |surfa...
[3] 1270 MFVFLVLLPLVSSQCVNLRTRTQSYTNSFTRGVYYPDKVFRSSVLHS...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UYE44393.1 |surfa...
[4] 1271 MFVFLVLLPLVSSQCVNFRTRTQLPPAYTNSFTRGVYYPDKVFRSSV...VMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT UYW62681.1 |surfa...
Then I tried to trim the part of each sequence before "QCVN" using this code:
# Loop through the sequences and trim the pattern
for (i in 1:length(fasta)) {
sequence <- as.character(fasta[[i]]) # convert the sequence to a character string
sequence <- gsub(paste("^.*", stop_pattern), stop_pattern, sequence) # remove everything that comes before the pattern
fasta[[i]] <- AAString(sequence) # convert the trimmed sequence back to a AAString object
}
# Write the trimmed sequences to a new FASTA file
writeXStringSet(fasta, "file.fasta")
But it does not work: nothing gets trimmed from the sequences. Is there a way to trim everything before **"QCVN"** and after **"RRAR"**?
1 Answer
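You can use a single `gsub` call for both trims. If the middle part is matched by a capture group, as in `.*(pattern).*`, it is saved to `\\1`, so you can replace the whole match with just that part and keep only the middle.
For details, see this question.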
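As a minimal sketch of the capture-group idea, the example below runs on two shortened, made-up stand-ins for the spike sequences rather than the real file. It assumes the wanted region runs from the first "QCVN" to the first "RRAR" after it, which is why both quantifiers are lazy and `perl = TRUE` is set; the names `seqs`, `pattern`, `has_match` and `trimmed` are illustrative only.

```r
# Hypothetical, shortened stand-ins for the real sequences (not the actual data).
seqs <- c(
  seq1 = "MFVFLVLLPLVSSQCVNLXTRTQSYQTQTKSHRRARSVASQSIIAYT",
  seq2 = "MFVFLVLLPLVSSQCVNFRTRTQLPPAYQTQTKSHRRARSVASQSIIAYT"
)

# ^.*?            lazily skip everything before the first "QCVN"
# (QCVN.*?RRAR)   capture from "QCVN" up to the first following "RRAR"
# .*$             consume the rest so that it is dropped
pattern <- "^.*?(QCVN.*?RRAR).*$"

# sub() returns its input unchanged when the pattern does not match,
# so record which sequences actually contain both anchors.
has_match <- grepl(pattern, seqs, perl = TRUE)

# Replace the whole sequence with the captured middle part.
trimmed <- sub(pattern, "\\1", seqs, perl = TRUE)

print(has_match)
print(trimmed)
```

In the loop from the question, `paste("^.*", stop_pattern)` joins its arguments with a space by default, so the resulting regular expression contains a literal space and cannot match a protein sequence; `stop_pattern` is also never defined in the code shown. With the Biostrings objects from the question, the same `sub()` call could presumably be applied to `as.character(fasta)` and the result wrapped in `AAStringSet()` before calling `writeXStringSet()`.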