I have a sequence looks like this
>gi|1079586|gb|AAA82053.1| polymerase [Human endogenous retrovirus K]
IQKTSGRWRMLTDLRAVNAVIQPKGHLQPGLPSPAMIPKDWPLIIIDLKDCFFTIPLEEQDFEKFAFTIP
AINNKEPATRF
>gi|138362618|gb|ABO76712.1| polymerase [Kodoko virus]
SERESNSESLSKALSLTKCLSAALKNLCFYSEESPTSYTSVGPDSGRLKFALSYKEQVGGNRELYIGDLR
TKMFTRLVEDYFESFTSFFHGSCLNDEKEFENAILSMTVNVRQGYLSYSMDHSKWGPM
>gi|387147|gb|AAA37560.1| polymerase [Mus musculus]
PSLQAHLQALQAVQREVWKPLAAAYQDQQDQPVIPHPFRVGDTVWVRRHQTKNLEPRWKGPYTVLLTTPT
ALKVDGIAAWIHAAHVKAATTPPAGTASGPTWKVQRSQNPLKIRLTRGAP
>gi|74474929|dbj|BAE44450.1| polymerase [Hepatitis B virus]
MPLSYQHFRKLLLLDDEAGPLEEELPRLADEGLNRRVAEDLNLGNLNVSIPWTHKVGNFTGLYSSTVPVF
NPEWQTPSFPNIHLQEDIINRCQQYVGPLTVNEKRRLKLIMPARFYPNLTKYLPLDKGIKPYYPEHAVNH
YFKTRHYLHTLWKAGILYKRETTRSASFCGSPYSWEQELQHGRLVFQTSTRHGDESFCSQSSGILSRSPV
GPCVRSQLKQSRLGLQPQQGSLARGKSGRSGSIRARVHPTTRRSFGVEPSGSGHIDNSASSASSCLHQSA
VRKTAYSHLSTSKRQSSSGHAMELHHIPPSSARSQSEGPIFSCWWLQFRNSKPCSDYCLTHIVNLLEDWG
PCTEHGEHNIRIPRTPARVTGGVFLVDKNPHNTTESRLVVDFSQFSRGSTHVSWPKFAVPNLQSLTNLLS
SNLSWLSLDVSAAFYHIPLHPAAMPHLLVGSSGLPRYVARLSSTSRNINYQHGTMQDLHDSCSRNLYVSL
LLLYKTFGRKLHLYSHPIILGFRKIPMGVGLSPFLLGQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
QHLESLFTSITNFLLSLGIHLNPNKTKRWGYSLNFMGYVIGSWGTLPQEHIVLKLKQCFRKLPVNRPIDW
KVCQRIVGLLGFAAPFTQCGYPALMPLYACIQAKQAFTFSPTYKAFLCQQYLHLYPVGRQRSGLCQVFAD
ATPTGWGLAIGHRRMRGTFVAPLPIHTAELLAACFARSRSGAKLIGTDNSVVLSRKYTSFPWLLGCAANW
ILRGTSFVYVPSALNPADDPSRGRLGLYRPLLHLPFRPTTGRTSLYAVSPSVPSHLPDRVHFASPLHVAW
RPP
My task is to read this file, but without using a package. When reading in, I want to make sure that the file is actually in FastA format. The content of the FastA file is to be output in a suitable structure, so that header and sequence can be retrieved individually.
I have tried this first for a sequence that looks like this:
con <- "URL where the file is stored"
data <- readLines(con)
list(description = data[1],
sequence = paste(data[-1], collapse = ""))
$description [1] ">gi|1079586|gb|AAA82053.1| polymerase [Human endogenous retrovirus K]"
$sequence [1] "IQKTSGRWRMLTDLRAVNAVIQPKGHLQPGLPSPAMIPKDWPLIIIDLKDCFFTIPLEEQDFEKFAFTIPAINNKEPATRF>gi|138362618|gb|ABO76712.1| polymerase [Kodoko virus]SERESNSESLSKALSLTKCLSAALKNLCFYSEESPTSYTSVGPDSGRLKFALSYKEQVGGNRELYIGDLRTKMFTRLVEDYFESFTSFFHGSCLNDEKEFENAILSMTVNVRQGYLSYSMDHSKWGPM>gi|387147|gb|AAA37560.1| polymerase [Mus musculus]PSLQAHLQALQAVQREVWKPLAAAYQDQQDQPVIPHPFRVGDTVWVRRHQTKNLEPRWKGPYTVLLTTPTALKVDGIAAWIHAAHVKAATTPPAGTASGPTWKVQRSQNPLKIRLTRGAP>gi|74474929|dbj|BAE44450.1| polymerase [Hepatitis B virus]MPLSYQHFRKLLLLDDEAGPLEEELPRLADEGLNRRVAEDLNLGNLNVSIPWTHKVGNFTGLYSSTVPVFNPEWQTPSFPNIHLQEDIINRCQQYVGPLTVNEKRRLKLIMPARFYPNLTKYLPLDKGIKPYYPEHAVNHYFKTRHYLHTLWKAGILYKRETTRSASFCGSPYSWEQELQHGRLVFQTSTRHGDESFCSQSSGILSRSPVGPCVRSQLKQSRLGLQPQQGSLARGKSGRSGSIRARVHPTTRRSFGVEPSGSGHIDNSASSASSCLHQSAVRKTAYSHLSTSKRQSSSGHAMELHHIPPSSARSQSEGPIFSCWWLQFRNSKPCSDYCLTHIVNLLEDWGPCTEHGEHNIRIPRTPARVTGGVFLVDKNPHNTTESRLVVDFSQFSRGSTHVSWPKFAVPNLQSLTNLLSSNLSWLSLDVSAAFYHIPLHPAAMPHLLVGSSGLPRYVARLSSTSRNINYQHGTMQDLHDSCSRNLYVSLLLLYKTFGRKLHLYSHPIILGFRKIPMGVGLSPFLLGQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSVQHLESLFTSITNFLLSLGIHLNPNKTKRWGYSLNFMGYVIGSWGTLPQEHIVLKLKQCFRKLPVNRPIDWKVCQRIVGLLGFAAPFTQCGYPALMPLYACIQAKQAFTFSPTYKAFLCQQYLHLYPVGRQRSGLCQVFADATPTGWGLAIGHRRMRGTFVAPLPIHTAELLAACFARSRSGAKLIGTDNSVVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLYRPLLHLPFRPTTGRTSLYAVSPSVPSHLPDRVHFASPLHVAWRPP"
But now I want to output all sequences in the above file separately like this. For this I had the following approach:
data <- paste(readLines(con), collapse = "")
strsplit(data, ">|]")
Problem with this is that the separator is no longer in the separated sequences, the output looks very strange (so the list has very large gaps between the elements) and I can't change the name of the list elements yet.
Any suggestions on how to do this better?
1条答案
按热度按时间deikduxw1#
我的方法是将其拆分为以“〉”开头的片段,然后需要在描述(1行)和序列(所有其他行)之间拆分每个配对