R语言 如何在没有包的情况下读取一个有多个序列的fasta?

yqlxgs2m  于 2022-12-25  发布在  其他
关注(0)|答案(1)|浏览(190)

I have a sequence looks like this

>gi|1079586|gb|AAA82053.1| polymerase [Human endogenous retrovirus K]
IQKTSGRWRMLTDLRAVNAVIQPKGHLQPGLPSPAMIPKDWPLIIIDLKDCFFTIPLEEQDFEKFAFTIP
AINNKEPATRF
>gi|138362618|gb|ABO76712.1| polymerase [Kodoko virus]
SERESNSESLSKALSLTKCLSAALKNLCFYSEESPTSYTSVGPDSGRLKFALSYKEQVGGNRELYIGDLR
TKMFTRLVEDYFESFTSFFHGSCLNDEKEFENAILSMTVNVRQGYLSYSMDHSKWGPM
>gi|387147|gb|AAA37560.1| polymerase [Mus musculus]
PSLQAHLQALQAVQREVWKPLAAAYQDQQDQPVIPHPFRVGDTVWVRRHQTKNLEPRWKGPYTVLLTTPT
ALKVDGIAAWIHAAHVKAATTPPAGTASGPTWKVQRSQNPLKIRLTRGAP
>gi|74474929|dbj|BAE44450.1| polymerase [Hepatitis B virus]
MPLSYQHFRKLLLLDDEAGPLEEELPRLADEGLNRRVAEDLNLGNLNVSIPWTHKVGNFTGLYSSTVPVF
NPEWQTPSFPNIHLQEDIINRCQQYVGPLTVNEKRRLKLIMPARFYPNLTKYLPLDKGIKPYYPEHAVNH
YFKTRHYLHTLWKAGILYKRETTRSASFCGSPYSWEQELQHGRLVFQTSTRHGDESFCSQSSGILSRSPV
GPCVRSQLKQSRLGLQPQQGSLARGKSGRSGSIRARVHPTTRRSFGVEPSGSGHIDNSASSASSCLHQSA
VRKTAYSHLSTSKRQSSSGHAMELHHIPPSSARSQSEGPIFSCWWLQFRNSKPCSDYCLTHIVNLLEDWG
PCTEHGEHNIRIPRTPARVTGGVFLVDKNPHNTTESRLVVDFSQFSRGSTHVSWPKFAVPNLQSLTNLLS
SNLSWLSLDVSAAFYHIPLHPAAMPHLLVGSSGLPRYVARLSSTSRNINYQHGTMQDLHDSCSRNLYVSL
LLLYKTFGRKLHLYSHPIILGFRKIPMGVGLSPFLLGQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
QHLESLFTSITNFLLSLGIHLNPNKTKRWGYSLNFMGYVIGSWGTLPQEHIVLKLKQCFRKLPVNRPIDW
KVCQRIVGLLGFAAPFTQCGYPALMPLYACIQAKQAFTFSPTYKAFLCQQYLHLYPVGRQRSGLCQVFAD
ATPTGWGLAIGHRRMRGTFVAPLPIHTAELLAACFARSRSGAKLIGTDNSVVLSRKYTSFPWLLGCAANW
ILRGTSFVYVPSALNPADDPSRGRLGLYRPLLHLPFRPTTGRTSLYAVSPSVPSHLPDRVHFASPLHVAW
RPP

My task is to read this file, but without using a package. When reading in, I want to make sure that the file is actually in FastA format. The content of the FastA file is to be output in a suitable structure, so that header and sequence can be retrieved individually.
I have tried this first for a sequence that looks like this:

con <- "URL where the file is stored"

data <- readLines(con)

list(description = data[1],
     sequence = paste(data[-1], collapse = ""))

$description [1] ">gi|1079586|gb|AAA82053.1| polymerase [Human endogenous retrovirus K]"
$sequence [1] "IQKTSGRWRMLTDLRAVNAVIQPKGHLQPGLPSPAMIPKDWPLIIIDLKDCFFTIPLEEQDFEKFAFTIPAINNKEPATRF>gi|138362618|gb|ABO76712.1| polymerase [Kodoko virus]SERESNSESLSKALSLTKCLSAALKNLCFYSEESPTSYTSVGPDSGRLKFALSYKEQVGGNRELYIGDLRTKMFTRLVEDYFESFTSFFHGSCLNDEKEFENAILSMTVNVRQGYLSYSMDHSKWGPM>gi|387147|gb|AAA37560.1| polymerase [Mus musculus]PSLQAHLQALQAVQREVWKPLAAAYQDQQDQPVIPHPFRVGDTVWVRRHQTKNLEPRWKGPYTVLLTTPTALKVDGIAAWIHAAHVKAATTPPAGTASGPTWKVQRSQNPLKIRLTRGAP>gi|74474929|dbj|BAE44450.1| polymerase [Hepatitis B virus]MPLSYQHFRKLLLLDDEAGPLEEELPRLADEGLNRRVAEDLNLGNLNVSIPWTHKVGNFTGLYSSTVPVFNPEWQTPSFPNIHLQEDIINRCQQYVGPLTVNEKRRLKLIMPARFYPNLTKYLPLDKGIKPYYPEHAVNHYFKTRHYLHTLWKAGILYKRETTRSASFCGSPYSWEQELQHGRLVFQTSTRHGDESFCSQSSGILSRSPVGPCVRSQLKQSRLGLQPQQGSLARGKSGRSGSIRARVHPTTRRSFGVEPSGSGHIDNSASSASSCLHQSAVRKTAYSHLSTSKRQSSSGHAMELHHIPPSSARSQSEGPIFSCWWLQFRNSKPCSDYCLTHIVNLLEDWGPCTEHGEHNIRIPRTPARVTGGVFLVDKNPHNTTESRLVVDFSQFSRGSTHVSWPKFAVPNLQSLTNLLSSNLSWLSLDVSAAFYHIPLHPAAMPHLLVGSSGLPRYVARLSSTSRNINYQHGTMQDLHDSCSRNLYVSLLLLYKTFGRKLHLYSHPIILGFRKIPMGVGLSPFLLGQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSVQHLESLFTSITNFLLSLGIHLNPNKTKRWGYSLNFMGYVIGSWGTLPQEHIVLKLKQCFRKLPVNRPIDWKVCQRIVGLLGFAAPFTQCGYPALMPLYACIQAKQAFTFSPTYKAFLCQQYLHLYPVGRQRSGLCQVFADATPTGWGLAIGHRRMRGTFVAPLPIHTAELLAACFARSRSGAKLIGTDNSVVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLYRPLLHLPFRPTTGRTSLYAVSPSVPSHLPDRVHFASPLHVAWRPP"
But now I want to output all sequences in the above file separately like this. For this I had the following approach:

data <- paste(readLines(con), collapse = "")
strsplit(data, ">|]")

Problem with this is that the separator is no longer in the separated sequences, the output looks very strange (so the list has very large gaps between the elements) and I can't change the name of the list elements yet.
Any suggestions on how to do this better?

deikduxw

deikduxw1#

我的方法是将其拆分为以“〉”开头的片段,然后需要在描述(1行)和序列(所有其他行)之间拆分每个配对

library(tidyverse)

text <- ">gi|1079586|gb|AAA82053.1| polymerase [Human endogenous retrovirus K]
IQKTSGRWRMLTDLRAVNAVIQPKGHLQPGLPSPAMIPKDWPLIIIDLKDCFFTIPLEEQDFEKFAFTIP
AINNKEPATRF
>gi|138362618|gb|ABO76712.1| polymerase [Kodoko virus]
SERESNSESLSKALSLTKCLSAALKNLCFYSEESPTSYTSVGPDSGRLKFALSYKEQVGGNRELYIGDLR
TKMFTRLVEDYFESFTSFFHGSCLNDEKEFENAILSMTVNVRQGYLSYSMDHSKWGPM
>gi|387147|gb|AAA37560.1| polymerase [Mus musculus]
PSLQAHLQALQAVQREVWKPLAAAYQDQQDQPVIPHPFRVGDTVWVRRHQTKNLEPRWKGPYTVLLTTPT
ALKVDGIAAWIHAAHVKAATTPPAGTASGPTWKVQRSQNPLKIRLTRGAP
>gi|74474929|dbj|BAE44450.1| polymerase [Hepatitis B virus]
MPLSYQHFRKLLLLDDEAGPLEEELPRLADEGLNRRVAEDLNLGNLNVSIPWTHKVGNFTGLYSSTVPVF
NPEWQTPSFPNIHLQEDIINRCQQYVGPLTVNEKRRLKLIMPARFYPNLTKYLPLDKGIKPYYPEHAVNH
YFKTRHYLHTLWKAGILYKRETTRSASFCGSPYSWEQELQHGRLVFQTSTRHGDESFCSQSSGILSRSPV
GPCVRSQLKQSRLGLQPQQGSLARGKSGRSGSIRARVHPTTRRSFGVEPSGSGHIDNSASSASSCLHQSA
VRKTAYSHLSTSKRQSSSGHAMELHHIPPSSARSQSEGPIFSCWWLQFRNSKPCSDYCLTHIVNLLEDWG
PCTEHGEHNIRIPRTPARVTGGVFLVDKNPHNTTESRLVVDFSQFSRGSTHVSWPKFAVPNLQSLTNLLS
SNLSWLSLDVSAAFYHIPLHPAAMPHLLVGSSGLPRYVARLSSTSRNINYQHGTMQDLHDSCSRNLYVSL
LLLYKTFGRKLHLYSHPIILGFRKIPMGVGLSPFLLGQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
QHLESLFTSITNFLLSLGIHLNPNKTKRWGYSLNFMGYVIGSWGTLPQEHIVLKLKQCFRKLPVNRPIDW
KVCQRIVGLLGFAAPFTQCGYPALMPLYACIQAKQAFTFSPTYKAFLCQQYLHLYPVGRQRSGLCQVFAD
ATPTGWGLAIGHRRMRGTFVAPLPIHTAELLAACFARSRSGAKLIGTDNSVVLSRKYTSFPWLLGCAANW
ILRGTSFVYVPSALNPADDPSRGRLGLYRPLLHLPFRPTTGRTSLYAVSPSVPSHLPDRVHFASPLHVAW
RPP"

(pairings <- unlist(str_split(string = text,">")))

pairings_clean <- lapply(pairings, function(x) {
  if(nchar(x)==0) return(NULL)
  split_text <- str_split(x, fixed("\n"),n = 2)
  description <- split_text[[1]][1]
  sequence = split_text[[1]][2]
  #clean up sequence further by dropping \n ? 
  if(identical(sequence,character(0)))
    sequence <- ""

  data.frame(description = description, sequence = sequence)
}) |> bind_rows() |> tibble()

相关问题