How do I merge files from different directories in R?

Posted by qmelpv7a on 2023-04-03 in Other

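I have a folder named simulations that contains 100 subfolders, each holding the results of one simulation. Each simulation result is split across four separate files named seq[1].nex, seq[2].nex, seq[3].nex, and seq[4].nex. All of these files share the same format, shown below: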

#NEXUS

Begin data;
Dimensions ntax=5 nchar=55;
Format datatype=Standard symbols="01" missing=? gap=-;
Matrix
L1   1100110010010100010110000110000010000100001011010010110
L2   1101110110011010010000010111000010010000001001010110110
L3   0111111100010100010011000001100011010100010010110011110
L4   1101110110011010010000010111000010010000001001010110110
L5   1101110100110100010110010110001010010100001011010110100
;
End;

The seq files all have the same number of rows (L1 to L5), but the rows differ in length from file to file. For example, seq[2].nex looks like this:

#NEXUS

Begin data;
Dimensions ntax=5 nchar=20;
Format datatype=Standard symbols="012" missing=? gap=-;
Matrix
L1   10000012202011210001
L2   10002112212010210012
L3   10002112212210220022
L4   10002112212010220012
L5   10001112212010222012 
;
End;

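For each of the 100 subfolders, I want to combine seq[1].nex, seq[2].nex, seq[3].nex, and seq[4].nex into a single file, seq.nex, by appending the rows of the later files (2 to 4) to the corresponding rows of the first file. Using the two examples above, the output I want looks like this: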

#NEXUS

Begin data;
Dimensions ntax=5 nchar=55;
Format datatype=Standard symbols="01" missing=? gap=-;
Matrix
L1   110011001001010001011000011000001000010000101101001011010000012202011210001
L2   110111011001101001000001011100001001000000100101011011010002112212010210012
L3   011111110001010001001100000110001101010001001011001111010002112212210220022
L4   110111011001101001000001011100001001000000100101011011010002112212010220012
L5   110111010011010001011001011000101001010000101101011010010001112212010222012
;
End;

I then want to repeat this process so that the files are merged in each of the 100 subfolders.

Answer #1 (oalqel3c)

Here is one approach:

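library(data.table)

# Path to the simulations folder
pth_to_simulations <- "simulations"

# List all subfolders, with full names
fldrs <- dir(pth_to_simulations, full.names = TRUE)

# Function that takes a subfolder, reads its seq[i].nex files, and concatenates the rows
read_sims <- function(fldr) {
  # Only pick up the input files, not a previously written seq.nex
  files <- dir(fldr, pattern = "^seq\\[[0-9]+\\]\\.nex$", full.names = TRUE)
  # Skip the 6 header lines and read the 5 matrix rows; force character columns
  # so that the digit strings keep their leading zeros
  sims <- lapply(files, fread, skip = 6, nrows = 5, header = FALSE,
                 colClasses = "character")
  # Give each sequence column a unique name so repeated merges do not clash
  for (i in seq_along(sims)) setnames(sims[[i]], "V2", paste0("seq", i))
  # merge() joins two tables at a time, so fold it over the list
  sims <- Reduce(function(x, y) merge(x, y, by = "V1"), sims)
  # Paste the sequence columns together, row by row
  sims[, .(V2 = paste0(unlist(.SD), collapse = "")), by = V1]
}

# Apply the function to each of the fldrs in `simulations`
lapply(fldrs, read_sims)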

If the example files are in simulations/sim1, the result looks like this:

[[1]]
   V1                                                                          V2
1: L1 110011001001010001011000011000001000010000101101001011010000012202011210001
2: L2 110111011001101001000001011100001001000000100101011011010002112212010210012
3: L3 011111110001010001001100000110001101010001001011001111010002112212210220022
4: L4 110111011001101001000001011100001001000000100101011011010002112212010220012
5: L5 110111010011010001011001011000101001010000101101011010010001112212010222012

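This output is a list of length 1 because there is only one folder (sim1). Your output will be a list of length 100, where each element contains the concatenated sequences for one subfolder.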
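The question also asks for the merged rows to be written back to a seq.nex file in each subfolder, which the list above does not do. The following is a minimal base R sketch of one way that could work, not a definitive implementation. The helper name `merge_nexus` and the `out_name` argument are hypothetical. It also assumes that the `Dimensions` and `Format` lines should be updated to describe the merged matrix (the desired output in the question keeps the first file's header unchanged), that every file lists the same taxon labels, and that each matrix block is terminated by a line containing only `;`.

```r
# Hypothetical helper: merge the seq[i].nex files of one folder into seq.nex.
merge_nexus <- function(fldr, out_name = "seq.nex") {
  files <- list.files(fldr, pattern = "^seq\\[[0-9]+\\]\\.nex$", full.names = TRUE)
  if (length(files) == 0) return(invisible(NULL))
  # Order by the number inside the brackets, not alphabetically
  idx <- as.integer(sub("^seq\\[([0-9]+)\\]\\.nex$", "\\1", basename(files)))
  files <- files[order(idx)]

  # Read the matrix rows of one file as a named vector (label -> sequence)
  read_matrix <- function(f) {
    lines <- trimws(readLines(f, warn = FALSE))
    start <- which(tolower(lines) == "matrix")[1] + 1
    semis <- which(lines == ";")
    end <- min(semis[semis > start]) - 1
    rows <- lines[start:end]
    rows <- rows[nzchar(rows)]
    parts <- strsplit(rows, "[[:space:]]+")
    setNames(vapply(parts, `[`, character(1), 2),
             vapply(parts, `[`, character(1), 1))
  }

  mats <- lapply(files, read_matrix)
  taxa <- names(mats[[1]])
  # Assumption: every file contains the same labels as the first one
  stopifnot(all(vapply(mats, function(m) setequal(names(m), taxa), logical(1))))

  # Concatenate per label, matching rows by name rather than by position
  merged <- Reduce(function(a, b) paste0(a, b[taxa]), mats)

  # Assumption: the header is rewritten to describe the merged matrix
  symbols <- setdiff(sort(unique(unlist(strsplit(merged, "")))), c("?", "-"))
  out <- c(
    "#NEXUS", "", "Begin data;",
    sprintf("Dimensions ntax=%d nchar=%d;", length(taxa), nchar(merged[1])),
    sprintf("Format datatype=Standard symbols=\"%s\" missing=? gap=-;",
            paste(symbols, collapse = "")),
    "Matrix",
    paste(taxa, merged, sep = "   "),
    ";", "End;"
  )
  out_path <- file.path(fldr, out_name)
  writeLines(out, out_path)
  invisible(out_path)
}

# Usage (assumes a `simulations` folder in the working directory):
# invisible(lapply(list.dirs("simulations", recursive = FALSE), merge_nexus))
```

Matching rows by label instead of by position means a file whose rows are in a different order is still merged correctly. If the original header must be preserved verbatim, the `Dimensions` and `Format` lines could instead be copied from seq[1].nex.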
