R: reading and merging large tables on a compute cluster

Asked by dhxwm5r4 on 2023-04-03

I need to merge several different large tables (up to 10 GB each) into a single table. To do this I am using a compute cluster running Linux, with more than 50 cores and more than 10 GB of RAM.
I always end up with the error message "Cannot allocate vector of size X Mb". Since commands such as memory.limit(size=X) are Windows-specific and are not accepted, I cannot find a way to merge my large tables.
Any suggestion is welcome!
This is the code I use:

library(parallel)

# use all but one of the available cores
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

# list all .txt files in the working directory
temp = list.files(pattern="*.txt$")
gc()

The error occurs at this point:

# read all files in parallel on the cluster
myfiles = parLapply(cl, temp, function(x) read.csv(x,
                                        header=TRUE,
                                        sep=";",
                                        stringsAsFactors=F,
                                        encoding = "UTF-8",
                                        na.strings = c("NA","99","")))

# row-bind all data frames into one big table
myfiles.final = do.call(rbind, myfiles)

Answer 1 (ih99xse1):
You can just use merge, for example:

mergedTable <- merge(table1, table2, by = "dbSNP_RSID")

If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge (sketched after the reproducible example below).

Reproducible example:

x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
  matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
    sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
  matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
    sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")
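
A minimal sketch of that renaming, assuming the two tables share overlapping column names such as Sample1 (the suffix names _table1/_table2 below are just illustrative): you can either pass custom suffixes to merge, or rename the default .x/.y columns after the merge.

# Option 1: tell merge which suffixes to append to overlapping column names
mergedDf <- merge(x, y, by = "dbSNP_RSID", suffixes = c("_table1", "_table2"))

# Option 2: merge first, then rename the default .x/.y columns
mergedDf <- merge(x, y, by = "dbSNP_RSID")
colnames(mergedDf) <- sub("\\.x$", "_table1", colnames(mergedDf))
colnames(mergedDf) <- sub("\\.y$", "_table2", colnames(mergedDf))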

Answer 2 (sqxo8psd):

One approach is to use Python and dask. A dask dataframe is stored mostly on disk rather than in RAM, which lets you work with data that is larger than your RAM, and it can help you run the computation with clever parallelization. A good tutorial on handling big data can be found in a Kaggle post, which may help you as well. I would also recommend looking at the dask documentation on performance. To be clear: if your data fits in RAM, a regular R data.frame or pandas DataFrame will be faster.
Here is a dask solution. It assumes that your tables have named columns, so the concat operation can align them. If there are any other special requirements about the data that need to be considered, please add them to your question.

import dask.dataframe as dd
import glob

tmp = glob.glob("*.txt")

dfs = []
for f in tmp:
    # read the large tables
    ddf = dd.read_table(f)
    # make a list of all the dfs
    dfs.append(ddf)

# row-wise concat of the data
dd_all = dd.concat(dfs)
# repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)

# save the data 
# provide list of one name if you don't want the partition number appended on
dd_all.to_csv(['all_big_files.tsv'], sep = '\t')

If you just want to stack all the tables on top of each other, you can also do that in plain Python (or with linux cat/paste).

with open('all_big_files.tsv', 'w') as O:
    file_number = 0
    for f in tmp:
        with open(f, 'r') as F:
            if file_number == 0:
                # keep the header line of the first file
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            else:
                # skip the header line of every following file
                F.readline()
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            file_number += 1
