R script to batch all .tsv files in a directory to have a new column with information from other columns

Posted by xjreopfe on 2023-02-14 in Other


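I want to combine several steps into a single R script that does the following:
1. Load one .tsv file after another (there are several hundred in one directory)
2. Fuse 3 specific columns in each file into a new column, "Fusion"
3. Write the result back to the original .tsv file (so that I do not end up with several hundred new files)
The steps below work, but I am afraid they are very clumsy (I am really not good at coding), and they are not batched: each file has to be put in one by one.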

test <- read.table(
   "1.tsv",
   sep="\t", header=TRUE)

test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)

write.table(test, file = "1.tsv", append = FALSE, quote = TRUE, sep = "\t",
                 eol = "\n", na = "NA", dec = ".", row.names = TRUE,
                 col.names = TRUE, qmethod = c("escape", "double"),
                 fileEncoding = "")

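As you can see, the files have to be entered manually one at a time, and the data frame "test" also seems redundant (?).
It would be great if someone could put this into one script that simply uses R's working directory, goes through the files one by one, adds a new "Fusion" column, writes the new .tsv file, and moves on.
Thank you for your help!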

r

Source: https://stackoverflow.com/questions/40998385/r-script-to-batch-all-tsv-files-in-a-directory-to-have-a-new-column-with-inform

1 Answer


9cbw7uwe1#

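Here is what I would do to loop your approach over every file in the pwd. Make sure to run this script in the directory that contains the target .tsv files.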

#!/usr/bin/Rscript

print(getwd()) ## print the pwd to the standard output to ensure that you are in the
               ## right directory
files<-list.files(".",pattern="\\.tsv$") ## List all files in the pwd that end in .tsv
                                         ## (pattern is a regular expression, not a glob)
cols2fuse<-c("amino_acid","v_gene","j_gene") ## Parametrized the columns to fuse
prefix<-"fused-" ## Include this so that you don't overwrite your old files while testing
                 ## you can always delete them later
files<-files[!startsWith(files,prefix)] ## Skip output files left over from an earlier run

fuseColumns<-function(file,cols2fuse){
    test<-read.table(file,sep="\t",header=TRUE)
    ## Paste the columns named in cols2fuse together, row by row
    test$Fusion <- do.call(paste0,unname(test[cols2fuse]))
    write.table(test,
                file =paste0(prefix,file), ## only works if performed in the pwd,
                                        ## otherwise you may end up with
                                        ## something like:
                                        ## "fused-/home/username/file/1.tsv"
                sep = "\t",
                quote = TRUE, ## this will surround each output
                              ## value in quotes.
                              ## this may not be desirable
                row.names = TRUE, ## Do you really want the row names
                                  ## included?
                col.names = TRUE)
    file ## return the file that has been edited (this will show up in stdout)
}

lapply(files,fuseColumns,cols2fuse) ## Apply fuseColumns to all .tsv files, fusing
                                    ## the columns whose names
                                    ## match those in cols2fuse
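
One detail worth checking when listing the files: the `pattern` argument of `list.files` is a regular expression, not a shell glob. The following is a minimal sketch, assuming a throwaway directory and made-up file names (`demo_dir`, `notes.tsv.bak`, and `mytsv.txt` are illustrative only), of two ways to select only names that really end in `.tsv`.

```r
# Build a temporary directory with a few hypothetical file names.
demo_dir <- tempfile("tsv-demo-")
dir.create(demo_dir)
invisible(file.create(file.path(
    demo_dir, c("1.tsv", "2.tsv", "notes.tsv.bak", "mytsv.txt")
)))

# Option 1: write the regular expression by hand.
# "\\." is a literal dot and "$" anchors the match at the end of the name.
strict <- list.files(demo_dir, pattern = "\\.tsv$")

# Option 2: let glob2rx() (from the utils package) translate a glob.
from_glob <- list.files(demo_dir, pattern = glob2rx("*.tsv"))

print(strict)
print(from_glob)
```

Both calls should return only `1.tsv` and `2.tsv`, leaving out the backup file and the `.txt` file.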

Sample input

amino_acid  v_gene  j_gene
amino1  ENS0001001  ENS0002001
amino2  ENS0003001  ENS0004001
amino3  ENS0005001  ENS0006001
amino4  ENS0007001  ENS0008001

is transformed into

"amino_acid"    "v_gene"    "j_gene"    "Fusion"
"1" "amino1"    "ENS0001001"    "ENS0002001"    "amino1ENS0001001ENS0002001"
"2" "amino2"    "ENS0003001"    "ENS0004001"    "amino2ENS0003001ENS0004001"
"3" "amino3"    "ENS0005001"    "ENS0006001"    "amino3ENS0005001ENS0006001"
"4" "amino4"    "ENS0007001"    "ENS0008001"    "amino4ENS0007001ENS0008001"

To remove the quotes around each element, set quote to FALSE; to remove the numbers at the start of each row, set row.names to FALSE.

write.table(test,
            file =paste0(prefix,file),                  
            sep = "\t",
            quote = FALSE,
            row.names = FALSE,                       
            col.names = TRUE)

The output now looks like this

amino_acid  v_gene  j_gene  Fusion
amino1  ENS0001001  ENS0002001  amino1ENS0001001ENS0002001
amino2  ENS0003001  ENS0004001  amino2ENS0003001ENS0004001
amino3  ENS0005001  ENS0006001  amino3ENS0005001ENS0006001
amino4  ENS0007001  ENS0008001  amino4ENS0007001ENS0008001
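
The question asked for the result to go back into the original files rather than into new ones. Once the output looks right, one way to do that is sketched below. The helper name `fuse_in_place`, its `new_col` argument, and the `.tmp` suffix are assumptions made for illustration, not part of the answer above. The sketch writes to a temporary file next to the original and only then renames it over the original, so a failed write should not leave a half-written .tsv behind.

```r
# Hypothetical helper: add the fused column and overwrite the file in place.
fuse_in_place <- function(file, cols, new_col = "Fusion") {
    dat <- read.table(file, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

    # Stop early if an expected column is absent.
    missing_cols <- setdiff(cols, colnames(dat))
    if (length(missing_cols) > 0) {
        stop("Missing column(s) in ", file, ": ",
             paste(missing_cols, collapse = ", "))
    }

    # Paste the chosen columns together, row by row.
    dat[[new_col]] <- do.call(paste0, unname(dat[cols]))

    # Write next to the original first, then replace the original.
    tmp <- paste0(file, ".tmp")
    write.table(dat, file = tmp, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = TRUE)
    if (!file.rename(tmp, file)) {
        stop("Could not replace ", file)
    }
    invisible(file)
}

# Usage on a throwaway directory containing one small demo file.
demo_dir <- tempfile("fuse-demo-")
dir.create(demo_dir)
demo_file <- file.path(demo_dir, "1.tsv")
writeLines(c("amino_acid\tv_gene\tj_gene",
             "amino1\tENS0001001\tENS0002001"), demo_file)

for (f in list.files(demo_dir, pattern = "\\.tsv$", full.names = TRUE)) {
    fuse_in_place(f, c("amino_acid", "v_gene", "j_gene"))
}

result <- read.table(demo_file, sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE)
print(result)
```

Because the `Fusion` column is recomputed from the three source columns each time, running the helper twice on the same file should leave it unchanged. Keeping a backup of the directory before the first in-place run is still a sensible precaution.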

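I am not sure what you mean by redundant. Do you want to drop the three source columns and keep only the fused column?
You could use something like the following to identify the redundant columns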

redundantCols<-unlist(sapply(colnames(test),`%in%`,cols2fuse))
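
If dropping the source columns is indeed the goal, a logical vector like this can be used to subset the data frame. The sketch below assumes a small hand-built `test` data frame mirroring the sample input. `fusedOnly` is a made-up name, and `colnames(test) %in% cols2fuse` is used as a shorter way to get the same TRUE/FALSE values.

```r
# Small stand-in for a data frame read from one of the .tsv files.
test <- data.frame(
    amino_acid = c("amino1", "amino2"),
    v_gene     = c("ENS0001001", "ENS0003001"),
    j_gene     = c("ENS0002001", "ENS0004001"),
    stringsAsFactors = FALSE
)
cols2fuse <- c("amino_acid", "v_gene", "j_gene")
test$Fusion <- do.call(paste0, unname(test[cols2fuse]))

# TRUE for every column that went into the fusion.
redundantCols <- colnames(test) %in% cols2fuse

# Keep only the non-redundant columns; drop = FALSE keeps a data frame
# even when a single column remains.
fusedOnly <- test[, !redundantCols, drop = FALSE]
print(fusedOnly)
```

Writing `fusedOnly` instead of `test` with `write.table` would then produce files that contain only the `Fusion` column.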
