R script to batch-process all .tsv files in a directory, adding a new column built from other columns

xjreopfe  posted on 2023-02-14  in  Other

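I want to combine several steps into a single R script that does the following:
1. Load one .tsv file after another (there are hundreds of them in one directory)
2. Fuse 3 specific columns in each file to form a new column "Fusion"
3. Write the result back to the old .tsv file (so that I don't end up with hundreds of new files)
The steps below work, but I'm afraid they are very clumsy (I'm really not good at coding), and they are not batched: each file has to be entered one at a time.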

test <- read.table(
   "1.tsv",
   sep="\t", header=TRUE)

test$Fusion <- paste0(test$amino_acid,test$v_gene,test$j_gene)

write.table(test, file = "1.tsv", append = FALSE, quote = TRUE, sep = "\t",
                 eol = "\n", na = "NA", dec = ".", row.names = TRUE,
                 col.names = TRUE, qmethod = c("escape", "double"),
                 fileEncoding = "")

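As you can see, the files have to be entered manually one at a time, and the data frame "test" also seems redundant (?).
It would be great if someone could put this into one script that simply uses R's working directory, goes through the files one by one, adds the new "Fusion" column, writes the new .tsv file, and moves on to the next.
Thanks for your help!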

9cbw7uwe #1

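Here is what I would do to loop your code over every file in the pwd, keeping your approach. Make sure you run this script from the directory that contains the target .tsv files.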

#!/usr/bin/Rscript

print(getwd()) ## print the pwd to the standard output to ensure that you are in the
               ## right directory
files<-list.files(".",pattern="\\.tsv$") ## List all files in the pwd that end in .tsv
                                         ## (pattern is a regular expression, not a glob)
cols2fuse<-c("amino_acid","v_gene","j_gene") ## Parametrized the columns to fuse
prefix<-"fused-" ## Include this so that you don't overwrite your old files while testing
                 ## you can always delete them later

fuseColumns<-function(file,cols2fuse){
    test<-read.table(file,sep="\t",header=TRUE)
    test$Fusion <- do.call(paste0,unname(as.list(test[cols2fuse]))) ## fuse the columns named in cols2fuse
    write.table(test,
                file =paste0(prefix,file), ## only works if performed in the pwd;
                                        ## otherwise you may end up with
                                        ## something like:
                                        ## "fused-/home/username/file/1.tsv"
                sep = "\t",
                quote = TRUE, ## this will surround each output
                              ## value in quotes,
                              ## which may not be desirable
                row.names = TRUE, ## Do you really want the row names
                                  ## included?
                col.names = TRUE)
    file ## return the file that has been edited (this will show up in stdout)
}

lapply(files,fuseColumns,cols2fuse) ## Apply fuseColumns to all .tsv fusing
                                    ## columns with names that
                                    ## match those in cols2fuse

Sample input

amino_acid  v_gene  j_gene
amino1  ENS0001001  ENS0002001
amino2  ENS0003001  ENS0004001
amino3  ENS0005001  ENS0006001
amino4  ENS0007001  ENS0008001

is transformed into

"amino_acid"    "v_gene"    "j_gene"    "Fusion"
"1" "amino1"    "ENS0001001"    "ENS0002001"    "amino1ENS0001001ENS0002001"
"2" "amino2"    "ENS0003001"    "ENS0004001"    "amino2ENS0003001ENS0004001"
"3" "amino3"    "ENS0005001"    "ENS0006001"    "amino3ENS0005001ENS0006001"
"4" "amino4"    "ENS0007001"    "ENS0008001"    "amino4ENS0007001ENS0008001"

To remove the quotes around each element, set quote to FALSE; to remove the numbers at the start of each row, set row.names to FALSE.

write.table(test,
            file =paste0(prefix,file),                  
            sep = "\t",
            quote = FALSE,
            row.names = FALSE,                       
            col.names = TRUE)

The output now looks like this

amino_acid  v_gene  j_gene  Fusion
amino1  ENS0001001  ENS0002001  amino1ENS0001001ENS0002001
amino2  ENS0003001  ENS0004001  amino2ENS0003001ENS0004001
amino3  ENS0005001  ENS0006001  amino3ENS0005001ENS0006001
amino4  ENS0007001  ENS0008001  amino4ENS0007001ENS0008001
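
The script above assumes it is run from the directory that holds the files, because the prefix is pasted onto whatever list.files returns. If the files live somewhere else, one possible variation is sketched below. The function name fuse_dir and its arguments dir, cols, and prefix are made up for this illustration and are not part of the original answer. It applies the prefix to the file name only, and it skips files that already carry the prefix so that a second run does not fuse its own output. Passing prefix = "" would overwrite each file in place, which is what the question asked for; it is probably wise to try that on a copy of the data first.

```r
# Hypothetical helper (an illustration, not the original answer's code):
# fuse the columns named in `cols` for every .tsv file in `dir` and write
# each result next to its source file.
fuse_dir <- function(dir, cols, prefix = "fused-") {
  # full.names = TRUE returns paths that include the directory part
  files <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)
  # Skip output from an earlier run so it is not processed a second time
  if (nzchar(prefix)) {
    files <- files[!startsWith(basename(files), prefix)]
  }
  for (f in files) {
    df <- read.table(f, sep = "\t", header = TRUE)
    # Paste the selected columns together row by row
    df$Fusion <- do.call(paste0, unname(as.list(df[cols])))
    # Put the prefix on the file name only, never on the directory part
    out <- file.path(dirname(f), paste0(prefix, basename(f)))
    write.table(df, file = out, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = TRUE)
  }
  invisible(files)  # return the processed paths without printing them
}

# Small demonstration in a throwaway directory
demo_dir <- tempfile("fuse_demo")
dir.create(demo_dir)
writeLines(c("amino_acid\tv_gene\tj_gene",
             "amino1\tENS0001001\tENS0002001"),
           file.path(demo_dir, "1.tsv"))
fuse_dir(demo_dir, c("amino_acid", "v_gene", "j_gene"))
list.files(demo_dir)  # now contains both 1.tsv and fused-1.tsv
```

A plain for loop is used here instead of lapply only because nothing needs to be collected from each iteration; either form should work.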

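I'm not sure what you mean by redundant. Do you want to get rid of the three source columns and keep only the fused column?
You can use something like the following to identify the redundant columns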

redundantCols<-unlist(sapply(colnames(test),`%in%`,cols2fuse))
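
If dropping the source columns is indeed what is wanted, the logical vector above can be negated and used as a column index. The following is only a sketch on a one-row data frame built by hand for illustration; the name fusedOnly is made up here.

```r
cols2fuse <- c("amino_acid", "v_gene", "j_gene")

# One-row stand-in for a file that has already been read and fused
test <- data.frame(amino_acid = "amino1",
                   v_gene = "ENS0001001",
                   j_gene = "ENS0002001",
                   stringsAsFactors = FALSE)
test$Fusion <- paste0(test$amino_acid, test$v_gene, test$j_gene)

# TRUE for every column that was used to build Fusion
# (colnames(test) %in% cols2fuse gives the same values, without names)
redundantCols <- unlist(sapply(colnames(test), `%in%`, cols2fuse))

# Keep only the columns that are not redundant;
# drop = FALSE keeps the result a data frame even with one column left
fusedOnly <- test[, !redundantCols, drop = FALSE]
colnames(fusedOnly)
```

Writing fusedOnly instead of test with write.table would then produce files that contain only the Fusion column, plus any other columns that were not fused.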
