Uploading text documents in R

klr1opcd posted on 2023-01-10 in Other

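I am trying to load several text documents into a data frame in R. My desired output is a matrix with two columns:

| File | Content |
| ------ | ------ |
| File A | This is the content. |
| File B | This is the content. |
| File C | This is the content. |

The "Content" column should contain all of the text from each document (10-K reports).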

> setwd("C:/Users/folder")
> folder <- getwd()
> corpus <- Corpus(DirSource(directory = folder, pattern = "*.txt"))

This creates a corpus that I can tokenize, but I have not managed to convert it into a data frame or into my desired output.
Can anyone help me?

2g32fytz

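If you are only working with .txt files and your end goal is a data frame, then I think you can skip the corpus step and simply read in all of your files as a list. The hard part is getting the names of the .txt files into a column called DOCUMENT, but this can be done in base R.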

# make a reproducible example
a <- "this is a test"
b <- "this is a second test"
c <- "this is a third test"
write(a, "a.txt"); write(b, "b.txt"); write(c, "c.txt")

# get working dir
folder <- getwd()

# get names/locations of all files
filelist <- list.files(path = folder, pattern = "\\.txt$", full.names = FALSE)

# read in the files and put them in a list
lst <- lapply(filelist, readLines)

# extract the names of the files without the `.txt` stuff
names(lst) <- filelist
namelist <- fs::path_file(filelist)
namelist <- sub("\\.txt$", "", namelist)

# add a DOCUMENT column holding the original file name to every element of the list
lst <- mapply(cbind, lst, "DOCUMENT" = namelist, SIMPLIFY = FALSE)

# combine into a dataframe
x <- do.call(rbind.data.frame, lst) 

# a small amount of clean-up
rownames(x) <- NULL
names(x)[names(x) == "V1"] <- "CONTENT"
x <- x[,c(2,1)]
x
#>   DOCUMENT               CONTENT
#> 1        a        this is a test
#> 2        b this is a second test
#> 3        c  this is a third test
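
Note that `readLines` returns one element per line, so with the approach above a multi-line file (such as a 10-K report) ends up as several rows rather than one. If you want exactly one row per document, as in the table in the question, a minimal sketch could look like the following. It assumes plain-text files and uses only base R; the folder `txt_demo`, the helper `read_whole_file`, and the data frame name `docs` are made up for illustration.

```r
# Hypothetical demo folder with one multi-line and one single-line file
demo_dir <- file.path(tempdir(), "txt_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line of A", "second line of A"), file.path(demo_dir, "a.txt"))
writeLines("only line of B", file.path(demo_dir, "b.txt"))

# Full paths of every file whose name ends in .txt
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Illustrative helper: read a file and join its lines into a single string
read_whole_file <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# One row per file: the name without extension, and the whole text in one cell
docs <- data.frame(
  DOCUMENT = tools::file_path_sans_ext(basename(files)),
  CONTENT  = vapply(files, read_whole_file, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Joining with `"\n"` keeps the original line breaks inside the cell; use `collapse = " "` instead if you would rather have each document as one continuous line.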
