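I tried this on 3 txt files, and creating one variable per file in the data frame works well. But I have around 8,000 txt files and it takes far too long. Is there a way to speed it up?
Thanks in advance.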
library(dplyr)
library(stringr)
library(pdftools)
library(readtext)

setwd('E:/txt/GEM_TXT_2022')
files <- list.files(path = 'E:/txt/GEM_TXT_2022', pattern = "\\.txt$")
filelength <- length(files)

# one row per file, to be filled with that file's word count
word_count.TOTAL <- seq(1, filelength)
word_count.TOTAL <- as.data.frame(word_count.TOTAL)
head(word_count.TOTAL)

for (j in 1:length(files)) {
  # readtext() returns a data frame; the document body is in its text column
  P1 <- readtext(files[j])$text %>%
    str_replace_all("\\t", "") %>%        # remove tabs
    str_replace_all("\n", " ") %>%        # replace line breaks with spaces
    str_replace_all(" {2,}", " ") %>%     # collapse runs of spaces
    str_replace_all("[:digit:]", "") %>%  # remove digits
    str_replace_all("[:punct:]", "") %>%  # remove punctuation
    str_trim()
  # inner loop over all files; i is not used in the body
  for (i in 1:length(files)) {
    # count whitespace-separated words in file j
    word_count.TOTAL[j, ] <- str_count(P1, "\\S+")
  }
}

head(word_count.TOTAL)
word_count.TOTAL2 <- as.data.frame(word_count.TOTAL)
rownames(word_count.TOTAL2) <- files
head(word_count.TOTAL2)
1 Answer
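Using {vroom} for fast access to the text files, {stringr} for string manipulation, and {tidytext} for tokenizing may help. For example, when reading the files:
(1) Make sure to use a delimiter that does not occur in the documents, so that everything stays in a single column.
(2) Store the file path in a "source_file" column.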
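A minimal sketch of that reading step, under a few assumptions of mine: the two sample files written to a temporary folder stand in for the real directory, `|` is assumed never to occur in the texts, and the column name `text` is an arbitrary choice. Only the "source_file" column name comes from the notes above.

```r
library(vroom)

# Hypothetical sample data: two small files in a temporary folder
txt_dir <- file.path(tempdir(), "txt_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("First line of file a", "Second line 123"),
           file.path(txt_dir, "a.txt"))
writeLines("Only line of file b", file.path(txt_dir, "b.txt"))

files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# Read all files in one call instead of looping over them
the_texts <- vroom(
  files,
  delim = "|",          # (1) assumed not to occur in the documents
  col_names = "text",   # the files have no header row
  col_types = "c",      # keep every line as character
  id = "source_file",   # (2) file path of each line
  progress = FALSE
)
```

Each row of `the_texts` is one line of one file. For the real data, `list.files()` would point at the actual folder instead of the temporary one. If the texts contain double quotes, vroom's `quote` argument may also need adjusting.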
Then unnest by token (here: words), which creates a separate row for each word.
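A rough sketch of the tokenizing and counting step. The small hand-built table is hypothetical and stands in for the data read from the files; the column names `text`, `word` and `word_count` are my own choices. Digits and punctuation are stripped first, as in the question, and counting rows per file afterwards is one possible way to get the word count.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for the table of lines read from the files
the_texts <- tibble(
  source_file = c("a.txt", "a.txt", "b.txt"),
  text = c("Hello world, 123 times!",
           "Second line\there.",
           "Just three words")
)

word_counts <- the_texts %>%
  # remove digits and punctuation, as in the question
  mutate(text = str_replace_all(text, "[[:digit:][:punct:]]", "")) %>%
  # one row per word; whitespace (spaces, tabs) separates tokens
  unnest_tokens(output = word, input = text, token = "words") %>%
  # number of word rows per file
  count(source_file, name = "word_count")
```

Because the string operations are vectorised over the whole table, no explicit loop over the files is needed.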