Looping over txt files and creating a word-count variable in a data frame takes too long

Posted by oalqel3c on 2023-01-06 in Other

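I tested this on 3 txt files, and creating a variable for each file in the data frame worked fine. However, I have around 8,000 txt files, and with that many the run takes far too long. Is there a way to speed it up?
Thanks in advance.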

library(dplyr)
library(stringr)
library(pdftools)
library(readtext)

setwd('E:/txt/GEM_TXT_2022')
files <- list.files(path='E:/txt/GEM_TXT_2022', pattern= ".txt")

filelength<-length(files)
word_count.TOTAL<-seq(1,filelength)
word_count.TOTAL<-as.data.frame(word_count.TOTAL)
head(word_count.TOTAL)

for(j in 1:length(files)){
  P1<-readtext(files[j]) %>% 
    str_replace_all("\\t","") %>% #replace tab
    str_replace_all("\n"," ")  %>% #replace line break
    str_replace_all("      "," ")%>% 
    str_replace_all("     "," ")%>% 
    str_replace_all("    "," ")%>% 
    str_replace_all("   "," ")%>% 
    str_replace_all("  "," ")%>% 
    str_replace_all("[:digit:]","")%>%  
    str_replace_all("[:punct:]","")%>%  
    str_trim()
 
  for(i in 1: length(files)){
    word_count.TOTAL[j,]<-str_count(P1)
  }
  
  }
head(word_count.TOTAL)

word_count.TOTAL2<-as.data.frame(word_count.TOTAL)
rownames(word_count.TOTAL2)<-files
 head(word_count.TOTAL2)

Answer #1 (iqxoj9l9)

Using {vroom} to read the text files quickly, {stringr} for string manipulation, and {tidytext} for tokenizing may help. For example:

  • Load the packages, set a path constant, and read in all text files from the source directory:
library(dplyr)    # mutate(), group_by(), summarise() used below
library(vroom)
library(stringr)
library(tidytext)

source_path <- 'path/to/your/text/files'
file_names <- list.files(source_path, pattern = '\\.txt$') # anchor to the file extension
file_paths <- file.path(source_path, file_names)

all_lines <- vroom(file_paths,
                   delim = '___', ## see comment (1)
                   col_names = 'raw_text',
                   id = 'source_file') ## (2)

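(1) Make sure to use a delimiter that does not occur in the documents, so that each line stays in a single column.
(2) Stores the file path in the column "source_file".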
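
Comment (1) depends on the delimiter really being absent from every file. A minimal base-R sketch of how that could be checked beforehand is below. The helper name `delimiter_is_unused` and the two demo files are made up for illustration and are not part of the answer; on a large corpus it may be enough to run the check on a sample of files, since it reads each file once more.

```r
# Hypothetical helper (not from the answer): returns TRUE if `delim`
# never occurs, as a literal string, in any of the given files.
delimiter_is_unused <- function(paths, delim) {
  for (path in paths) {
    lines <- readLines(path, warn = FALSE)
    # fixed = TRUE treats `delim` as plain text, not as a regular expression
    if (any(grepl(delim, lines, fixed = TRUE))) return(FALSE)
  }
  TRUE
}

# Demo on two throwaway files with made-up content.
demo_dir <- file.path(tempdir(), "delim_check_demo")
dir.create(demo_dir, showWarnings = FALSE)
ok_file    <- file.path(demo_dir, "ok.txt")
clash_file <- file.path(demo_dir, "clash.txt")
writeLines("plain text, no triple underscore", ok_file)
writeLines("fill in the blank: ___", clash_file)

print(delimiter_is_unused(ok_file, "___"))
print(delimiter_is_unused(c(ok_file, clash_file), "___"))
```

If the check fails, picking a different, rarer delimiter string is probably the simplest fix.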

  • Do some string cleaning and unnest by token (here: words), which creates a separate row for each word:
all_lines <- 
  all_lines |> 
  mutate(raw_text = str_remove_all(raw_text, '[\\d]')) |>
  unnest_tokens(words, raw_text)
  • Group the data by source file and count the rows to get the word count per file:
all_lines |>
  group_by(source_file) |>
  summarise(word_count = n())
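
For comparison, here is a base-R sketch of the same idea without extra packages. Two things in the question's loop are worth noting: the inner `for (i in ...)` repeats the same assignment once per file, so the work grows with the square of the number of files, and `str_count()` called without a pattern counts characters rather than words. The sketch reads each file exactly once. The helper `count_words`, the demo directory and the file contents are assumptions made for illustration, and "word" here simply means a whitespace-separated token after digits and punctuation are removed, which may differ slightly from what `unnest_tokens()` produces.

```r
# Hypothetical helper (not from the answer): count whitespace-separated
# words in one file after dropping digits and punctuation, roughly
# mirroring the cleaning steps in the question.
count_words <- function(path) {
  text <- paste(readLines(path, warn = FALSE), collapse = " ")
  text <- gsub("[[:digit:][:punct:]]", "", text)
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  sum(nzchar(tokens))  # empty files yield 0
}

# Demo on two throwaway files with made-up content.
demo_dir <- file.path(tempdir(), "word_count_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("one two\tthree", "4 five, six."), file.path(demo_dir, "a.txt"))
writeLines("hello   world", file.path(demo_dir, "b.txt"))

file_paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Single pass: every file is read and counted once, with no nested loop.
word_counts <- data.frame(
  source_file = basename(file_paths),
  word_count  = unname(vapply(file_paths, count_words, integer(1)))
)
print(word_counts)
```

To try this on real data, `demo_dir` would be replaced by the actual folder; whether it beats the {vroom} approach on 8,000 files is something to measure rather than assume.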
