Loop takes too long to read txt files and create a word count variable in a data frame

Posted by oalqel3c on 2023-01-06 in Other

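I did a test run with 3 txt files, and creating a variable in the data frame for each file worked well. However, I have around 8,000 txt files and the loop takes far too long. Is there a way to speed it up?
Thanks in advance.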

library(dplyr)
library(stringr)
library(pdftools)
library(readtext)

setwd('E:/txt/GEM_TXT_2022')
files <- list.files(path='E:/txt/GEM_TXT_2022', pattern= ".txt")

filelength<-length(files)
word_count.TOTAL<-seq(1,filelength)
word_count.TOTAL<-as.data.frame(word_count.TOTAL)
head(word_count.TOTAL)

for(j in 1:length(files)){
  P1<-readtext(files[j]) %>% 
    str_replace_all("\\t","") %>% #replace tab
    str_replace_all("\n"," ")  %>% #replace line break
    str_replace_all("      "," ")%>% 
    str_replace_all("     "," ")%>% 
    str_replace_all("    "," ")%>% 
    str_replace_all("   "," ")%>% 
    str_replace_all("  "," ")%>% 
    str_replace_all("[:digit:]","")%>%  
    str_replace_all("[:punct:]","")%>%  
    str_trim()
 
  for(i in 1: length(files)){
    word_count.TOTAL[j,]<-str_count(P1)
  }
  
  }
head(word_count.TOTAL)

word_count.TOTAL2<-as.data.frame(word_count.TOTAL)
rownames(word_count.TOTAL2)<-files
 head(word_count.TOTAL2)

Source: https://stackoverflow.com/questions/75006205/loop-takes-too-long-to-read-txt-files-and-create-a-word-count-variable-in-a-data

1 Answer

iqxoj9l91#

Reading the text files quickly with {vroom}, doing the string manipulation with {stringr}, and tokenizing with {tidytext} may help. For example:

Load the packages, set a path constant, and read in all text files from the source directory:

library(vroom)
library(dplyr)     ## provides mutate(), group_by() and summarise()
library(stringr)
library(tidytext)

source_path <- 'path/to/your/text/files'
file_names <- list.files(source_path, pattern = '\\.txt$') ## anchor the match at the end of the file name
file_paths <- file.path(source_path, file_names)

all_lines <- vroom(file_paths,
                   delim = '___', ## see comment (1)
                   col_names = 'raw_text',
                   id = 'source_file') ## (2)

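(1) Make sure to use a delimiter that does not occur in the documents, so that each line stays in a single column.
(2) Stores the file path in the column "source_file".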
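
To see what notes (1) and (2) mean in practice, here is a minimal sketch on two throwaway files in a temporary directory. The file names, their contents, and the `demo_` variable names are invented for illustration and are not part of the original answer.

```r
library(vroom)

# Two small throwaway files (contents are invented for this demo)
tmp_dir <- file.path(tempdir(), "vroom_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("first line, with a comma", "second line"),
           file.path(tmp_dir, "a.txt"))
writeLines("only\tone line", file.path(tmp_dir, "b.txt"))

demo_paths <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# '___' never occurs in the files, so every line lands in 'raw_text';
# 'source_file' records which file each line came from
demo_lines <- vroom(demo_paths,
                    delim = "___",
                    col_names = "raw_text",
                    id = "source_file")

# Optional: keep only the file name instead of the full path
demo_lines$source_file <- basename(demo_lines$source_file)
demo_lines
```

The result should have one row per line of text, so commas and tabs inside a line do not split it into extra columns. Reducing the path with `basename()` gives labels comparable to the row names used in the question.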

Do some string cleaning and unnest by token (here: words), which creates a separate row for each word:

all_lines <- 
  all_lines |> 
  mutate(raw_text = str_remove_all(raw_text, '[\\d]')) |>
  unnest_tokens(words, raw_text)
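
As a rough illustration of what this step produces, the sketch below applies the same two calls to a tiny hand-made tibble. The `demo` data and its contents are assumptions made up for this example.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hand-made stand-in for the lines read from disk (invented data)
demo <- tibble(
  source_file = c("a.txt", "a.txt", "b.txt"),
  raw_text    = c("Hello, world! 2022", "A second line.", "Just 3 words here")
)

demo_words <- demo |>
  mutate(raw_text = str_remove_all(raw_text, "[\\d]")) |>  # drop digits first
  unnest_tokens(words, raw_text)                           # one row per word

demo_words
```

With its default settings `unnest_tokens()` lowercases the tokens and strips punctuation, so no separate punctuation or whitespace clean-up is needed. Digits are removed beforehand because the tokenizer would otherwise keep numbers as tokens.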

Group the data by source file and count the rows to get the word count per file:

all_lines |>
  group_by(source_file) |>
  summarise(word_count = n())
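
As a cross-check on the per-file counts, here is a sketch that stays closer to the question's original layout (one row per file, file names as row names) but visits each file exactly once. It is a swapped-in alternative, not the answer's method: `count_words` is a hypothetical helper, the demo files are invented, and words are counted with `str_count()` and `boundary("word")` from {stringr} rather than with {tidytext}. Note that the inner `for (i in ...)` loop in the question repeats the same assignment once per file, so dropping it already removes a large share of the work.

```r
library(stringr)

# Hypothetical helper: number of words in one file, ignoring digits
count_words <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- str_remove_all(lines, "\\d")
  sum(str_count(lines, boundary("word")))
}

# Two small throwaway files (contents are invented for this demo)
tmp_dir <- file.path(tempdir(), "wordcount_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("Hello, world! 2022", "A second line."),
           file.path(tmp_dir, "a.txt"))
writeLines("Just 3 words here", file.path(tmp_dir, "b.txt"))

demo_paths <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# One pass over the files, no inner loop
word_counts <- data.frame(
  word_count = vapply(demo_paths, count_words, integer(1), USE.NAMES = FALSE),
  row.names  = basename(demo_paths)
)
word_counts
```

One difference worth knowing: a file that contains no words gets a count of 0 here, whereas it would not appear at all in the grouped summary, because it contributes no rows after tokenizing.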

Answered on 2023-01-06
