R: memory problems when computing TF-IDF data

Asked by ymzxtsji on 2023-06-27

Introduction

I'm struggling with text classification on a large dataset of tweets, and I'd be grateful if someone could point me in the right direction.
The big picture is that I need to train a classifier that distinguishes between two classes on a huge dataset (up to 6 million texts). I've been doing it in the recipes framework and then running a glmnet lasso through tidymodels. The specific problem is that I run out of memory when computing tf-idf.

Question

How should I approach this? I could, in principle, obtain all the tf-idf values manually in batches and then manually combine them into a sparse matrix object. That sounds tedious, and surely someone has run into this problem before and solved it? Another option is Spark, but it is well beyond my current abilities and probably overkill for a one-off task. Or am I missing something, and existing tools can already handle this?
Specifically, I run into two kinds of problems when running the following code (the variables should be self-explanatory, but I'll provide the full reproducible code later):

recipe <-
  recipe(Class ~ text, data = corpus) %>% 
  step_tokenize(text) %>%
  step_stopwords(text) %>% 
  step_tokenfilter(text, max_tokens = m) %>% 
  step_tfidf(text) %>% 
  prep()

If corpus is too large or m is too big, RStudio crashes. If they are moderately sized, it throws a *warning*:

In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.2 GiB

I couldn't find anything about this online and I don't understand it. Why is it trying to coerce something from sparse to dense? That is bound to cause trouble with any large dataset. Am I doing something wrong? If it's preventable, maybe I'd have better luck with my full dataset?
Or is step_tfidf hopeless at coping with 6 million observations and no cap on the maximum number of tokens?
P.S. tm and tidytext can't even begin to handle this.
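
For intuition, here is a small illustration (not from the original post) of why that sparse-to-dense coercion is so costly; the matrix dimensions are made up, but a mostly-zero document-term matrix that is tiny as a dgCMatrix balloons once coerced to a dense matrix:

library(Matrix)

# hypothetical document-term matrix: 10,000 docs x 2,000 tokens, 0.1% non-zero
m_sparse <- rsparsematrix(nrow = 10000, ncol = 2000, density = 0.001)
m_dense  <- as.matrix(m_sparse)  # the kind of coercion the warning reports

format(object.size(m_sparse), units = "MB")  # roughly 0.3 MB
format(object.size(m_dense),  units = "MB")  # roughly 160 MB for the same values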

Full code

Here is a reproducible example of what I'm trying to do. This code builds a corpus of 5M+ tweet-length texts made of random words:

library(tidymodels)
library(dplyr)
library(stringr)
library(textrecipes)
library(hardhat)

url <- "https://gutenberg.org/cache/epub/2701/pg2701-images.html"
words <- readLines(url, encoding = "UTF-8") %>% str_extract_all('\\w+\\b') %>% unlist()
x <- rnorm(n = 6000000, mean = 18, sd = 14)
x <- x[x > 0]

corpus <- 
  lapply(x, function(i) {
    c('text' = paste(sample(words, size = i, replace = TRUE), collapse = ' '))
  }) %>% 
  bind_rows() %>% 
  mutate(ID = 1:n(), Class = factor(sample(c(0, 1), n(), replace = TRUE)))

The corpus looks like this:

> corpus
# A tibble: 5,402,638 × 3
   text                                                                                                                                       ID Class
   <chr>                                                                                                                                   <int> <fct>
 1 included Fast at can aghast me some as article and ship things is                                                                           1 1    
 2 him to quantity while became man was childhood it that Who in on his the is                                                                 2 1    
 3 no There a pass are it in evangelical rather in direst the in a even reason to Yes and the this unconditional his clear other thou all…     3 0    
 4 this would against his You disappeared have summit the vagrant in fine inland is scrupulous signifies that come the the buoyed and of …     4 1    
 5 slippery the Judge ever life Moby But i will after sounding ship like p he Like                                                             5 1    
 6 at can hope running                                                                                                                         6 1    
 7 Jeroboam even there slow though thought though I flukes yarn swore called p oarsmen with sort who looked and sharks young Radney s          7 1    
 8 not if rocks ever lantern go last though at you white his that remains of primal Starbuck sans you steam up with against                    8 1    
 9 Nostril as p full the furnish are nor made towards except bivouacks p blast how never now are here of difference it whalemen s much th…     9 1    
10 and p multitudinously body Archive fifty was of Greenland                                                                                  10 0    
# ℹ 5,402,628 more rows
# ℹ Use `print(n = ...)` to see more rows

By itself it takes up about 1 GB of RAM.
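(A quick way to check that figure, not in the original post:)

format(object.size(corpus), units = "GB")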
I follow the standard modelling workflow, which I give here in full just for complete information.

# prep
corpus_split <- initial_split(corpus, strata = Class) # split
corpus_train <- training(corpus_split)
corpus_test <- testing(corpus_split)
folds <- vfold_cv(corpus_train) #k-fold cv prep
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix") # use sparse matrices
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20) # hyperparameter calibration

# recipe
recipe <-
  recipe(Class ~ text, data = corpus_train) %>% 
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = 'twclid') %>% 
  step_tokenfilter(text, max_tokens = 10000) %>% 
  step_tfidf(text)

# lasso model
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% # tuning the penalty hyperparameter
  set_mode("classification") %>%
  set_engine("glmnet")

# workflow
sparse_wf <- workflow() %>%
  add_recipe(recipe, blueprint = sparse_bp) %>%
  add_model(lasso_spec)

# fit
sparse_rs <- tune_grid(
  sparse_wf,
  folds,
  grid = smaller_lambda
)

zwghvu4y1#

Sadly, there isn't much you can do right now within tidymodels. The {tidymodels} set of packages is built around using {tibble} as its common data container. This works well in many situations, except for sparse data, as in your case.
When a recipe is used inside a workflow, the data needs to be passed to parsnip as a tibble. This requires the data to be non-sparse, which in your case makes the data size blow up enormously! That is, if you have 6 million observations and only 2,000 distinct tokens, you end up with 96 GB of data.
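
As a back-of-the-envelope check of that figure (a dense numeric matrix stores 8 bytes per cell):

6e6 * 2000 * 8 / 1e9   # ≈ 96, i.e. about 96 GB for the dense representation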
This is something I hope to see happen at some point (I'm the author of {textrecipes} and one of the developers on the tidymodels team), but it is currently outside my control, because we need to find a way to have sparse data inside tibbles.


y0u0uwnf2#

In case anyone needs it, I'll summarize my findings here.
There are two problems: (i) creating the tf-idf matrix requires a lot of memory, and (ii) tidymodels currently only accepts tibbles as incoming data, as pointed out by EmilHvitfeldt. The solution is to generate the tf-idf dataset in a more memory-friendly way, sparsify it in the usual way, and then use a model that supports sparse data directly.
The biggest problem is that the existing solutions for computing tf-idf (I tried tm and tidytext) are very memory-inefficient. What I did instead:
1. The caveat is that I have enough memory to load all the texts into memory in the first place.
2. Store the tokenized texts as an Arrow dataset with no grouping and max_rows_per_file = 1000000 (this number can be tailored to your memory requirements).
3. Compute the quantities needed for tf-idf and store them as separate Arrow datasets: word counts per text, text lengths, and the number of documents each word appears in (these map directly onto the tf-idf formula recalled below).
4. Loop over the files of one of the datasets, left-joining in the data from the other two (this happens in memory, but since each file only contains a fraction of the total observations, it is not a problem).
5. Save each result manually as a Parquet file in a "final" dataset.
6. Open that dataset, collect it, and cast it to a sparse matrix with tidytext::cast_sparse.
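
For reference, the quantities stored in step 3 map onto the usual tf-idf definition, exactly as implemented in the mutate() call in the code below (n = count of a word in a text, TotalWords = length of that text, Documents = number of texts containing the word, N = total number of texts):

tf(word, text)     = n / TotalWords
idf(word)          = log(N / Documents)
tf_idf(word, text) = tf(word, text) * idf(word)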

library(arrow)     # write_dataset(), open_dataset(), read_parquet(), write_parquet()
library(dplyr)
library(stringr)   # str_replace()
library(parallel)  # mclapply(), detectCores()
library(tidytext)  # cast_sparse()

# `corpus` here is the tokenized corpus: one row per token, with columns TextID and word
corpus %>% 
  write_dataset('tokenized_texts', max_rows_per_file = 1000000)

ds <- open_dataset('tokenized_texts')

# N is the total number of texts
N <- ds %>% 
  summarize(N = max(TextID)) %>% 
  collect() %>% 
  pull(N)

# this computes the number of times a word appears within a given text
ds.n <- 
  ds %>% 
  group_by(TextID, word) %>% 
  count() %>% 
  collect()

ds.n %>% 
  ungroup() %>% 
  write_dataset('tokenized_arrow/ds.n', max_rows_per_file = 1000000)
rm(ds.n)
gc()

# this computes the total number of words in the dataset
ds.total <- 
  ds %>%   
  group_by(TextID) %>% 
  count(name = 'TotalWords') %>% 
  collect()
ds.total %>% 
  ungroup() %>% 
  write_dataset('tokenized_arrow/ds.total', max_rows_per_file = 1000000)
rm(ds.total)
gc()

# this computes the number of times a word appears (at least once) in texts
ds.docs <- 
  ds %>% 
  group_by(TextID, word) %>% 
  summarize() %>% 
  group_by(word) %>% 
  count(name = 'Documents') %>% 
  collect()
ds.docs %>% 
  ungroup() %>% 
  write_dataset('tokenized_arrow/ds.docs', max_rows_per_file = 1000000)
rm(ds.docs)
gc()

# Load the prepared datasets
ds.n <- open_dataset('tokenized_arrow/ds.n')
ds.total <- open_dataset('tokenized_arrow/ds.total')
ds.docs <- open_dataset('tokenized_arrow/ds.docs')

# Loop through the files (mclapply is overkill here, this is a super fast step).
# Assumes the directory "tokenized_arrow/final" exists.

files <- list.files('tokenized_arrow/ds.n', full.names = TRUE)
mclapply(files, mc.cores = parallel::detectCores() - 2, FUN = function(file) {
  outfile <- str_replace(file, 'ds\\.n', 'final')
  
  df <- read_parquet(file)
  ids <- unique(df$TextID)
  words <- unique(df$word)
  df %>% 
    left_join(
      ds.total %>% 
        filter(TextID %in% ids) %>% 
        collect()) %>% 
    left_join(
      ds.docs %>%
        filter(word %in% words) %>%
        collect()
    ) %>% 
    mutate(tf = n / TotalWords,
           idf = log(N / Documents),
           tf_idf = tf * idf) %>% 
    write_parquet(outfile)
  return(NULL)
}) %>% invisible()

# sparsify
m <- 
  open_dataset('tokenized_arrow/final/') %>% 
  collect() %>% 
  cast_sparse(TextID, word, tf_idf)
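
From here, the sparse matrix can be passed to a model that accepts sparse input directly, as mentioned above. A minimal sketch (not part of the original answer), assuming `labels` is a hypothetical factor of class labels in the same row order as m:

library(glmnet)

fit <- cv.glmnet(
  x = m,               # sparse dgCMatrix of tf-idf values from cast_sparse()
  y = labels,          # hypothetical: one class label per row of m
  family = "binomial", # binary classification
  alpha = 1            # alpha = 1 gives the lasso penalty
)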
