How to reshape a quanteda corpus by the values of a docvar?

Posted by siv3szwd 11 months ago in Other


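I am working with a text corpus in R using the quanteda package. Suppose the corpus contains some texts that have been split into sentences. In principle, corpus_reshape() makes it easy to switch between sentences and the original documents as the unit of analysis. But what if I want to reshape the corpus according to the values of a specific variable in the docvars?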
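For reference, the round trip between documents and sentences mentioned above can be sketched as follows. This is a minimal illustration rather than part of the original question; the object names `toy_corpus`, `sentences` and `documents` and the two toy texts are made up for the example.

```r
library(quanteda)

# Two toy documents (hypothetical example texts)
toy_corpus <- corpus(c(
  "First document. It has two sentences.",
  "Second document. It has three sentences. This is the last one."
))

# Documents -> sentences: each sentence becomes its own document
sentences <- corpus_reshape(toy_corpus, to = "sentences")
ndoc(sentences)

# Sentences -> documents: sentences are merged back into their source documents
documents <- corpus_reshape(sentences, to = "documents")
ndoc(documents)
```

Note that corpus_reshape() only switches between these predefined units; it does not take a grouping variable, which is why the question below needs a different tool.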

# Load quanteda
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.

# Create a simulated corpus
texts <- c(
  "Document one text. It has several sentences. Here is another sentence.",
  "Document two is slightly longer. It has more sentences. This is the third sentence. And here is the fourth."
)

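# Create a dummy variable with one value per sentence (3 + 4 = 7 sentences)
docvars <- data.frame(dummy_var = c(1,0,1,0,1,0,1))

# Create the corpus and split it into sentences
my_corpus <- corpus_reshape(corpus(texts), to = "sentences")

# Attach the dummy variable to the sentence-level corpus
docvars(my_corpus) <- docvars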

# Reshape to document parts based on dummy_var?
...

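The desired output is a new corpus in which each document is split into two parts according to the dummy variable, giving four documents in total in this case.
Can anyone suggest an efficient way to do this in quanteda?
My idea is to split the documents into "halves" > tokenize > dfm in preparation for scaling (e.g. Wordfish), to see whether the group a sentence belongs to makes any difference. Specifically, my question is: is there more left-right difference when dummy_var == 0?
Please let me know if this approach has any flaws.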

Tags: r

Source: https://stackoverflow.com/questions/77683846/how-to-reshape-quanteda-corpus-by-values-of-a-docvar

1 Answer


70gysomp1#

I think I have found a solution to my problem that combines tidyverse and quanteda. It looks like this:

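library(tidyverse)

# Build a grouping key from the original document id and the dummy value,
# e.g. "text1_0", "text1_1"
docvars(my_corpus) <- docvars(my_corpus) %>%
  mutate(index = paste0(docid(my_corpus), "_", dummy_var))

# Concatenate the sentences that share the same key into one document
corpus_group(my_corpus, index)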

However, I am still not entirely sure whether my approach is correct.
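One way to check the result is to inspect the grouped corpus directly. The sketch below rebuilds the example from the question and builds the grouping key with base R only, so tidyverse is not required; the names `sent_corpus`, `grouped` and `half_dfm` are illustrative and not from the original post. It also shows the tokenize > dfm step described in the question. Fitting a scaling model (for example `textmodel_wordfish()` from the separate quanteda.textmodels package) is left out, since four tiny documents would not give a meaningful estimate.

```r
library(quanteda)

texts <- c(
  "Document one text. It has several sentences. Here is another sentence.",
  "Document two is slightly longer. It has more sentences. This is the third sentence. And here is the fourth."
)

# Sentence-level corpus with one dummy value per sentence
sent_corpus <- corpus_reshape(corpus(texts), to = "sentences")
docvars(sent_corpus, "dummy_var") <- c(1, 0, 1, 0, 1, 0, 1)

# Group sentences by original document id and dummy value
grouped <- corpus_group(
  sent_corpus,
  groups = paste0(docid(sent_corpus), "_", sent_corpus$dummy_var)
)

# Inspect the result: number of documents, their names, and their texts
ndoc(grouped)
docnames(grouped)
as.character(grouped)

# Tokenize and build the document-feature matrix for later scaling
half_dfm <- dfm(tokens(grouped, remove_punct = TRUE))
ndoc(half_dfm)
```

If the grouping worked as intended, there should be one document per combination of original document and dummy value, and each should contain only the sentences carrying that value.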
