I followed this tutorial to create a document-feature matrix with features defined by a dictionary, but what I get now is a two-column output giving the document ID and the combined frequency of all features in the dictionary.
library(lubridate)
library(quanteda)
## subset data
item7_corpus_subset <- item_7_corpus |>
filter(year(filing_date) == year_data) |>
head(100) ## for testing only; comment out once the code works
# tokenize
item7_tokens <- tokens(item7_corpus_subset,
                       what = "word",
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_numbers = TRUE,
                       remove_url = TRUE) |>
  tokens_ngrams(n = 1:3)
## count words from dictionary
item7_doc_dict <- item7_tokens |>
dfm(tolower = TRUE) |>
dfm_lookup(dictionary = cyber_dict, levels = 1:3)
print(item7_doc_dict)
## Document-feature matrix of: 100 documents, 1 feature (94.00% sparse) and 13 docvars.
## features
## docs cyber_dict
## 1000015_10K_1999_0000912057-00-014793.txt 0
## 1000112_10K_1999_0000930661-00-000704.txt 0
## 1000181_10K_1999_0001000181-00-000001.txt 0
## 1000227_10K_1999_0000950116-00-000643.txt 0
## 1000228_10K_1999_0000889812-00-001326.txt 0
## 1000230_10K_1999_0001005150-00-000103.txt 0
## [ reached max_ndoc ... 94 more documents ]
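The single `cyber_dict` column appears because `dfm_lookup()` collapses every match into its dictionary key. A minimal self-contained illustration of that behavior, using a toy dictionary and toy documents (all names and values below are invented for the example):

```r
library(quanteda)

## toy dictionary with one key and several patterns
## (hypothetical stand-in for cyber_dict)
toy_dict <- dictionary(list(cyber_dict = c("hack*", "malware", "data breach")))

toy_tokens <- tokens(c(d1 = "A hacker spread malware after the data breach.",
                       d2 = "No incident was reported."),
                     remove_punct = TRUE) |>
  tokens_ngrams(n = 1:2, concatenator = " ")

## dfm_lookup() sums every match under its dictionary key,
## so the result has exactly one feature, named "cyber_dict"
toy_dfm <- toy_tokens |>
  dfm(tolower = TRUE) |>
  dfm_lookup(dictionary = toy_dict)
print(toy_dfm)
```

Running this prints a dfm with one feature per dictionary key, not per keyword, which is exactly the shape shown in the output above.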
I want to see the frequency of each keyword, not the total frequency across all the keywords I have. I am trying to reproduce what the tutorial generates:
1 Answer

gk7wooem1:
There are three errors:
tokens_ngrams()
Using exclusive = FALSE before the dictionary analysis includes all the other words. Your code should be:
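The answer's own code block did not survive the page scrape. A plausible reconstruction, continuing with the question's `item7_tokens` and `cyber_dict` objects and assuming the goal is a per-keyword count rather than a per-key total, is to select the matching tokens and build the dfm from them directly instead of calling `dfm_lookup()`:

```r
library(quanteda)

## keep only the tokens that match the dictionary patterns,
## then count each matched keyword as its own feature
item7_doc_dict <- item7_tokens |>
  tokens_select(pattern = cyber_dict) |>
  dfm(tolower = TRUE)

print(item7_doc_dict)
```

This is a sketch, not the answer's verbatim code: `tokens_select()` with a dictionary as `pattern` keeps every matching token, so the resulting dfm has one column per matched keyword, which is what the question asks for.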