R: quanteda dfm (via dfm_lookup) does not show term columns as in the example

Asked by jfewjypa on 2023-03-15

I followed this tutorial to create a document-feature matrix with dictionary-defined features, but the output I get has only two columns: the document ID and the combined frequency of all features in the dictionary.

library(dplyr)      # filter() comes from dplyr (not loaded in the original snippet)
library(lubridate)  # year()
library(quanteda)

## subset data 
item7_corpus_subset <- item_7_corpus |> 
    filter(year(filing_date) == year_data) |> 
    head(100) ## edit here, comment if codes work well
    
# tokenize
item7_tokens <- tokens(item7_corpus_subset, 
                       what = "word", 
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_numbers = TRUE,
                       remove_url = TRUE) |> 
    tokens_ngrams(n = 1:3)
       
## count words from dictionary
item7_doc_dict <- item7_tokens |> 
    dfm(tolower = TRUE) |> 
    dfm_lookup(dictionary = cyber_dict, levels = 1:3)
print(item7_doc_dict)
## Document-feature matrix of: 100 documents, 1 feature (94.00% sparse) and 13 docvars.
## features
## docs                                        cyber_dict
## 1000015_10K_1999_0000912057-00-014793.txt          0
## 1000112_10K_1999_0000930661-00-000704.txt          0
## 1000181_10K_1999_0001000181-00-000001.txt          0
## 1000227_10K_1999_0000950116-00-000643.txt          0
## 1000228_10K_1999_0000889812-00-001326.txt          0
## 1000230_10K_1999_0001005150-00-000103.txt          0
## [ reached max_ndoc ... 94 more documents ]

I would like to see the frequency of each keyword rather than the combined frequency of all keywords. I am trying to reproduce what the tutorial produces:
[screenshot of the tutorial's expected output omitted]

Answer by gk7wooem:

There are three problems:

  • tokens_ngrams() is applied before the dictionary lookup
  • the dictionary has only one group (key), so all matches are collapsed into a single column
  • with exclusive = FALSE, all other (non-matching) words are kept as well (see the short sketch after this list)
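
To make the third point concrete, here is a minimal sketch with a toy dfm and dictionary made up for illustration (these objects are not from the question):

library(quanteda)
toy_dfm  <- dfm(tokens(c(d1 = "cyber attack on the network")))
toy_dict <- dictionary(list(cyber = c("cyber", "cybersecurity")))

## exclusive = FALSE keeps every non-matching feature next to the key,
## so the dictionary count is buried among "attack", "on", "the", ...
dfm_lookup(toy_dfm, dictionary = toy_dict, exclusive = FALSE)

## exclusive = TRUE (the default) returns only the dictionary key column
dfm_lookup(toy_dfm, dictionary = toy_dict, exclusive = TRUE)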

Your code should be:

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.3
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
# create a sample corpus
texts <- c("This is a sample text mentioning cyber attack.",
           "Cybersecurity is important to protect against cyber threats.",
           "The company experienced a data breach due to a cyber attack.",
           "Cyber criminals are becoming increasingly sophisticated.",
           "Protecting against cyber attacks requires a multi-layered approach.")

corpus <- corpus(texts)

dict <- dictionary(list(cyber = c("cyber", "cybersecurity", "cybercriminals"), 
                        attack = c("cyberattack", "data breach", "protect")))

toks <- quanteda::tokens(corpus)

dfmt <- dfm(toks)
dfmt
#> Document-feature matrix of: 5 documents, 31 features (70.97% sparse) and 0 docvars.
#>        features
#> docs    this is a sample text mentioning cyber attack . cybersecurity
#>   text1    1  1 1      1    1          1     1      1 1             0
#>   text2    0  1 0      0    0          0     1      0 1             1
#>   text3    0  0 2      0    0          0     1      1 1             0
#>   text4    0  0 0      0    0          0     1      0 1             0
#>   text5    0  0 1      0    0          0     1      0 1             0
#> [ reached max_nfeat ... 21 more features ]

dfmt_dict <- dfm_lookup(dfmt, dictionary = dict, levels = 1, 
                         exclusive = TRUE, capkeys = FALSE)
dfmt_dict
#> Document-feature matrix of: 5 documents, 2 features (40.00% sparse) and 0 docvars.
#>        features
#> docs    cyber attack
#>   text1     1      0
#>   text2     2      1
#>   text3     1      0
#>   text4     1      0
#>   text5     1      0
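
A follow-up sketch (not part of the original answer): if you want one column per individual keyword rather than one column per dictionary key, you can use the dictionary as a selection pattern instead of grouping by key, reusing the dfmt and dict objects above. Multi-word patterns such as "data breach" will only match if the tokens were compounded first (for example with tokens_compound()).

## keep only the features that match a dictionary pattern, one column per feature
dfmt_terms <- dfm_select(dfmt, pattern = dict)
dfmt_terms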
