R: Visualizing the frequency of dictionary terms with quanteda

Asked by q8l4jmvw on 2023-01-18

I'm analyzing the text of several thousand newspaper articles, and I want to build an issue dictionary (e.g. health care, taxes, crime). Each dictionary entry consists of several terms (e.g. doctor, nurse, hospital).
As a diagnostic, I'd like to see which terms make up the bulk of each dictionary category.
The code below shows where I am. I've found a way to print the top features of each dictionary entry separately, but I'd like to end up with a single coherent data frame that I can visualize.

library(quanteda)
library(purrr)   # for map(), used further down

# set path
path_data <- system.file("extdata/", package = "readtext")

# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
corp_inaug <- corpus(dat_inaug, text_field = "texts")

# tokenize: drop punctuation, lowercase, and remove English stopwords
tok <- corp_inaug %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

# I have about eight or nine dictionaries
dict <- dictionary(list(liberty = c("freedom", "free"),
                        justice = c("justice", "law")))

# This produces a dfm of all the individual terms making up the dictionary
tok %>%
  tokens_select(pattern = dict) %>%
  dfm() %>%
  topfeatures()
  
# This produces the top features making up just the 'justice' dictionary entry
tok %>%
  tokens_select(pattern = dict["justice"]) %>%
  dfm() %>%
  topfeatures()
# This gets me close to what I want, but I can't figure out how to collapse it
# to visualize the most frequent terms making up each dictionary category
dict %>%
  map(function(x) tokens_select(tok, pattern = x)) %>%
  map(dfm) %>%
  map(topfeatures)
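
For reference, one possible way to collapse the named list returned by the pipeline above into a single long data frame is sketched here (the names top_list and top_df are placeholders of my own, not from the original post; the answer below shows a cleaner route via textstat_frequency()):

# Sketch: store the result of the pipeline above, then stack the named list
# of topfeatures() vectors into one data frame with a dict_key column
top_list <- dict %>%
  map(function(x) tokens_select(tok, pattern = x)) %>%
  map(dfm) %>%
  map(topfeatures)

top_df <- do.call(rbind, lapply(names(top_list), function(key) {
  data.frame(dict_key  = key,                      # dictionary category
             feature   = names(top_list[[key]]),   # term
             frequency = unname(top_list[[key]]))  # term count
}))
top_df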

Answer from 8wtpewkr:

I've tidied up the code and used data_corpus_inaugural as an example. This shows how to get frequency data by dictionary key, for the selected matches of each key's dictionary values.

library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

toks <- data_corpus_inaugural %>% 
  tokens(remove_punct = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(pattern = stopwords("en"))

dict <- dictionary(list(liberty = c("freedom", "free"), 
                        justice = c("justice", "law")))

dfmat_list <- lapply(names(dict), function(x) {
  tokens_select(toks, dict[x]) %>%
    dfm() %>%
    textstat_frequency() %>%
    cbind(data.frame(dict_key = x), .)
})

do.call(rbind, dfmat_list)
#>    dict_key feature frequency rank docfreq group
#> 1   liberty freedom       185    1      36   all
#> 2   liberty    free       183    2      49   all
#> 11  justice justice       142    1      47   all
#> 21  justice     law       129    2      38   all

Created on 2023-01-15 with reprex v2.0.2
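
Since the original goal was visualization, a possible follow-up (my own sketch, not part of the answer above) is to assign the combined table to an object and plot it with ggplot2, with one faceted bar panel per dictionary key:

library(ggplot2)

# combine the per-key frequency tables (same do.call(rbind, ...) as above)
freq_df <- do.call(rbind, dfmat_list)

# horizontal bar chart of term frequencies, faceted by dictionary key
ggplot(freq_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ dict_key, scales = "free_y") +
  labs(x = NULL, y = "Frequency")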
