I'm analyzing the text of several thousand newspaper articles and want to build an issue dictionary (e.g. healthcare, taxes, crime, etc.). Each dictionary entry consists of several terms (e.g. doctor, nurse, hospital, etc.).
As a diagnostic, I'd like to see which terms make up the bulk of each dictionary category.
The code below shows where I am. I've found a way to print the top features of each dictionary entry separately, but I'd like to end up with a single coherent data frame so that I can visualize the results.
library(quanteda)
# set path
path_data <- system.file("extdata/", package = "readtext")
# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
corp_inaug <- corpus(dat_inaug, text_field = "texts")
corp_inaug %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove") -> tok
# I have about eight or nine dictionaries
dict <- dictionary(list(liberty = c("freedom", "free"),
                        justice = c("justice", "law")))
# This produces a dfm of all the individual terms making up the dictionary
tok %>%
  tokens_select(pattern = dict) %>%
  dfm() %>%
  topfeatures()
# This produces the top features making up just the 'justice' dictionary entry
tok %>%
  tokens_select(pattern = dict['justice']) %>%
  dfm() %>%
  topfeatures()
# This gets me close to what I want, but I can't figure out how to collapse it
# to visualize the most frequent terms making up each dictionary category
library(purrr)  # map() comes from purrr, not base R
dict %>%
  map(function(x) tokens_select(tok, pattern = x)) %>%
  map(dfm) %>%
  map(topfeatures)
1 Answer
I tidied up the code and used
data_corpus_inaugural
as the example corpus. This shows how to get frequency data by dictionary key, for the selected matches of the dictionary values within each key. Created on 2023-01-15 with reprex v2.0.2.
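The answer's code block did not survive the page scrape, so here is a minimal sketch of the approach it describes, reconstructed under assumptions: it loops over the dictionary keys in base R, selects each key's patterns with `tokens_select()`, takes `topfeatures()` of the resulting dfm, and row-binds the results into one data frame with a `key` column. The column names (`key`, `term`, `frequency`) are illustrative choices, not the answerer's.

```r
library(quanteda)

# example corpus shipped with quanteda, as used in the answer
toks <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

dict <- dictionary(list(liberty = c("freedom", "free"),
                        justice = c("justice", "law")))

# one data frame row per (key, term), bound together across keys
freq_list <- lapply(names(dict), function(key) {
  tf <- toks %>%
    tokens_select(pattern = dict[key]) %>%  # keep only this key's matches
    dfm() %>%
    topfeatures()
  data.frame(key = key, term = names(tf), frequency = unname(tf))
})
freq_by_key <- do.call(rbind, freq_list)
freq_by_key
```

From `freq_by_key` a faceted bar chart (e.g. ggplot2 with `facet_wrap(~ key)`) shows at a glance which terms dominate each dictionary category.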