Identifying and grouping synonyms in R

gorkyyrv  posted 12 months ago in Other

I am trying to identify and aggregate the synonyms in a given dataset. See the sample data below.

library(tm)
library(SnowballC)

dataset <- c("dad glad accept large admit large accept dad big large big accept big accept dad dad Happy dad accept glad papa dad Happy dad glad dad dad papa admit Happy big accept accept big accept dad Happy admit Happy Happy glad Happy dad accept accept large daddy large accept large large large big daddy accept admit dad admit daddy dad admit dad admit Happy accept accept Happy daddy accept admit")

docs <- Corpus(VectorSource(dataset))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)

Output:

accept    dad  happy  admit  large    big  daddy   glad   papa 
    15     14      9      8      8      6      4      4      2


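I want to use the wordnet package, which I have downloaded and installed, to find the synonyms of each of the words above. For example, to get the synonyms of "accept" as a verb, I can do this: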

library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")

filter <- getTermFilter("ExactMatchFilter", "accept", TRUE)
terms <- getIndexTerms("VERB", 1, filter)
getSynonyms(terms[[1]])


Output:

[1] "accept"    "admit"     "assume"    "bear"      "consent"   "go for"    "have"      "live with"
 [9] "swallow"   "take"      "take on"   "take over"


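Now I want to combine these two result sets so that the synonyms are grouped as follows: within each group, the most frequent word is marked as rank 1, and the remaining words are grouped under it. Something like this: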

id  word    word_count  syn_group   rank
1   accept  15          1           1
5   admit   8           1           2
2   dad     14          2           1
8   daddy   4           2           2
9   papa    2           2           3
3   happy   9           3           1
7   glad    4           3           2
4   large   8           4           1
6   big     6           4           2


The counts can then be aggregated per group like this:

id  word    word_count
1   accept  15+8
2   dad     14+4+2
3   happy   9+4
4   large   8+6


The final result, sorted by the combined count, would be:

id  word    word_count
1   accept  23
2   dad     20
3   large   14
4   happy   13


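I have run into several problems, including getting getIndexTerms to loop over the words regardless of whether they are nouns, verbs, and so on. I hope this all makes sense. Any help would be much appreciated. Thanks.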

Answer #1 (fnatzsnv)

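Once the synonym groups and ranks have been assigned, the aggregation can be done with dplyr as follows (df is defined under "Data" below):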

library(dplyr)
df %>% 
  group_by(syn_group) %>%
  mutate(sum_word_count = sum(word_count)) %>% 
  filter(rank == 1)

Data:

df <- read.table(text = "id  word    word_count  syn_group   rank
1   accept  15          1           1
5   admit   8           1           2
2   dad     14          2           1
8   daddy   4           2           2
9   papa    2           2           3
3   happy   9           3           1
7   glad    4           3           2
4   large   8           4           1
6   big     6           4           2", header = TRUE)
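
For readers without dplyr, here is a minimal base R sketch of the same aggregation. It goes one step further than the pipeline above and collapses each group to its rank 1 word, sorted by the combined count, which is the final table asked for in the question. The names totals, heads and result are only illustrative.

```r
# Grouped table copied from the data block above.
df <- read.table(text = "id  word    word_count  syn_group   rank
1   accept  15          1           1
5   admit   8           1           2
2   dad     14          2           1
8   daddy   4           2           2
9   papa    2           2           3
3   happy   9           3           1
7   glad    4           3           2
4   large   8           4           1
6   big     6           4           2", header = TRUE, stringsAsFactors = FALSE)

# Sum the counts within each synonym group.
totals <- aggregate(word_count ~ syn_group, data = df, FUN = sum)

# The rank 1 word of each group serves as the group's label.
heads <- df[df$rank == 1, c("syn_group", "word")]

# Attach the totals to the labels and sort by combined count, largest first.
result <- merge(heads, totals, by = "syn_group")
result <- result[order(-result$word_count), ]
result$id <- seq_len(nrow(result))
result <- result[, c("id", "word", "word_count")]
rownames(result) <- NULL
result
```

The ids in result are simply the positions after sorting, as in the final table of the question.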


Next time, please post the output of dput so that the data can be copied directly.
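
As a small illustration of that request, the sketch below builds a two-row data frame (values taken from the table above, the name small is arbitrary) and prints its dput representation. The printed text is plain R code, so pasting it into a question lets others rebuild the object exactly.

```r
# A tiny data frame using two rows from the table above.
small <- data.frame(id = c(1L, 5L),
                    word = c("accept", "admit"),
                    word_count = c(15L, 8L),
                    stringsAsFactors = FALSE)

# Print an R expression that recreates the object; this is what to paste.
dput(small)

# Round trip: capture the printed text and evaluate it again.
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
all.equal(small, rebuilt)
```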

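Edit: Here is some code to get you started on looping over the words and the parts of speech and storing the synonyms. What remains is to determine whether the current term is a synonym of an earlier term, in which case you already have its synonyms and can assign it to that term's synonym group. Next, you need to store the results. Finally, you need to compute the rank, which is just seq_along over the synonyms plus grep to find the rank position. The comments hint at where that extra code would go.
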
# m is the term-document matrix from the question; keep terms as character
d <- data.frame(Term = row.names(m), word_count = m[, 1], stringsAsFactors = FALSE)
all_pos <- c("ADJECTIVE", "ADVERB", "NOUN", "VERB")
syns <- vector("list", length(all_pos))
for (w in seq_len(nrow(d))) {
  # if the synonyms of d$Term[w] have already been calculated, skip the current w
  emf <- getTermFilter("ExactMatchFilter", d$Term[w], TRUE)
  for (i in seq_along(syns)) {
    # look the term up once per part of speech
    terms <- getIndexTerms(all_pos[i], 1, emf)
    if (is.null(terms)) {
      syns[i] <- NA
    } else {
      syns[[i]] <- getSynonyms(terms[[1]])
    }
  }
  # store the results of syns for the current w
}
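
The remaining steps described in the edit (skipping terms that already belong to a group, assigning a group id, and ranking within each group) can be sketched without a WordNet installation. In the sketch below, syn_lookup is a hand-written stand-in for the stored WordNet results: only the "accept" entry is copied from the question, and the other three entries are assumptions made for illustration. It is one possible way to finish the loop, not the only one.

```r
# Word counts from the question, listed in the id order of its target table.
counts <- c(accept = 15, dad = 14, happy = 9, large = 8, admit = 8,
            big = 6, glad = 4, daddy = 4, papa = 2)

# Stand-in for stored WordNet lookups. Only "accept" comes from the
# question's output; the other entries are assumed for illustration.
syn_lookup <- list(
  accept = c("accept", "admit", "assume", "bear", "consent", "go for",
             "have", "live with", "swallow", "take", "take on", "take over"),
  dad    = c("dad", "daddy", "papa"),
  happy  = c("happy", "glad"),
  large  = c("large", "big")
)

d <- data.frame(word = names(counts), word_count = unname(counts),
                stringsAsFactors = FALSE)
d <- d[order(-d$word_count), ]   # most frequent first; ties keep their order
d$id <- seq_len(nrow(d))
d$syn_group <- NA_integer_

next_group <- 1L
for (w in d$word) {
  # Skip words that an earlier, more frequent word has already claimed.
  if (!is.na(d$syn_group[d$word == w])) next
  members <- union(w, syn_lookup[[w]])   # a missing entry yields just w
  unclaimed <- d$word %in% members & is.na(d$syn_group)
  d$syn_group[unclaimed] <- next_group
  next_group <- next_group + 1L
}

# Rank within each group: the highest count gets rank 1.
d$rank <- as.integer(ave(-d$word_count, d$syn_group,
                         FUN = function(x) rank(x, ties.method = "first")))
d <- d[order(d$syn_group, d$rank),
       c("id", "word", "word_count", "syn_group", "rank")]
rownames(d) <- NULL
d
```

Because the words are visited from most to least frequent, the word that opens a group is always its rank 1 member. A word that appears in the synonym sets of two different heads is assigned to whichever head is visited first, which may or may not be the desired behaviour.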

Answer #2 (hlswsv35)

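I was able to extract French synonyms automatically from a website by printing each result page to PDF and parsing the text, as follows: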

library(stringr)
library(pagedown)
library(pdftools)
path_Save_PDF <- "D:\\"
base_Url <- "https://dictionary.reverso.net/french-synonyms/"
words <- c("fâché")
nb_Words <- length(words)
list_Text <- list()

for(i in 1 : nb_Words)
{
  print(i)
  pdf_File <- paste0(path_Save_PDF, words[i], ".pdf")
  chrome_print(input = paste0(base_Url, words[i]), output = pdf_File)
  list_Text[[i]] <- pdftools::pdf_text(pdf_File)
  list_Text[[i]] <- strsplit(x = list_Text[[i]], split = "\n")
}

save(list_Text, file = "list_Text.RData")

list_Synonymes <- list()
for(i in 1 : nb_Words)
{
  print(i)
  id_Lines_Synonymes <- which(str_detect(string = list_Text[[i]][[1]], pattern = "[:space:]{4,8}\\d{1,2}"))
  text_Synonymes <- list_Text[[i]][[1]][id_Lines_Synonymes]
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "Facebook®(.*)Visit Site")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "and post updates\\.")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "par extension au sens figuré")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "details")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "\\d")
  text_Synonymes <- stringr::str_squish(text_Synonymes)
  text_Synonymes <- paste0(text_Synonymes, collapse = ",")
  text_Synonymes <- stringr::str_replace_all(string = text_Synonymes, pattern = "\\,\\,", replacement = "\\,")
  text_Synonymes <- base::strsplit(text_Synonymes, ",")[[1]]
  list_Synonymes[[i]] <- text_Synonymes
}

names(list_Synonymes) <- words
list_Synonymes

list_Synonymes
$fâché
 [1] "dépité"                                       
 [2] " grognon"                                     
 [3] " mécontent"                                   
 [4] " morfondu"                                    
 [5] " transi"                                      
 [6] " horripilé"                                   
 [7] " irrité"                                      
 [8] " contrarié"                                   
 [9] " ennuyé"                                      
[10] " frissonnant"                                 
[11] "navré"                                        
[12] " désolé"                                      
[13] "vexé"                                         
[14] " indisposé"                                   
[15] " piqué"                                       
[16] "en colère"                                    
[17] " mécontent"                                   
[18] "désolé"                                       
[19] " navré"                                       
[20] "brouillé avec quelqu'un"                      
[21] " en froid"                                    
[22] " être incompétent dans un domaine particulier"
[23] " ne rien comprendre"

After that, the list can be used to group the synonyms together.
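
As a rough sketch of that last step: the scraped entries above still carry leading spaces and duplicates, so they probably need trimming and de-duplicating before they are matched against other words. The vector raw below copies a few entries from the output above, while corpus_words is a made-up example vector and not part of the original answer.

```r
# A few entries copied from the output above, with their leading spaces
# and duplicates left in.
raw <- c("dépité", " grognon", " mécontent", "en colère",
         " mécontent", "désolé", " désolé")

# Trim surrounding whitespace, then drop duplicates.
clean <- unique(trimws(raw))

# Hypothetical words to be grouped (assumed for illustration only).
corpus_words <- c("grognon", "content", "désolé", "fâché")

# A word belongs to the group if it is the head word or one of its synonyms.
in_group <- corpus_words %in% c("fâché", clean)
data.frame(word = corpus_words, in_fache_group = in_group)
```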