Jaccard similarity in R

Posted by r1zhe5dt on 2023-04-27 in Other

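I have a text data frame containing 792 agreements, which I have preprocessed and converted to a dfm. I want to compute similarity scores, and I decided to use both Jaccard and cosine similarity.
When I compute cosine similarity, I get the result in about half a minute. But for the past two days, whenever I do the same thing with Jaccard, my computer starts whirring and R terminates. Am I missing something? Does the jaccard method no longer work?
My code is below.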

library(quanteda)
library(tidyr)

# Compute the cosine similarity matrix between documents
s1 <- textstat_simil(trimmed_dfm, method = "cosine", margin = "documents")

# Convert the output into a data frame (it first needs to be converted to a matrix)
cosine_simil_df <- as.data.frame(as.matrix(s1))
# Create a column with the row names of the matrix
cosine_simil_df$PTA1 <- row.names(cosine_simil_df)
# Use pivot_longer() to reshape the data into tidy (long) format
cosine_simil_df_final <- pivot_longer(cosine_simil_df, cols = -PTA1, names_to = "PTA2", values_to = "similarity")
head(cosine_simil_df_final)

##### Let's try with the Jaccard similarity
s2 <- textstat_simil(trimmed_dfm, method = "jaccard", margin = "documents")
# The line above is where it all goes wrong

jaccard_simil_df <- as.data.frame(as.matrix(s2))

jaccard_simil_df$PTA1 <- row.names(jaccard_simil_df)

Source: https://stackoverflow.com/questions/76062900/jaccard-similarity-in-r

1 Answer

hmtdttj41#

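I did not optimize the function for Jaccard similarity as much as I did for cosine similarity. You can try drop0 = TRUE to reduce memory usage. proxyC is the package behind textstat_simil(), and proxyC::simil() is the function it calls.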
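As a rough sketch of how that suggestion might be applied to document-level data, the snippet below calls proxyC::simil() on a small sparse document-feature matrix. The matrix `toy` and the threshold 0.6 are made up for illustration; `min_simil` is an additional proxyC::simil() argument that the answer does not mention, and whether a quanteda dfm can be passed in place of `toy` unchanged is an assumption (a dfm is a sparse matrix, so it should behave similarly).

```r
library(Matrix)
library(proxyC)

# Hypothetical toy document-feature matrix: 4 documents (rows) x 5 features (columns)
toy <- Matrix(c(1, 1, 0, 0, 0,
                1, 1, 1, 0, 0,
                0, 0, 0, 1, 1,
                0, 0, 0, 0, 1),
              nrow = 4, byrow = TRUE, sparse = TRUE)

# Jaccard similarity between rows (margin = 1).
# drop0 = TRUE keeps zero similarities out of the sparse result;
# min_simil = 0.6 (an illustrative value) additionally skips weak pairs.
sim <- proxyC::simil(toy, margin = 1, method = "jaccard",
                     min_simil = 0.6, drop0 = TRUE)
sim
```

With most pairs dropped, the result stays sparse, which is the point of the memory-saving suggestion; converting it with as.matrix() would make it dense again.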

proxyC::simil(matrix(c(1, 0, 0, 1), nrow = 2), 
              matrix(c(2, 2, 0, 0), nrow = 2), method = "jaccard")
#> 2 x 2 sparse Matrix of class "dgTMatrix"
#>         
#> [1,] 1 1
#> [2,] 0 0
proxyC::simil(matrix(c(1, 0, 0, 1), nrow = 2), 
              matrix(c(2, 2, 0, 0), nrow = 2), method = "jaccard", drop0 = TRUE)
#> 2 x 2 sparse Matrix of class "dgTMatrix"
#>         
#> [1,] 1 1
#> [2,] . .
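The values printed above can be checked by hand. The sketch below recomputes binary Jaccard similarity (shared non-zero features divided by features present in either row) in base R for the same two toy matrices. `jaccard_rows` is a hypothetical helper written for this illustration, not a proxyC function, and it builds a dense result, so it is only meant for small inputs.

```r
# Same toy inputs as in the answer; rows are the items being compared
x <- matrix(c(1, 0, 0, 1), nrow = 2)
y <- matrix(c(2, 2, 0, 0), nrow = 2)

# Hypothetical helper: binary Jaccard between each row of x and each row of y
jaccard_rows <- function(x, y) {
  bx <- (x != 0) * 1                    # presence/absence indicators
  by <- (y != 0) * 1
  inter <- tcrossprod(bx, by)           # number of features shared by each pair
  union <- outer(rowSums(bx), rowSums(by), "+") - inter
  ifelse(union > 0, inter / union, 0)   # define similarity as 0 when both rows are empty
}

res <- jaccard_rows(x, y)
res
```

The first row of x shares its only feature with both rows of y, giving 1 and 1; the second row shares none, giving 0 and 0, which matches the proxyC output.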

Answered 2023-04-27
