Extracting total word frequencies from a vector in R

Posted by 5us2dqdw on 2023-11-14 in Other

This is my vector:

posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players.  they have private message boards where it appears most of their work goes on.  i would bet they are posting more there than in jita speakers corner.  i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold.  its sort of like ccp used to post here on the forums then they stopped.  so they got a csm to represent players and use jita park forum to interact.  now the csm no longer posts there as they have their internal forums where they hash things out.  perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")

I want a data frame as the result, containing the words and how often each one occurs.
So the result should look something like this:

word   count
a        300
and      260
be       200
...      ...
...      ...


I tried using tm:

library(tm)

corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
m <- DocumentTermMatrix(corpus)


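Running findFreqTerms(m, lowfreq = 0, highfreq = Inf) only gives me the words. I understand that m is a sparse matrix, so how do I extract the words together with their frequencies?
Is there a simpler way to do this, perhaps without using tm at all?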

Answer 1 (hwazgwia)

posts = c("originally by: cearainmy only concern with csm is they seem a bit insulated from players.  they have private message boards where it appears most of their work goes on.  i would bet they are posting more there than in jita speakers corner.  i think that is unfortunate because its hard to know who to vote for if you never really see what positions they hold.  its sort of like ccp used to post here on the forums then they stopped.  so they got a csm to represent players and use jita park forum to interact.  now the csm no longer posts there as they have their internal forums where they hash things out.  perhaps we need a csm to the csm to find out what they are up to.i don't think you need to worry too much. the csm has had an internal forum for over 2 years, although it is getting used a lot more now than it was. a lot of what goes on in there is nda stuff that we couldn't discuss anyway.i am quite happy to give my opinion on any topic, to the extent that the nda allows, and i" , "fot those of you bleating about imagined nda scandals as you attempt to cast yourselves as the julian assange of eve, here's a quote from the winter summit thread:originally by: sokrateszday 3post dominion 0.0 (3hrs!)if i had to fly to iceland only for this session i would have done it. we had gathered a list of items and prepared it a bit. important things we went over were supercaps, force projection, empire building, profitability of 0.0, objectives for small gangs and of course sovereingty.the csm spent 3 hours talking to ccp about how dominion had changed 0.0, and the first thing on sokratesz's list is supercaps. its not hard to figure out the nature of the discussion.on the other hand, maybe you're right, and the csm's priority for this discussion was to talk about how underpowered and useless supercarriers are and how they needed triple the ehp and dps from their current levels?(it wasn't)")
posts <- gsub("[[:punct:]]", "", posts)  # remove punctuation
posts <- gsub("[[:digit:]]", "", posts)  # remove numbers
word_counts <- as.data.frame(table(unlist(strsplit(posts, " ", fixed = TRUE))))  # split on spaces and count
word_counts <- word_counts[word_counts$Var1 != "", ]  # drop the empty strings left by repeated spaces
head(word_counts)
#       Var1 Freq
# 2        a    8
# 3    about    3
# 4   allows    1
# 5 although    1
# 6       am    1
# 7       an    1
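
The result above keeps the default column names Var1 and Freq and is ordered alphabetically, whereas the question asks for word and count columns. A minimal sketch of one way to rename and sort, assuming a made-up sample_posts vector (hypothetical, not the data from the question):

```r
# Hypothetical sample input standing in for the real `posts` vector
sample_posts <- c("the csm and the nda", "the csm posts")

# Split on spaces and tabulate, keeping the words as character rather than factor
word_counts <- as.data.frame(
  table(unlist(strsplit(sample_posts, " ", fixed = TRUE))),
  stringsAsFactors = FALSE
)

# Rename the default Var1/Freq columns to the names asked for in the question
names(word_counts) <- c("word", "count")

# Sort by descending frequency and renumber the rows
word_counts <- word_counts[order(word_counts$count, decreasing = TRUE), ]
rownames(word_counts) <- NULL

print(word_counts)
```

With this toy input the first row is "the" with a count of 3.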

Answer 2 (mxg2im7a)

A simple base R solution, assuming all words are separated by spaces:

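words <- strsplit(posts, " ", fixed = TRUE)  # split each post on literal spaces
words <- unlist(words)                       # flatten into a single character vector
counts <- table(words)                       # count the occurrences of each word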

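names(counts) holds the words, and the values are the counts.
You may want to use gsub on posts before splitting to get rid of ( ) , . ? : and the suffixes 's, 't and 're, as they appear in your example: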

posts <- gsub("'t|'s|'re", "", posts)   # drop the 't, 's and 're suffixes
posts <- gsub("[(),.?:]", " ", posts)   # replace punctuation with spaces
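
The cleanup only affects the counts if it runs before the split. A minimal sketch of the two snippets combined in that order, assuming a made-up sample_posts vector (hypothetical, not the data from the question):

```r
# Hypothetical sample input standing in for the real `posts` vector
sample_posts <- c("i don't think so. it's fine (really)", "so it wasn't fine?")

# 1. Clean first: drop the 't / 's / 're suffixes, then turn punctuation into spaces
cleaned <- gsub("'t|'s|'re", "", sample_posts)
cleaned <- gsub("[(),.?:]", " ", cleaned)

# 2. Split on spaces and drop the empty strings produced by consecutive spaces
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
words <- words[words != ""]

# 3. Tabulate: names(counts) are the words, the values are the counts
counts <- table(words)
print(counts)
```

Stripping the suffixes this way leaves stems such as "don" and "wasn" in the table, so whether that is acceptable depends on what the counts are used for.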

Answer 3 (bwitn5fc)

You have two options, depending on whether you want the word counts per document or across all documents.

All documents

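library(tm)
library(dplyr)

# as.matrix() returns the full term counts (inspect() may only show a sample in newer tm versions)
count <- as.data.frame(t(as.matrix(m)))  # terms as rows, documents as columns
sel_cols <- colnames(count)
count$word <- rownames(count)
rownames(count) <- seq_len(nrow(count))
count$count <- rowSums(count[, sel_cols])  # total per term across all documents
count <- count %>% select(word, count)
count <- count[order(count$count, decreasing = TRUE), ]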

### RESULT of head(count)

#     word count
# 140  the    14
# 144 they    10
# 4    and     9
# 25   csm     7
# 43   for     5
# 55   had     4

This should capture the occurrences across all documents (by using rowSums).

Per document

I would suggest using the tidytext package if you want the word frequencies per document.

library(tidytext)
m_td <- tidy(m)
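
For a DocumentTermMatrix, tidy() produces a long table with one row per document and term. To illustrate that shape without any packages, here is a base R stand-in (not the tidytext implementation), assuming a made-up sample_posts vector (hypothetical, not the data from the question):

```r
# Hypothetical sample input standing in for the real `posts` vector
sample_posts <- c("the csm and the nda", "the csm posts")

# Build one small data frame per document, then stack them
per_doc <- do.call(rbind, lapply(seq_along(sample_posts), function(i) {
  tab <- table(strsplit(sample_posts[i], " ", fixed = TRUE)[[1]])
  data.frame(
    document = i,              # index of the post in the vector
    term = names(tab),         # the word
    count = as.integer(tab),   # occurrences of the word in this document
    stringsAsFactors = FALSE
  )
}))

print(per_doc)
```

Each word appears once per document it occurs in, so "the" gets a row with count 2 for the first post and a separate row with count 1 for the second.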

Answer 4 (efzxgjgh)

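The tidytext package allows fairly intuitive text mining, including tokenization. It is designed to work in a tidyverse pipeline, so it provides a list of stop words ("a", "the", "to", etc.) that you can exclude with dplyr::anti_join. Here, you could do: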

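library(dplyr)    # or if you want it all, `library(tidyverse)`
library(tidytext)

tibble(posts) %>%                            # tibble() replaces the deprecated data_frame()
    unnest_tokens(word, posts) %>%           # one lowercase token per row
    anti_join(stop_words, by = "word") %>%   # drop the stop words
    count(word, sort = TRUE)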

## # A tibble: 101 × 2
##        word     n
##       <chr> <int>
## 1       csm     7
## 2       0.0     3
## 3       nda     3
## 4       bit     2
## 5       ccp     2
## 6  dominion     2
## 7     forum     2
## 8    forums     2
## 9      hard     2
## 10 internal     2
## # ... with 91 more rows

Answer 5 (xdyibdwo)

termFreq returns a named vector (the names are the words and the values are the word counts):

library(tm)

# PlainTextDocument() expects the text itself, so pass the character vector directly
txt <- PlainTextDocument(posts)
termFreq(txt, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE))

Alternatively, use the qdap package, which returns a data frame:

qdap::freq_terms(posts, top = Inf)
