List of word frequencies using R

Posted by 6uxekuva on 2023-11-14 in Other

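I have been using the tm package to run some text analysis. My problem is with creating a list of words together with their associated frequencies.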

library(tm)
library(RWeka)

txt <- read.csv("HW.csv", header = TRUE)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"

myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

#building the TDM

btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))

I typically use the following code to generate the list of words within a frequency range:

frq1 <- findFreqTerms(myTdm, lowfreq=50)


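Is there any way to automate this so that we get a data frame with all words and their frequencies?
The other problem I face is converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors. Is there a simple solution for this?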

#1 2w3kk1z5

Try this:

data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm), 
                      Freq = rowSums(myTdm), 
                      row.names = NULL)
head(FreqMat, 10)
#            ST Freq
# 1       "(it)    1
# 2     "demand    1
# 3  "expansion    1
# 4        "for    1
# 5     "growth    1
# 6         "if    1
# 7         "is    2
# 8        "may    1
# 9       "none    2
# 10      "opec    2


#2 qij5mzcb

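I have the following lines in R that can help create word frequencies and put them in a table. The code reads a text file in .txt format and computes the frequency of each word. I hope this helps anyone interested.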

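# Read the input file, one line per element
avisos <- scan("anuncio.txt", what = "character", sep = "\n")
avisos1 <- tolower(avisos)
# Split on non-word characters and drop the empty tokens this produces
avisos2 <- strsplit(avisos1, "\\W")
avisos3 <- unlist(avisos2)
avisos3 <- avisos3[avisos3 != ""]
freq <- table(avisos3)
freq1 <- sort(freq, decreasing = TRUE)
# Use a real tab ("\t"); "\\t" would write a literal backslash followed by t
temple.sorted.table <- paste(names(freq1), freq1, sep = "\t")
# Output file name is an assumption; writing to anuncio.txt would overwrite the input
cat("Word\tFREQ", temple.sorted.table, file = "anuncio_freq.txt", sep = "\n")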

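If the goal is a data frame rather than a text file, the same base R steps can be sketched as below. This is only a minimal illustration: the sample sentences and the names `lines`, `words`, and `freq_df` are made up here and stand in for whatever the file contains.

```r
# Hypothetical sample text standing in for the lines read from a .txt file
lines <- c("The cat sat on the mat.", "The dog sat too.")

# Lower-case, split on runs of non-word characters, and drop empty tokens
words <- unlist(strsplit(tolower(lines), "\\W+"))
words <- words[nzchar(words)]

# Tabulate the words and convert the table to a data frame
freq_df <- as.data.frame(table(word = words),
                         responseName = "freq",
                         stringsAsFactors = FALSE)

# Sort by descending frequency, breaking ties alphabetically
freq_df <- freq_df[order(-freq_df$freq, freq_df$word), ]
rownames(freq_df) <- NULL
print(freq_df)
```

Splitting on `"\\W+"` instead of `"\\W"` collapses consecutive separators, which reduces the number of empty tokens that need to be filtered out.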

#3 0vvn1miw

Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. For instance, try:

data(crude)
slam::row_sums(TermDocumentMatrix(crude))

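Because row_sums operates on the sparse matrix directly, it may also help with the memory errors mentioned in the question, since no dense terms-by-documents matrix has to be built with as.matrix. A rough sketch of turning the result into a data frame, assuming the tm and slam packages are installed (the names `tdm`, `term_freq`, and `freq_df` are illustrative):

```r
library(tm)    # provides TermDocumentMatrix and the `crude` sample corpus
library(slam)  # sparse-matrix helpers that tm builds on

data("crude")
tdm <- TermDocumentMatrix(crude)

# row_sums() works on the sparse representation, so the full
# dense matrix is never allocated
term_freq <- slam::row_sums(tdm)

# Named numeric vector -> two-column data frame
freq_df <- data.frame(term = names(term_freq),
                      freq = as.vector(term_freq),
                      row.names = NULL,
                      stringsAsFactors = FALSE)

# Most frequent terms first
freq_df <- freq_df[order(-freq_df$freq), ]
head(freq_df, 10)
```

The same pattern should apply to an n-gram matrix such as the `myTdm` from the question, although that has not been checked against the original data.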

#4 4ngedf3f

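Depending on your needs, a few tidyverse functions could serve as a rough solution that offers some flexibility in how you handle capitalization, punctuation, and stop words: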

text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'

stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about

library(tidyverse)
tibble(text = text_string) %>%   # data_frame() is deprecated in favor of tibble()
  mutate(text = tolower(text)) %>% 
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>% 
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest(tokens) %>%             # name the list column explicitly
  count(tokens) %>% 
  filter(!tokens %in% stop_words) %>% 
  mutate(freq = n / sum(n)) %>% 
  arrange(desc(n))

# A tibble: 64 x 3
  tokens      n   freq
  <chr>   <int>  <dbl>
1 i           5 0.0581
2 with        5 0.0581
3 is          4 0.0465
4 words       3 0.0349
5 into        2 0.0233
6 list        2 0.0233
7 of          2 0.0233
8 problem     2 0.0233
9 run         2 0.0233
10 that       2 0.0233
# ... with 54 more rows


#5 mzsu5hc0

library(plyr)  # assumption: count(df, vars = ...) here is plyr::count
a = scan(file='~/Desktop/test.txt', what="list")
a1 = data.frame(lst=a)
count(a1, vars="lst")

This seems to work for getting simple frequencies. I used scan because I had a txt file, but it should work with read.csv too.

#6 fnx2tebb

Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?

#7 xt0899hw

Using the qdap package:

text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'

qdap::freq_terms(text_string, stopwords = c("of", "a", "i", "with", "is"), top = Inf)
#    WORD        FREQ
# 1  the            5
# 2  words          3
# 3  and            2
# 4  data           2
# 5  for            2
# 6  frequency      2
# 7  into           2
# 8  list           2
# 9  problem        2
# 10 run            2
# ...

By default the function shows only the top 20, so we set top = Inf. You can pass any character vector to the stopwords argument: c(tm::stopwords(), "other", "stop", "words")
