R: Counting the number of tokens per year

hk8txs48, posted on 2022-12-20 in Other

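I have written a small R script. The input is a set of text files (several thousand journal articles). I generated the metadata, including the publication year, from the file names. Now I want to compute the total number of tokens per year, but I am not making any progress.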

# Load the packages used below
library(readtext)
library(quanteda)

# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_", 
                        docvarnames = c("Unit", "Year", "Volume", "Issue")) 
# Keep only the first four characters of the Year field
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 1, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)

Does anyone know a solution?
I used the tokens_by function from the quanteda package, which appears to be obsolete.


dgiusagp1#

Thanks! I could not get your script to work, but it inspired me to develop an alternative solution:

# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)

# Set the directory containing the text files
dir <- "/Textfiles/SPARA_paragraphs"

# Read in the text files using the readtext function
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_", docvarnames = c("Unit", "Year", "Volume", "Issue"))

# Extract the year from the file name
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 1, 4)

# Group the data by year and summarize by tokens
rawdata_SPARA_grouped <- rawdata_SPARA %>% 
    group_by(Year) %>% 
    summarize(tokens = sum(ntoken(text)))

# Print number of absolute tokens per year

print(rawdata_SPARA_grouped)
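
For a quick sanity check of the group-and-sum logic without any packages, here is a minimal base R sketch. The data frame `toy`, its texts, and its years are made up for illustration, and splitting on whitespace is only a crude stand-in for quanteda's tokenizer, so the counts will generally differ from what `ntoken()` reports on real text.

```r
# Toy data standing in for the readtext data frame (texts and years are made up)
toy <- data.frame(
  Year = c("2005", "2005", "2006"),
  text = c("one two three", "four five", "six seven eight nine"),
  stringsAsFactors = FALSE
)

# Crude token count: split on runs of whitespace (NOT quanteda's tokenizer)
toy$n_tokens <- lengths(strsplit(toy$text, "\\s+"))

# Sum the per-document counts within each year
per_year <- aggregate(n_tokens ~ Year, data = toy, FUN = sum)
print(per_year)
```

The same two steps (count per document, then sum per group) are what the dplyr pipeline above performs with `ntoken()` and `summarize()`.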

pbgvytdp2#

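You do not need to take a substring with substr(rawdata_SPARA$Year, 0, 4). When you call the readtext function, it extracts the year from the file name. In the example below, the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is inserted into the readtext object automatically. Because that object inherits from data.frame, you can use standard data manipulation functions, such as those in the dplyr package.
Then group_by year and summarize the tokens; the number of tokens is computed by quanteda's ntoken function.
See the code below: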

library(readtext)
library(quanteda)
library(dplyr)  # provides %>%, group_by() and summarize() used below

# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1")
# Overwrite the year with random values so the sample spans several years (demo only)
rt$year <- sample(2005:2007, nrow(rt), replace = TRUE)

# Calculate tokens
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)

# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))

Output:

# A tibble: 3 × 2
   year total_tokens
  <int>        <int>
1  2005         5681
2  2006        26564
3  2007        24119
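
If you would rather work from the corpus object than from the readtext data frame, a possible sketch is shown below. It assumes a toy corpus with a document variable named `Year`; the texts, document names, and years are invented for illustration. It tokenizes once, counts tokens per document with `ntoken()`, and sums those counts by year with base R's `tapply()`.

```r
library(quanteda)

# Toy corpus; the texts and years here are made up for illustration
corp <- corpus(
  c(d1 = "One two three.", d2 = "Four five.", d3 = "Six seven eight nine."),
  docvars = data.frame(Year = c("2005", "2005", "2006"))
)

# Tokenize once, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Sum the per-document token counts within each year
tokens_per_year <- tapply(ntoken(toks), docvars(toks, "Year"), sum)
print(tokens_per_year)
```

With the corpus from the question, the same two lines should apply to `SPARA_corp` in place of `corp`, provided its `Year` document variable is set as shown there.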
