How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?

Posted by envsm3lx on 2023-02-17 in Other


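When using corpus_reshape, some periods are incorrectly treated as sentence breaks. I have a corpus from the pharmaceutical industry, and in many cases "Dr." is wrongly used as a sentence break. This post (Quanteda's corpus_reshape function: how not to break sentences after abbreviations (like "e.g.")) is similar, but unfortunately it does not solve the problem. Here is an example: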

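library("quanteda")

# Two short documents containing abbreviations and an ellipsis
txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Split each document into sentences (nested call, so no pipe operator is needed)
corpus_reshape(corpus(txt), to = "sentences")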

Corpus consisting of 4 documents.
d1.1: "With us we have Dr."
d1.2: "Smith."
d1.3: "We are not sure... where we stand."
d2.1: "The U.S. is south of Canada."
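This only happens in a few cases with "Dr.". I was wondering whether certain words to exclude can be added to the function, since I would like to avoid switching to an alternative function for breaking texts into sentences. Thanks!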
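For illustration, one possible workaround is to mask the period of each problematic abbreviation before reshaping and restore it afterwards. This is only a sketch and is not taken from the thread: the `protect` list and the `_PRD_` placeholder are hypothetical names chosen here, the placeholder is assumed not to occur in the real texts, and `as.character()` on a corpus is assumed to return the sentence texts as a named character vector (as in recent quanteda versions).

```r
library("quanteda")

txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Hypothetical list of abbreviations (without the period) to protect,
# and a hypothetical placeholder that must not occur in the real texts.
protect <- c("Dr", "Prof")
placeholder <- "_PRD_"

# Replace the period after each protected abbreviation with the placeholder.
# gsub() keeps the names of the input vector.
pattern <- paste0("\\b(", paste(protect, collapse = "|"), ")\\.")
masked <- gsub(pattern, paste0("\\1", placeholder), txt, perl = TRUE)

# Segment the masked texts, then put the periods back in the sentence texts.
sents <- corpus_reshape(corpus(masked), to = "sentences")
restored <- gsub(placeholder, ".", as.character(sents), fixed = TRUE)
restored
```

Note that `restored` is a plain named character vector rather than a corpus, so it would need to be passed to `corpus()` again if document variables or further quanteda processing are required.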

r

Source: https://stackoverflow.com/questions/75470895/how-can-you-exclude-certain-words-before-periods-from-being-used-as-sentence-bre