Stemming and lemmatization in Spark and Scala

azpvetkf · posted 2021-05-27 · in Spark
Follow (0) | Answers (0) | Views (242)

I am using the Stanford NLP library to stem and lemmatize a sentence, for example: "Cars are an easy way for commuting. But there are too many cars on the road these days."
The expected output is therefore:

car be easy way commute car road day

But what I got is:

ArrayBuffer(car, easy, way, for, commute, but, there, too, many, car, road, these, day)

Here is the code:

import java.util.Properties
import scala.collection.mutable.ArrayBuffer
import scala.collection.JavaConversions._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}

val stopWords = sc.broadcast(
  scala.io.Source.fromFile("src/main/common-english-words.txt").getLines().toSet).value

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}
val lemmatized = stringRDD.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)
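A likely reason the stop words never match: the textfixer.com list linked below is one long comma-separated line rather than one word per line, so `getLines().toSet` produces a single-element set containing the entire line. A minimal parsing sketch, using an inline sample string in place of the file (the sample contents are illustrative):

```scala
// Sample in the same comma-separated format as common-english-words.txt:
val raw = "a,able,about,are,be,but,for,there,these,too,many"

// For a single comma-separated line, split on commas instead of
// relying on getLines(), which would yield the whole line as one "word":
val stopWords: Set[String] = raw.split(",").map(_.trim.toLowerCase).toSet

println(stopWords.contains("for"))    // prints true
println(stopWords.contains("these"))  // prints true
```

Separately, note that the `lemma.length > 2` filter in `plainTextToLemmas` is what drops "be" (length 2) from the actual output, even though the expected output contains it; relaxing that filter and relying on the stop-word set instead would keep it.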

I took this from the Advanced Analytics with Spark book. It seems the stop words are not being removed, and "are" is also not converted to "be". Can we add or remove rules in these libraries?
http://www.textfixer.com/resources/common-english-words.txt
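Unrelated to the wrong output, but worth flagging since the code comes from Advanced Analytics with Spark: the book builds the expensive StanfordCoreNLP pipeline once per partition via `mapPartitions`, whereas the snippet above rebuilds it for every record inside `plainTextToLemmas`. A sketch of that pattern, assuming `plainTextToLemmas` is refactored to accept the pipeline as a parameter instead of constructing it:

val lemmatized = stringRDD.mapPartitions { iter =>
  // Create the pipeline once per partition, not once per record:
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  iter.map(text => plainTextToLemmas(text, stopWords, pipeline))
}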

No answers yet!

