Java: How to ignore punctuation and symbols appended to a word so that all variants count as the same word?

Posted by h6my8fg2 on 2021-06-30 in Java


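I am writing a program that counts the occurrences of each word in an arbitrary text file. The contents of the file are not known in advance.
Desired output: for example, [book] [book!] [book--] [book?] [book,] [book's] and the like should be treated as the same word when counting.
Current output: book=2, book.=1, book--=1, book?=5, book's=3, book,=2, book=1
What I am really looking for is book=15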

try (Stream<String> fileContents = Files.lines(filePath)) {

    Function<String, Stream<String>> splitIntoWords = line -> Pattern.compile(" ").splitAsStream(line);

    Map<String, Long> wordFrequency = fileContents.flatMap(splitIntoWords)
            .filter(word -> word.trim().length() > 4) // Consider only words with length greater than 4
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

    System.out.println(wordFrequency);
}

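I do not want to hard-code specific symbols and punctuation marks to ignore in the regex, because the exact contents of the file are unknown.
Is there a generic way to achieve this?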
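
One possible generic direction, offered only as a sketch: rather than listing the symbols to ignore, describe what counts as part of a word using the Unicode property classes `\p{L}` (letters) and `\p{N}` (digits), and strip everything else from both ends of each token. The class name `WordCountSketch`, the helper names, and the in-memory sample lines are hypothetical. Treating a trailing `'s` as a possessive to drop is an assumption made so that book's folds into book; it may not suit every text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordCountSketch {

    // Runs of characters that are neither letters nor digits, at the start or end of a token.
    private static final Pattern EDGE_NOISE =
            Pattern.compile("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$");

    // Assumption: a trailing 's (straight or curly apostrophe) is a possessive and is dropped.
    private static final Pattern POSSESSIVE = Pattern.compile("['\u2019]s$");

    // Split on any run of whitespace instead of a single space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Lower-cases the token, trims non-word characters from both ends, then drops a possessive.
    static String normalize(String token) {
        String stripped = EDGE_NOISE.matcher(token.toLowerCase(Locale.ROOT)).replaceAll("");
        return POSSESSIVE.matcher(stripped).replaceAll("");
    }

    // Counts normalized words; tokens that consist only of punctuation are discarded.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(WHITESPACE::splitAsStream)
                .map(WordCountSketch::normalize)
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the lines of the file.
        List<String> lines = Arrays.asList(
                "book book! book-- book? book, book's",
                "\"Book.\" (book)");
        System.out.println(countWords(lines)); // prints {book=8}
    }
}
```

In the original pipeline the same `normalize` step could be mapped over the tokens before the length filter, so the filter sees the cleaned word rather than the word plus its punctuation. Characters inside a token, such as the hyphen in well-known, are left untouched by this sketch.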

Java regex String java-8

Source: https://stackoverflow.com/questions/53655445/how-to-ignore-punctuations-and-symbols-appended-to-a-word-so-that-they-are-all