Java: How do I find word frequencies in a text file?

Posted by kdfy810k on 2023-03-21 in Java


My task is to get the word frequencies for this file:

test_words_file-1.txt：

The quick brown fox
Hopefully245this---is   a quick13947
task&&#%*for you to complete.
But maybe the tASk 098234 will be less
..quicK.
the the the the the the the the the the

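I have been trying to strip the symbols and digits from this file and get the frequency of each word in alphabetical order, and the result is:

I can see that the digits have been removed, but they are still being counted. Can you explain why, and how to fix it?
Also, how can I split "Hopefully245this---is" and store the 3 useful words "hopefully", "this", "is"?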

public class WordFreq2 {
    public static void main(String[] args) throws FileNotFoundException {

        File file = new File("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        Scanner scanner = new Scanner(file); 
        int maxWordLen = 0; 
        String maxWord = null;

        HashMap<String, Integer> map = new HashMap<>();
        while(scanner.hasNext()) {
            String word = scanner.next();
            word = word.toLowerCase();
            // text cleaning 
            word = word.replaceAll("[^a-zA-Z]+", "");

            if(map.containsKey(word)) {
                //if the word already exists
                int count = map.get(word)+1;
                map.put(word,count);
            }
            else {
                // The word is new 
                int count = 1;
                map.put(word, count);

                // Find the max length of Word
                if (word.length() > maxWordLen) {
                    maxWordLen = word.length();
                    maxWord = word;
                }
            }   
        }

        scanner.close();

        //HashMap unsorted, sort 
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.putAll(map);

        for (Map.Entry<String, Integer> entry: sorted.entrySet()) {
            System.out.println(entry);
        }

        System.out.println(maxWordLen+" ("+maxWord+")");
    }

}


Source: https://stackoverflow.com/questions/61988535/how-to-find-word-frequency-in-a-text-file

4 Answers


ubbxdtey1#

First the code. The explanation follows after the code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq2 {

    public static void main(String[] args) {
        Path path = Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        try {
            String text = Files.readString(path); // Java 11+; throws java.io.IOException
            text = text.toLowerCase();
            Pattern pttrn = Pattern.compile("[a-z]+");
            Matcher mtchr = pttrn.matcher(text);
            TreeMap<String, Integer> freq = new TreeMap<>();
            int longest = 0;
            while (mtchr.find()) {
                String word = mtchr.group();
                int letters = word.length();
                if (letters > longest) {
                    longest = letters;
                }
                // Insert the word with a count of 1, or add 1 to its existing count.
                freq.merge(word, 1, Integer::sum);
            }
            String format = "%-" + longest + "s = %2d%n";
            freq.forEach((k, v) -> System.out.printf(format, k, v));
            System.out.println("Longest = " + longest);
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

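Since your sample file is small, I load the entire file contents into a String.
Then I convert the entire String to lower case, since your definition of a word is a series of consecutive alphabetic characters, regardless of case.
The regular expression [a-z]+ searches for one or more consecutive lower-case alphabetic characters. (Remember that the entire String is now lower case.)
Each successive call to method find() finds the next word in the String (according to the above definition of a word, i.e. a consecutive series of lower-case letters of the alphabet).
To count the word frequencies, I use a TreeMap where the map key is the word and the map value is the number of times that word appears in the String. Note that map keys and values cannot be primitives, hence the value is Integer and not int.
If the last word found already appears in the map, its count is incremented.
If the last word found does not appear in the map, it is added with its count set to 1 (one).
Along with adding words to the map, I also count the letters of each word found in order to get the length of the longest word.
After the entire String has been processed, I print the contents of the map, one entry per line, and finally print the number of letters in the longest word found. Note that TreeMap sorts its keys, hence the list of words appears in alphabetical order.
Here is the output: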

a         =  1
be        =  1
brown     =  1
but       =  1
complete  =  1
for       =  1
fox       =  1
hopefully =  1
is        =  1
less      =  1
maybe     =  1
quick     =  3
task      =  2
the       = 12
this      =  1
to        =  1
will      =  1
you       =  1
Longest = 9
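
On the part of the question about digits that are removed yet still counted: a likely explanation, judging from the code in the question, is that a token consisting only of digits or symbols (such as 098234) becomes the empty string after replaceAll, and the empty string is then stored in the map like any other word. The sketch below reproduces this on an in-memory line instead of the file. The class name EmptyTokenDemo and both method names are made up for illustration.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class EmptyTokenDemo {

    // Same cleaning as in the question: whitespace-separated tokens,
    // lower-cased, then every non-letter removed.
    static Map<String, Integer> countLikeQuestion(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase().replaceAll("[^a-zA-Z]+", "");
                counts.merge(word, 1, Integer::sum); // "" is counted like any other key
            }
        }
        return counts;
    }

    // One possible fix that keeps the question's structure: ignore empty results.
    static Map<String, Integer> countSkippingEmpty(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase().replaceAll("[^a-zA-Z]+", "");
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "But maybe the tASk 098234 will be less";
        System.out.println(countLikeQuestion(line).containsKey(""));  // true
        System.out.println(countSkippingEmpty(line).containsKey("")); // false
    }
}
```

Skipping empty strings only addresses the stray count. A token such as Hopefully245this---is would still collapse into the single word hopefullythisis, which is why matching runs of letters, as in the code above, is the more complete fix.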


ki0zmccv2#

How can I split "Hopefully245this---is" and store the 3 useful words "hopefully", "this", "is"?
Use the regex API for such requirements.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "Hopefully245this---is";
        Pattern pattern = Pattern.compile("[A-Za-z]+");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

Hopefully
this
is
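
A possible alternative to the Matcher loop, sketched here under the assumption that a word is any run of ASCII letters, is to split on everything that is not a letter. String.split returns a leading empty string when the input starts with a separator, so the sketch filters empty tokens out. The class name SplitDemo and the method name words are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitDemo {

    // Splits on runs of non-letters, drops the empty string that split()
    // produces when the input starts with a non-letter, and lower-cases the rest.
    static List<String> words(String input) {
        return Arrays.stream(input.split("[^A-Za-z]+"))
                .filter(token -> !token.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("Hopefully245this---is")); // [hopefully, this, is]
        System.out.println(words("..quicK."));              // [quick]
    }
}
```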

To learn more about Java regular expressions, see the documentation of the java.util.regex package.


jw5wzhpr3#

With Java 9 or later, Matcher#results can be used in a stream solution, as follows:

// Needs imports from java.io, java.nio.file, java.util, java.util.regex and java.util.stream.
Pattern pattern = Pattern.compile("[a-zA-Z]+");
try (BufferedReader br = Files.newBufferedReader(Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt"))) {
    br.lines()
            .map(pattern::matcher)
            .flatMap(Matcher::results) // Java 9+: a stream of every match on the line
            .map(matchResult -> matchResult.group(0))
            .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()))
            .forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
} catch (IOException e) {
    System.err.format("IOException: %s%n", e);
}

Output:

a=1
be=1
brown=1
but=1
complete=1
for=1
fox=1
hopefully=1
is=1
less=1
maybe=1
quick=3
task=2
the=12
this=1
to=1
will=1
you=1
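
For trying the stream approach without the file, here is a minimal self-contained sketch that runs the same kind of pipeline over an in-memory String. It assumes Java 9 or later for Matcher#results. The class name StreamWordFreq, the method name count and the sample text are illustrative and not part of the answer above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StreamWordFreq {

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Streams every run of letters in the text, lower-cases it while grouping,
    // and counts the occurrences into a TreeMap (sorted by word).
    static Map<String, Long> count(String text) {
        return WORD.matcher(text)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String text = "The quick brown fox\n..quicK.\nthe the";
        count(text).forEach((word, n) -> System.out.printf("%s=%d%n", word, n));
    }
}
```

Matching over the whole text gives the same words as matching line by line, because a line break is not a letter and therefore always ends a match.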


juzqafwq4#

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
 
public class test
{
  public static void main(String[] args) throws FileNotFoundException
  {
    File f = new File("C:\\Users\\Nandini\\Downloads\\CountFreq.txt");
    Map<String, Integer> counts = new HashMap<String, Integer>();
    // try-with-resources closes the Scanner (and the underlying file) when done
    try (Scanner s = new Scanner(f))
    {
      while (s.hasNext())
      {
        // tokens are split on whitespace only; digits and punctuation are kept
        String word = s.next().toLowerCase();
        if (!counts.containsKey(word))
          counts.put(word, 1);
        else
          counts.put(word, counts.get(word) + 1);
      }
    }
    System.out.println(counts);
  }
}

Output: {the=1, this=3, have=1, is=2, word=1}
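
The Scanner code above splits on whitespace only, so for the file in the question it would still count tokens such as quick13947 as they are. One way to keep the Scanner approach, sketched here as an assumption rather than as part of this answer, is to tell the Scanner that every run of non-letters is a delimiter. Scanner.useDelimiter and Map.merge are standard library calls; the class name DelimiterWordFreq and the method name count are made up.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class DelimiterWordFreq {

    static Map<String, Integer> count(Scanner scanner) {
        Map<String, Integer> counts = new TreeMap<>();
        // Treat every run of non-letters as a separator, so next() returns letter runs only.
        scanner.useDelimiter("[^A-Za-z]+");
        while (scanner.hasNext()) {
            // Insert with a count of 1, or add 1 to the existing count.
            counts.merge(scanner.next().toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Hopefully245this---is   a quick13947\n..quicK.";
        try (Scanner scanner = new Scanner(sample)) {
            System.out.println(count(scanner));
        }
    }
}
```

To read from a file instead of the sample String, the Scanner could be constructed from a File as in the answer above. The TreeMap keeps the words in alphabetical order, which the question asks for.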
