How can I optimize building the inverted file for a full-text index?

Posted by hmmo2u0o on 2021-06-19 in MySQL

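I am writing a simple program that builds a full-text index in my database from a sample of PDF files. The idea is to read each PDF file, extract its words, and store them in a hash set.
Then, in a loop, each word is inserted together with its file path into a table in MySQL, so the words are stored one at a time until the loop finishes. This works fine. However, for large PDF files containing thousands of words, building the index table takes a long time. In other words, extracting the words is fast, but saving each word to the database is slow.
Code: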

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// PDFBox 2.x package names
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class IndexTest {

    public static void main(String[] args) throws Exception {
        File folder = new File("D:\\PDF1");
        File[] listOfFiles = folder.listFiles();

        for (File file : listOfFiles) {
            if (file.isFile()) {
                // Collect the distinct words of the current PDF
                Set<String> uniqueWords = new HashSet<>();
                String path = "D:\\PDF1\\" + file.getName();
                try (PDDocument document = PDDocument.load(new File(path))) {
                    if (!document.isEncrypted()) {
                        PDFTextStripper tStripper = new PDFTextStripper();
                        String pdfFileInText = tStripper.getText(document);
                        String[] lines = pdfFileInText.split("\\r?\\n");
                        for (String line : lines) {
                            String[] words = line.split(" ");
                            for (String word : words) {
                                uniqueWords.add(word);
                            }
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Exception while trying to read pdf document - " + e);
                }

                // Store every word with its file path; a new MysqlAccessIndex is created per word
                for (String word : uniqueWords) {
                    MysqlAccessIndex connection = new MysqlAccessIndex();
                    connection.readDataBase(path, word);
                }

                System.out.println("Completed");
            }
        }
    }
}
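
The loop above splits each line on single spaces, so punctuation stays attached to words and empty strings can end up in the set, which means more rows to insert. A minimal sketch of a stricter tokenizer is shown below. The class `WordExtractor` and the method `uniqueWords` are hypothetical names, and lower-casing plus splitting on anything that is not a letter or digit is an assumption about what the index should contain.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class WordExtractor {

    // Hypothetical helper, not part of the original post.
    // Returns the distinct lower-cased words of the text, in first-seen order.
    public static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        // Split on any run of characters that are neither letters nor decimal digits
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) { // a leading separator produces one empty token
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [full, text, indexing, search, pdfs]
        System.out.println(uniqueWords("Full-text indexing, full TEXT search.\nIndexing PDFs"));
    }
}
```

The whole string returned by `PDFTextStripper.getText` could be passed in directly, since line breaks count as separators here.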

SQL connection code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MysqlAccessIndex {

    private Connection connect = null;
    private PreparedStatement preparedStatement = null;

    public MysqlAccessIndex() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
                        + "user=root&password=root123");
        System.out.print("Connected");
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            preparedStatement = connect
                    .prepareStatement("insert IGNORE into  fulltext_ltat.test_text values (?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        } finally {
            close();
        }
    }

    // close() was not shown in the post; it is assumed here to release
    // the statement and the connection.
    private void close() throws SQLException {
        if (preparedStatement != null) {
            preparedStatement.close();
        }
        if (connect != null) {
            connect.close();
        }
    }
}

Are there any suggestions for improving or optimizing the performance?

xcitsw88 #1

The problem is in the following code:

// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(
        "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");

You are re-creating the connection for every word you insert into the database. A better approach is this:

public MysqlAccess() throws SQLException {
    connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                        + "user=root&password=root");
}

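This way connect is created only when an instance of the class is created. In your main method, create the MysqlAccess instance once, before the loops, and reuse it for every word, so the connection is opened only once. MysqlAccess would look like this: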

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MysqlAccess {

    private Connection connect = null;

    public MysqlAccess() throws SQLException {
        // Set up the connection with the DB once, when the instance is created
        connect = DriverManager.getConnection(
                "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
    }

    public void readDataBase(String path, String word) throws SQLException {
        // Only the statement is closed after each insert; the connection
        // stays open so the next call can reuse it
        try (PreparedStatement preparedStatement = connect.prepareStatement(
                "insert IGNORE into  fulltext_ltat.test_text values (default,?, ?) ")) {
            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        }
    }

    // Call once, after all words have been inserted
    public void close() throws SQLException {
        connect.close();
    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially before the first data set
        while (resultSet.next()) {
            // It is possible to get the columns via name
            // also possible to get the columns via the column number
            // which starts at 1
            // e.g. resultSet.getString(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");

            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);

        }
    }
}
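
Reusing one connection removes the per-word connection cost, but every word still costs one round trip to the server. A possible further step, which goes beyond what the answer above proposes, is JDBC batching inside a single transaction. The sketch below is a minimal illustration under assumptions: the class `BatchIndexWriter`, the method `insertWords`, and the `batchSize` parameter are hypothetical names, and the two-column insert is the one from the question. With MySQL Connector/J, the `rewriteBatchedStatements=true` connection property may additionally let the driver combine batched inserts; whether that helps depends on the driver version and table.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

public class BatchIndexWriter {

    // Same two-column insert as in the question
    private static final String INSERT_SQL =
            "insert IGNORE into fulltext_ltat.test_text values (?, ?)";

    // Hypothetical helper: inserts all words of one file over an already open
    // connection and returns the number of words queued for insertion.
    public static int insertWords(Connection connect, String path,
                                  Collection<String> words, int batchSize) throws SQLException {
        boolean oldAutoCommit = connect.getAutoCommit();
        connect.setAutoCommit(false); // one transaction per file instead of one per word
        int queued = 0;
        try (PreparedStatement ps = connect.prepareStatement(INSERT_SQL)) {
            for (String word : words) {
                ps.setString(1, path);
                ps.setString(2, word);
                ps.addBatch();
                if (++queued % batchSize == 0) {
                    ps.executeBatch(); // send a full batch to the server
                }
            }
            ps.executeBatch(); // send whatever is left over
            connect.commit();
        } catch (SQLException e) {
            connect.rollback();
            throw e;
        } finally {
            connect.setAutoCommit(oldAutoCommit);
        }
        return queued;
    }
}
```

In main this would be called once per PDF, for example `BatchIndexWriter.insertWords(connect, path, uniqueWords, 500)`, with `connect` opened a single time before the file loop. The batch size of 500 is an arbitrary starting point, not a measured optimum.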
