What is the best algorithm for finding duplicates?

f45qwnt8 posted on 2021-06-20 in Mysql

Here is my code:

function checkForDuplicates() {
    $data = $this->input->post();
    $project_id = $data['project_id'];

    $this->db->where('project_id', $project_id);
    $paper = $this->db->get('paper')->result();

    $paper2 = $paper; // copy of the papers array
    $duplicatesCount = 0;

    foreach ($paper as $p) {
        $similarity = null;

        foreach ($paper2 as $p2) {
            if ($p->status_selection_id !== 4 && $p2->status_selection_id !== 4) {
                if ($p->paper_id !== $p2->paper_id) {
                    similar_text($p->title, $p2->title, $similarity);

                    if ($similarity > 90) {
                        $p->status_selection_id = 4;
                        $this->db->where('paper_id', $p->paper_id);
                        $this->db->update('paper', $p);
                        $duplicatesCount++;
                    }
                }
            }
        }
    }

    $data = array(
        'duplicatesCount' => $duplicatesCount,
        'message' => 'Duplicates were found!'
    );
    echo json_encode($data);
}

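similar_text takes 180 seconds to check 1,500 records.
levenshtein takes 101 seconds to check 1,500 records.
if($pp1==$pp2) takes 45 seconds to check 1,500 records.
What is the fastest way to check for duplicate records and change their status?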

cgfeq70w  #1

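Optimization is usually about reducing I/O.
In your case, reducing the number of SQL queries should improve the processing time.
If you need to process a large number of records, split them into chunks. Each chunk should contain a batch of records that fits in memory (RAM).
Retrieve a chunk from the database. Process the chunk (for example with a loop) and use an array to keep track of the changes that need to be made in the DB. Finally, update the database in bulk, using as few queries as possible.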
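The batching step described above can be sketched without any database at all. This is only an illustration under assumptions: `flushInBatches` is a hypothetical helper name, the batch size of 500 is arbitrary, and the `$flush` callback stands in for whatever bulk-write call your database layer provides.

```php
<?php

/**
 * Hand the pending changes to $flush in batches of $batchSize rows.
 * $flush is a hypothetical stand-in for one bulk database write.
 * Returns the number of batches that were flushed.
 */
function flushInBatches(array $updates, int $batchSize, callable $flush): int
{
    $batches = 0;
    foreach (array_chunk($updates, $batchSize) as $batch) {
        $flush($batch); // one round trip per batch instead of one per row
        $batches++;
    }
    return $batches;
}

// Usage: queue 1,500 pending updates, then write them in batches of 500.
$updates = [];
for ($id = 1; $id <= 1500; $id++) {
    $updates[] = ['paper_id' => $id, 'status_selection_id' => 4];
}

$sizes = [];
$batches = flushInBatches($updates, 500, function (array $batch) use (&$sizes): void {
    $sizes[] = count($batch); // a real callback would run the bulk UPDATE here
});

echo $batches, ' batches: ', implode(', ', $sizes), PHP_EOL;
```

With 1,500 queued rows this results in three writes rather than 1,500 single-row updates, which is where most of the I/O saving would come from.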

$data = $this->input->post();
$project_id = $data['project_id'];

$this->db->where('project_id', $project_id);
$paper = $this->db->get('paper')->result();

$paper2 = $paper; // copy of the papers array
$duplicatesCount = 0;

// keep track of updates
$updates = [];

foreach ($paper as $p) {
    $similarity = null;

    foreach ($paper2 as $p2) {
        // cast to int: database drivers often return column values as strings
        if ((int) $p->status_selection_id !== 4 && (int) $p2->status_selection_id !== 4) {
            if ($p->paper_id !== $p2->paper_id) {
                similar_text($p->title, $p2->title, $similarity);

                if ($similarity > 90) {
                    // mark the paper in memory, as the original code did,
                    // so it is not queued or counted a second time
                    $p->status_selection_id = 4;

                    $updates[] = [
                        'paper_id' => $p->paper_id,
                        'status_selection_id' => 4,
                    ];

                    $duplicatesCount++;
                }
            }
        }
    }
}

if ($duplicatesCount > 0) {
    // Send all the updates as one bulk request.
    // Assumption: $this->db is CodeIgniter's Query Builder, whose update_batch()
    // builds the bulk UPDATE and matches rows on the key named in the third argument.
    $this->db->update_batch('paper', $updates, 'paper_id');
}
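Because every flagged row receives the same status, the bulk write could also be a single `UPDATE ... WHERE paper_id IN (...)`. The following is a minimal sketch under assumptions, not a definitive implementation: `findDuplicateIds` and `buildFlagQuery` are hypothetical helper names, the rows are plain objects with `paper_id`, `title`, and `status_selection_id` as in the question, and the sample titles are made up. It compares each pair only once and keeps the earliest paper of each group, which differs from the question's code; since `similar_text` can return a different result when its arguments are swapped, the flagged set may differ slightly.

```php
<?php

/**
 * Return the IDs of papers whose title is more than $threshold percent
 * similar to an earlier, unflagged paper. Each pair is compared once.
 */
function findDuplicateIds(array $papers, float $threshold = 90.0): array
{
    $papers = array_values($papers);
    $count = count($papers);
    $flagged = []; // paper_id => true

    for ($i = 0; $i < $count; $i++) {
        $a = $papers[$i];
        if ((int) $a->status_selection_id === 4 || isset($flagged[$a->paper_id])) {
            continue; // already a duplicate, do not compare against it
        }
        for ($j = $i + 1; $j < $count; $j++) {
            $b = $papers[$j];
            if ((int) $b->status_selection_id === 4 || isset($flagged[$b->paper_id])) {
                continue;
            }
            similar_text($a->title, $b->title, $percent);
            if ($percent > $threshold) {
                $flagged[$b->paper_id] = true; // keep $a, flag the later copy
            }
        }
    }
    return array_keys($flagged);
}

/**
 * Build one UPDATE statement with "?" placeholders for all flagged IDs.
 * Returns [sql, params]; the SQL is empty when there is nothing to update.
 */
function buildFlagQuery(array $ids): array
{
    if ($ids === []) {
        return ['', []];
    }
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));
    $sql = "UPDATE paper SET status_selection_id = 4 WHERE paper_id IN ($placeholders)";
    return [$sql, array_values($ids)];
}

// Usage with made-up sample rows.
$papers = [
    (object) ['paper_id' => 1, 'title' => 'Deep learning for image recognition', 'status_selection_id' => 1],
    (object) ['paper_id' => 2, 'title' => 'Deep learning for image recognition.', 'status_selection_id' => 1],
    (object) ['paper_id' => 3, 'title' => 'A survey of graph databases', 'status_selection_id' => 1],
];

$ids = findDuplicateIds($papers);
[$sql, $params] = buildFlagQuery($ids);
echo $sql, PHP_EOL;
```

If the surrounding code is CodeIgniter, as the `$this->db` calls suggest, the result could be executed with `$this->db->query($sql, $params)`. For very long ID lists it may be worth splitting `$ids` with `array_chunk` and issuing one statement per chunk.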
