Fastest way to count the number of documents for a Lucene term

xkftehaa · asked 2022-11-07 in Lucene · 1 answer · 271 views

I want to count the number of documents that contain a given term in a Lucene field, and I'm wondering what the best and fastest way to do that is. I'll be looking the term up in a single-valued field ("field") of type long (so numeric data, not text!).

Some setup code that precedes each of the examples below:

Directory dirIndex = FSDirectory.open(new File("/path/to/index/")); // Lucene 4.x: open() takes a java.io.File
IndexReader indexReader = DirectoryReader.open(dirIndex);
final BytesRefBuilder bytes = new BytesRefBuilder();
NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(), 0, bytes); // shift 0 = exact-value term

1) Use docFreq() from the index

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef()); // returns false if the term is absent - worth checking in real code
int count = termEnum.docFreq();

2) Search for it

IndexSearcher searcher = new IndexSearcher(indexReader);
TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query,collector);
int count = collector.getTotalHits();
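
As a side note: newer Lucene versions (5.1 and later, if I recall correctly) add an IndexSearcher.count(Query) convenience method that wraps exactly this collector pattern. A one-line sketch; note it is not available in the 4.x API used above:

int count = searcher.count(new TermQuery(new Term("field", bytes.toBytesRef()))); // Lucene 5.1+ only (assumption)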

3) Read the exact match from the index and count the documents one by one

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
Bits liveDocs = MultiFields.getLiveDocs(indexReader); // null if the index has no deletions
DocsEnum docsEnum = termEnum.docs(liveDocs, null);    // postings restricted to live docs
int count = 0;
if (docsEnum != null) {
    while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
    }
}

Which approach is best?

Option 1) wins on shortest code, but it's basically useless if you ever update or delete documents in your index: it counts deleted documents as if they were still there. Few places document this (the official Javadoc does, but answers posted around the web generally don't), so it's something to be aware of. Maybe there is a way around it; otherwise the enthusiasm for this method is somewhat misplaced. Options 2) and 3) do produce correct results, but which should be preferred? Or better still, is there a faster way to do this?
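
One possible way around it, as a minimal sketch under the same Lucene 4.x assumptions as the snippets above: docFreq() is only trustworthy while the reader has no deletions, so guard it with IndexReader.hasDeletions() and fall back to counting live postings otherwise.

// Use the cheap docFreq() when it is exact, else count live postings one by one.
TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
int count = 0;
if (termEnum.seekExact(bytes.toBytesRef())) {        // false means the term is absent
    if (!indexReader.hasDeletions()) {
        count = termEnum.docFreq();                  // exact: no deleted docs to over-count
    } else {
        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        DocsEnum docsEnum = termEnum.docs(liveDocs, null);
        while (docsEnum != null && docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            count++;
        }
    }
}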

iqxoj9l9 · answer #1

Measured by testing, using the index to get at the documents, rather than searching for them (i.e. option 3 rather than option 2), appears to be faster (on average, option 3 was about 8x faster in a 100-document sample I was able to run). I also reversed the test, to make sure that running one before the other doesn't influence the results: it doesn't.
So the searcher seems to introduce considerable overhead for a simple document count; if counting documents for a single term entry is all you need, a lookup in the index is fastest.
The code used for testing (counting the first 100 records in a SOLR index):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.Bits;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {

        Directory dirIndex = FSDirectory.open(new File("/path/to/index/")); // Lucene 4.x takes a File
        IndexReader indexReader = DirectoryReader.open(dirIndex);

        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);

        IndexSearcher searcher = new IndexSearcher(indexReader);

        Bits liveDocs = MultiFields.getLiveDocs(indexReader); // null if the index has no deletions
        final BytesRefBuilder bytes = new BytesRefBuilder(); // reused across iterations
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i=0; i<maxDoc; i++) {
            if (docsPassed==100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i))
                continue;
            Document doc = indexReader.document(i);

            //get longTerm from this doc and convert to BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

            //time before the first test
            long time_start = System.nanoTime();

            //look in the "field" index for longTerm and count its documents (option 3)
            int countViaIndex = 0;
            termEnum.seekExact(bytes.toBytesRef());
            DocsEnum docsEnum = termEnum.docs(liveDocs, null);
            if (docsEnum != null) {
                while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    countViaIndex++;
                }
            }

            //mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            //do a search for longTerm in "field" (option 2)
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            TotalHitCountCollector collector = new TotalHitCountCollector(); // fresh collector: hit counts accumulate per instance
            searcher.search(query, collector);
            int countViaSearch = collector.getTotalHits();

            //end point: test 2 done.
            long time_end = System.nanoTime();

            //write to stdout
            System.out.println(longTerm+"\t"+(time_mid-time_start)+"\t"+(time_end-time_mid));

            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}

The above, slightly modified for use with Lucene 5:

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.Bits;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {

        Directory dirIndex = FSDirectory.open(Paths.get("/path/to/index/")); // Lucene 5.x takes a Path
        IndexReader indexReader = DirectoryReader.open(dirIndex);

        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);

        IndexSearcher searcher = new IndexSearcher(indexReader);

        Bits liveDocs = MultiFields.getLiveDocs(indexReader); // null if the index has no deletions
        final BytesRefBuilder bytes = new BytesRefBuilder(); // reused across iterations
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i=0; i<maxDoc; i++) {
            if (docsPassed==100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i))
                continue;
            Document doc = indexReader.document(i);

            //get longTerm from this doc and convert to BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

            //time before the first test
            long time_start = System.nanoTime();

            //look in the "field" index for longTerm and count its documents (option 3)
            int countViaIndex = 0;
            termEnum.seekExact(bytes.toBytesRef());
            PostingsEnum docsEnum = termEnum.postings(liveDocs, null);
            if (docsEnum != null) {
                while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    countViaIndex++;
                }
            }

            //mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            //do a search for longTerm in "field" (option 2)
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            TotalHitCountCollector collector = new TotalHitCountCollector(); // fresh collector: hit counts accumulate per instance
            searcher.search(query, collector);
            int countViaSearch = collector.getTotalHits();

            //end point: test 2 done.
            long time_end = System.nanoTime();

            //write to stdout
            System.out.println(longTerm+"\t"+(time_mid-time_start)+"\t"+(time_end-time_mid));

            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}
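
For reuse, here is the fastest variant (option 3) packaged as a small helper. This is a sketch under the Lucene 4.x API from the question; the method name countLiveDocsForTerm and its signature are my own, and it additionally needs an import of org.apache.lucene.index.Terms:

// Hypothetical helper wrapping option 3: counts live (non-deleted) documents
// that contain an exact numeric term, under the Lucene 4.x API.
static int countLiveDocsForTerm(IndexReader reader, String field, long value) throws IOException {
    BytesRefBuilder bytes = new BytesRefBuilder();
    NumericUtils.longToPrefixCoded(value, 0, bytes);        // shift 0 = exact-value term
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) return 0;                            // field absent from the index
    TermsEnum termsEnum = terms.iterator(null);
    if (!termsEnum.seekExact(bytes.toBytesRef())) return 0; // term absent
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    DocsEnum docs = termsEnum.docs(liveDocs, null);
    int count = 0;
    while (docs != null && docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
    }
    return count;
}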
