Lucene: how do I get the positions of a matched query?

6mw9ycah  posted on 2021-08-20  in Java

I have a QueryParser, and I want to find the string "war force" in my text:

TextWord[0]: 2003
TextWord[1]: 09
TextWord[2]: 22T19
TextWord[3]: 01
TextWord[4]: 14Z
TextWord[5]: Book0
TextWord[6]: WEAPONRY
TextWord[7]: NATO2
TextWord[8]: Bar
TextWord[9]: WEAPONRY
TextWord[10]: State
TextWord[11]: WEAPONRY
TextWord[12]: 123
TextWord[13]: War
TextWord[14]: WORD1
TextWord[15]: Force
TextWord[16]: And
TextWord[17]: Book4
TextWord[18]: Book
TextWord[19]: WEAPONRY
TextWord[20]: Book6
TextWord[21]: Terrorist.
TextWord[22]: And
TextWord[23]: WEAPONRY
TextWord[24]: 18
TextWord[25]: 31
TextWord[26]: state
TextWord[27]: AND

The match succeeds when I use a phrase slop of 1 (that is: "War" WORD1 "Force").
I can find the position of "War" or of "Force" individually:

DirectoryReader reader = DirectoryReader.open(this.memoryIndex);
IndexSearcher searcher = new IndexSearcher(reader);

QueryParser queryParser = new QueryParser("tags", new StandardAnalyzer());
Query query = queryParser.parse("\"War Force\"~1");
TopDocs results = searcher.search(query, 1);

for (ScoreDoc scoreDoc : results.scoreDocs) {

    // term vectors of the "tags" field for the matching document
    Fields termVs = reader.getTermVectors(scoreDoc.doc);
    Terms f = termVs.terms("tags");

    // StandardAnalyzer lowercases terms, so look up the lowercased form
    String searchTerm = "War".toLowerCase();
    BytesRef ref = new BytesRef(searchTerm);

    TermsEnum te = f.iterator();
    PostingsEnum docsAndPosEnum = null;
    if (te.seekExact(ref)) {

        docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
        int nextDoc = docsAndPosEnum.nextDoc();
        assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
        final int fr = docsAndPosEnum.freq();
        final int p = docsAndPosEnum.nextPosition();
        final int o = docsAndPosEnum.startOffset();

        System.out.println("Word: " + ref.utf8ToString());
        System.out.println("Position: " + p + ", startOffset: " + o + " length: " + ref.length + " Freq: " + fr);
        // remaining positions, if the term occurs more than once
        for (int iter = 1; iter < fr; iter++) {
            System.out.println("Position: " + docsAndPosEnum.nextPosition());
        }
    }

    System.out.println("Finish");
}

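But I cannot find the position of the match for my whole query, "War Force". How can I get the positions of the query result that was found?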

z31licg0


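There may be more than one way to do this, but I suggest using FastVectorHighlighter, because it gives you access to position and offset data.
Index requirements
To use this approach, the field you search must have been indexed with term vector data stored: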

final String fieldName = "body";
// a shorter version of the input data in the question, for testing:
final String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";

FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);

// "writer" is assumed to be an already-open IndexWriter
Document doc = new Document();
doc.add(new Field(fieldName, content, fieldType));
writer.addDocument(doc);

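(If you are not already capturing term vectors, this can significantly increase the size of your index data.)
Library requirements
The FastVectorHighlighter is part of the lucene-highlighter library: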

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.9.0</version>
</dependency>

Search example
Assume the following query:

final String searchTerm = "\"War Force\"~1";

We expect this to find War WORD1 Force in our test data.
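To see why a slop of 1 is enough here, the following is a rough sketch that uses plain string handling rather than Lucene. It only covers two terms appearing in the written order, and it splits on whitespace instead of running an analyzer, so it is a simplification of how Lucene evaluates slop. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SlopSketch {

    // Returns true if "first" is followed by "second" with at most "slop"
    // other words in between. In-order matches only; this is a simplified
    // model, not Lucene's actual slop computation.
    static boolean matchesInOrder(String text, String first, String second, int slop) {
        List<String> words = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
        String a = first.toLowerCase(Locale.ROOT);
        String b = second.toLowerCase(Locale.ROOT);
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equals(a)) {
                continue;
            }
            // j - i - 1 is the number of words sitting between the two terms
            for (int j = i + 1; j < words.size() && j - i - 1 <= slop; j++) {
                if (words.get(j).equals(b)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        System.out.println(matchesInOrder(content, "War", "Force", 1)); // one word in between is allowed
        System.out.println(matchesInOrder(content, "War", "Force", 0)); // exact adjacency is required
    }
}
```

With the test data, WORD1 is the single word between the two terms, so the phrase matches with a slop of 1 but not with a slop of 0.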
The first part of the process performs a standard query execution, using the classic query parser:

Directory dir = FSDirectory.open(Paths.get(indexPath));
try ( DirectoryReader dirReader = DirectoryReader.open(dir)) {
    IndexSearcher indexSearcher = new IndexSearcher(dirReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(fieldName, analyzer);
    Query query = parser.parse(searchTerm);
    TopDocs topDocs = indexSearcher.search(query, 100);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (ScoreDoc hit : hits) {
        handleHit(hit, query, dirReader, indexSearcher);
    }
}

The handleHit() method (shown below) is where we use the FastVectorHighlighter.
If you only want to perform highlighting (with no need for position/offset data), you can use:

FastVectorHighlighter fvh = new FastVectorHighlighter();
FieldQuery fieldQuery = fvh.getFieldQuery(query, dirReader);
String fragment = fvh.getBestFragment(fieldQuery, dirReader, docId, fieldName, fragCharSize);

But to access the extra data we need, you can do the following:

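boolean phraseHighlight = true;
boolean fieldMatch = true;
FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);

FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);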
FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
FragListBuilder fragListBuilder = new SimpleFragListBuilder();
FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
        fragListBuilder, fragmentsBuilder);

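This sets up a FastVectorHighlighter and, alongside it, a FieldPhraseList. The FieldPhraseList is populated from the document's term vectors when it is constructed.
The getBestFragment call now becomes: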

// use whatever you want for these settings:
int fragCharSize = 100;
int maxNumFragments = 100;
String[] preTags = new String[]{"-->"};
String[] postTags = new String[]{"<--"};

Encoder encoder = new DefaultEncoder();
// the fragments string array contains the highlighted results:
String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
        fieldName, fragCharSize, maxNumFragments, fragListBuilder,
        fragmentsBuilder, preTags, postTags, encoder);

Finally, we can use fieldPhraseList to access the data we need:

// the following gives you access to positions and offsets:
fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
    int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
    int phraseEndOffset = weightedPhraseInfo.getEndOffset();     // 34
    weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
        String term = termInfo.getText();                // "war"  "force"
        int termPosition = termInfo.getPosition() + 1;    // 4      6
        int termStartOffset = termInfo.getStartOffset(); // 19     29
        int termEndOffset = termInfo.getEndOffset();     // 22     34
    });
});

The phraseStartOffset and phraseEndOffset values are character counts that tell us where the whole phrase sits in the source document:

State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY

So, in our case, this is the string between offsets 19 and 34 (offset 0 is the position to the left of the first "S").
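As a quick check of those numbers, here is a small sketch that needs no Lucene at all. It assumes the offsets are plain character indexes into the original string, with the end offset exclusive, which is how the values above line up with the test data. The class and method names are only for illustration.

```java
final class OffsetSlice {

    // Cuts the text between a start offset (inclusive) and an end offset (exclusive).
    static String slice(String source, int startOffset, int endOffset) {
        return source.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        System.out.println(slice(content, 19, 34)); // the whole phrase
        System.out.println(slice(content, 19, 22)); // first term
        System.out.println(slice(content, 29, 34)); // second term
    }
}
```

Slicing this way gives back the text as it was written, with its original casing, even though the indexed terms are lowercased.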
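Then, for each individual term in the search query ("war" and "force"), we can access its offsets as well as its word position (termPosition). Position 0 is the first word, so I add 1 to this index, to show "war" at position 4 and "force" at position 6 in the original document: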

1     2        3   4   5     6     7   8     9    10
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
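The numbering above can be reproduced with a rough sketch like the one below. It splits on whitespace, which happens to give the same word boundaries as the analyzer for this particular test string; that is an assumption and will not hold for every input. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class WordPositions {

    // Returns the 1-based positions of every occurrence of "word" (case-insensitive).
    static List<Integer> positionsOf(String text, String word) {
        String[] words = text.split("\\s+");
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equalsIgnoreCase(word)) {
                positions.add(i + 1); // zero-based index plus one, as in the answer
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        System.out.println("war:   " + positionsOf(content, "war"));
        System.out.println("force: " + positionsOf(content, "force"));
    }
}
```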

Here is the full code, for reference:

import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexHighlighterDemo {

    final String indexPath = "./index";
    final String fieldName = "body";
    final String searchTerm = "\"War Force\"~1";

    public void doDemo() throws IOException, ParseException {

        Directory dir = FSDirectory.open(Paths.get(indexPath));
        try ( DirectoryReader dirReader = DirectoryReader.open(dir)) {
            IndexSearcher indexSearcher = new IndexSearcher(dirReader);
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser(fieldName, analyzer);
            Query query = parser.parse(searchTerm);

            System.out.println();
            System.out.println("Search term: [" + searchTerm + "]");
            System.out.println("Parsed query: [" + query.toString() + "]");

            TopDocs topDocs = indexSearcher.search(query, 100);

            ScoreDoc[] hits = topDocs.scoreDocs;
            for (ScoreDoc hit : hits) {
                handleHit(hit, query, dirReader, indexSearcher);
            }
        }
    }

    private void handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader,
            IndexSearcher indexSearcher) throws IOException {

        boolean phraseHighlight = true;
        boolean fieldMatch = true;
        FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);

        FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
        FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
        FragListBuilder fragListBuilder = new SimpleFragListBuilder();
        FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
        FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
                fragListBuilder, fragmentsBuilder);

        // use whatever you want for these settings:
        int fragCharSize = 100;
        int maxNumFragments = 100;
        String[] preTags = new String[]{"-->"};
        String[] postTags = new String[]{"<--"};

        Encoder encoder = new DefaultEncoder();
        // the fragments string array contains the highlighted results:
        String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
                fieldName, fragCharSize, maxNumFragments, fragListBuilder,
                fragmentsBuilder, preTags, postTags, encoder);

        // the following gives you access to positions and offsets:
        fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
            int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
            int phraseEndOffset = weightedPhraseInfo.getEndOffset();     // 34
            weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
                String term = termInfo.getText();                // "war"  "force"
                int termPosition = termInfo.getPosition() + 1;    // 4      6
                int termStartOffset = termInfo.getStartOffset(); // 19     29
                int termEndOffset = termInfo.getEndOffset();     // 22     34
            });
        });

        // get the scores, also, if needed:
        BigDecimal score = new BigDecimal(String.valueOf(hit.score))
                .setScale(3, RoundingMode.HALF_EVEN);
        Document hitDoc = indexSearcher.doc(hit.doc);
    }

}
