Lucene SpanNearQuery partial matching

Posted by yacmzcpb on 2022-11-07 in Lucene


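Given a document {'foo', 'bar', 'baz'}, I want to match it using a SpanNearQuery with the tokens {'baz', 'extra'}, but this fails.
How can I get around this?
Here is a sample test (using Lucene 2.9.1), with the following results: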

givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}


Source: https://stackoverflow.com/questions/2021839/lucene-spannearquery-partial-matching

1 answer


avkwfej41#

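SpanNearQuery lets you find terms that occur within a certain distance of each other.
Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):
Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters). You could use the following SpanQuery: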

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

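(Source: lucidimagination.com)
In this sample text, Lucene is within 3 positions of Doug.
But for your example, the only match I can see is that both your query and the target document contain "cd" (I am assuming that all of those terms are in a single field). In that case, you don't need any special query type. Using the standard mechanisms, you will get some non-zero weighting, based on the fact that both contain the same term in the same field.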
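To make the failing test concrete, here is a minimal sketch in plain Java of how unordered "near" matching behaves. It is a toy model, not Lucene's implementation: the class `SpanNearModel` and its `matches` method are hypothetical names, and it assumes each term occurs at most once in the document. The point it illustrates is that every clause must be present before slop is even considered, so a query containing a term that is absent from the document can never match, whatever the slop.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of unordered "near" matching. NOT Lucene's implementation.
// Assumption: each term occurs at most once in the document.
final class SpanNearModel {

    /**
     * Returns true only if every query term occurs in the document and the
     * window covering all of them contains at most `slop` positions that
     * are not occupied by a query term.
     */
    static boolean matches(List<String> docTokens, List<String> queryTerms, int slop) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (String term : queryTerms) {
            int pos = docTokens.indexOf(term);
            if (pos < 0) {
                // A required clause is missing, so no span can be formed.
                return false;
            }
            min = Math.min(min, pos);
            max = Math.max(max, pos);
        }
        // Positions inside the window that are not taken by a query term.
        int gaps = (max - min + 1) - queryTerms.size();
        return gaps <= slop;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("foo", "bar", "baz");
        // All three terms are present: prints true
        System.out.println(matches(doc, Arrays.asList("foo", "bar", "baz"), Integer.MAX_VALUE));
        // "extra" is absent: prints false even with unlimited slop
        System.out.println(matches(doc, Arrays.asList("baz", "extra"), Integer.MAX_VALUE));
    }
}
```

In this model, as in the question's fourth test, raising the slop does not help: the missing term rules out a match before distance is checked.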

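Edit 3 - in response to your latest comment: you cannot use SpanNearQuery for anything other than its intended purpose, which is to find out whether multiple terms in a document occur within a certain number of positions of each other. I can't tell what your specific use case or expected result is (feel free to post it), but in the last case, if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.
Edit 4 - now that you have posted your use case, I understand what you want to do. Here is how you can do it: use a BooleanQuery, as mentioned above, to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.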

So, in text form, the query would look like this:

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

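(As an example, this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents in which the terms "BAZ" and "EXTRA" occur within 100 positions of each other. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's OK, because in the next section I show you how to build this using the API.)
Programmatically, you would construct it as follows: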

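import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Declare as BooleanQuery (not Query), since add() is defined on BooleanQuery
BooleanQuery top = new BooleanQuery();

// Construct the terms since they will be used more than once.
// Note: terms passed to TermQuery/SpanTermQuery are not analyzed. If the field
// was indexed with StandardAnalyzer (which lowercases), use "baz" and "extra"
// and the actual field name here.
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");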

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm), 
                                                new SpanTermQuery(extraTerm) }, 
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
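
To show the effect of this combination on ranking, here is a minimal sketch in plain Java. It is only a toy model: the class `PartialMatchModel`, the method `score`, and the numeric scores (1 per matching term, plus the boost when the ordered proximity condition holds) are assumptions made for illustration and are not Lucene's scoring formula. It also assumes each term occurs at most once per document.

```java
import java.util.Arrays;
import java.util.List;

// Toy ranking model for "first OR second OR (first near second)^boost".
// The scores are made up for illustration; this is NOT Lucene's scoring.
// Assumption: each term occurs at most once in the document.
final class PartialMatchModel {

    static float score(List<String> docTokens, String first, String second,
                       int slop, float boost) {
        int a = docTokens.indexOf(first);
        int b = docTokens.indexOf(second);
        float score = 0f;
        if (a >= 0) {
            score += 1f; // optional clause: first term is present
        }
        if (b >= 0) {
            score += 1f; // optional clause: second term is present
        }
        // Proximity clause: both terms present, `first` before `second`,
        // with at most `slop` other positions between them.
        if (a >= 0 && b > a && (b - a - 1) <= slop) {
            score += boost;
        }
        return score;
    }

    public static void main(String[] args) {
        // Partial match, only "baz" is present: prints 1.0
        System.out.println(score(Arrays.asList("foo", "bar", "baz"), "baz", "extra", 100, 5f));
        // Both terms present, in order and adjacent: prints 7.0
        System.out.println(score(Arrays.asList("baz", "extra"), "baz", "extra", 100, 5f));
        // Neither term present: prints 0.0
        System.out.println(score(Arrays.asList("foo"), "baz", "extra", 100, 5f));
    }
}
```

A document containing only one of the terms still gets a non-zero score, which is the partial match asked for, while documents with both terms close together rank higher.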

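Hope this helps! In the future, try to start by posting exactly what results you expect. Even if it is obvious to you, it may not be to the reader, and being explicit avoids having to go back and forth so many times.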
