Lucene: using DocValues to sort an index by an integer?

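I am implementing an autocomplete mechanism for a text field with Lucene. It supports multiple languages and multiple groups of options, each with roughly 2k to 5k distinct values.
At the moment I query all hits and sort them by hand on an integer value. Since that is inefficient, I need to build an index that uses doc values. I understand the theory, but I cannot find a good code snippet to make it work. I bought and read two books, and they either do not cover it or cover it poorly (a short section with one line of code).
My goal is to index an integer value for each document and sort by it in descending order.
I would also like to ask whether I have missed a major documentation source. The Lucene documentation is neither comprehensive nor easy to navigate. I used Lucene in Action, but the book is a decade old, and recent changes to the Lucene API have been rather dramatic.
For example:

{name: "A1", number: 1000}
{name: "A2", number: 1001}
{name: "A3", number: 990}
{name: "B1", number: 300}

=> Query: A* + sort by number + top 2 => A3, A1
Summary: my code currently fetches all documents, then sorts and trims (limits) them; I would prefer Lucene to do this.
The implementation is in Java. Since I only work with a small amount of information, but in several languages, I create the index with RAMDirectory (yes, I know it is deprecated, but it works) and add each document through a standard IndexWriter with a StandardAnalyzer.
As far as I understand the requirement, I need to define and use a field that is stored column-wise so that Lucene can sort on it. I tried for several hours, then gave up and fell back to fetching everything, looking up the data in memory, and sorting and trimming there. It does the job, but it is not satisfying.
So all that is needed is to add an integer field to the index so that the sorting happens inside Lucene.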

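You are right that Lucene's documentation can be a bit challenging: the last edition of Lucene in Action covers version 3.0, and Lucene 4.0 brought very significant changes. I did find a book that covers Lucene 4, the Lucene 4 Cookbook. It does not go into much depth on this topic, with coverage limited to about a page, but it does provide an example.
A great source for learning Lucene is the unit tests that ship with the project. That is where I found the example below. It shows how to store your number as a numeric DocValue and then sort by it. Unit tests are usually not suitable for cutting and pasting into an application, but they show well how a feature is meant to be used. For example, this unit test uses RandomIndexWriter, whereas you would use IndexWriter.
This sorting approach relies on DocValues. One thing to keep in mind about DocValues is that they are not stored with the document; instead, all values of a given DocValues field are stored together. That is what makes them particularly well suited to sorting. However, when you read a document back, the value will not be among its fields unless you *also* store it as a field in the document. That is why the example stores the value twice: once as a NumericDocValuesField and once as a StringField.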

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

 /**Tests sorting on type int */
  public void testInt() throws IOException {
    Directory dir = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random(), dir);

    Document doc = new Document();
    doc.add(new NumericDocValuesField("value", 300000));
    doc.add(newStringField("value", "300000", Field.Store.YES));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new NumericDocValuesField("value", -1));
    doc.add(newStringField("value", "-1", Field.Store.YES));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new NumericDocValuesField("value", 4));
    doc.add(newStringField("value", "4", Field.Store.YES));
    writer.addDocument(doc);

    IndexReader ir = writer.getReader();
    writer.close();

    IndexSearcher searcher = newSearcher(ir);
    Sort sort = new Sort(new SortField("value", SortField.Type.INT));

    TopDocs td = searcher.search(new MatchAllDocsQuery(), 10, sort);
    assertEquals(3, td.totalHits.value);
    // numeric order
    assertEquals("-1", searcher.doc(td.scoreDocs[0].doc).get("value"));
    assertEquals("4", searcher.doc(td.scoreDocs[1].doc).get("value"));
    assertEquals("300000", searcher.doc(td.scoreDocs[2].doc).get("value"));

    ir.close();
    dir.close();
  }

Source: Lucene unit tests on GitHub
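To make the point about DocValues living apart from the stored document a little more concrete, here is a toy model in plain Java. It is only a sketch of the idea, not how Lucene implements it, and every name in it (DocValuesColumnSketch, add, topDocIds, stored) is made up for illustration. The "column" holds one number per document ID and is all the sort needs; the stored fields only hold what was explicitly stored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Toy model only: mimics row-wise stored fields versus a column-wise
// doc-values field. This is not Lucene code.
public class DocValuesColumnSketch {
    // "Stored fields": one map per document, what you get back when loading a hit.
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    // "Doc values": one number per document ID, kept together as a column.
    private final List<Long> numberColumn = new ArrayList<>();

    // Stores the name with the document, but puts the number only in the column.
    public int add(String name, long number) {
        Map<String, String> stored = new HashMap<>();
        stored.put("name", name);
        storedFields.add(stored);
        numberColumn.add(number);
        return storedFields.size() - 1; // document ID
    }

    // Sorts document IDs by the column alone; no stored document is touched.
    public List<Integer> topDocIds(int n, boolean descending) {
        Comparator<Integer> byNumber = Comparator.comparingLong(id -> numberColumn.get(id));
        if (descending) {
            byNumber = byNumber.reversed();
        }
        return IntStream.range(0, numberColumn.size())
                .boxed()
                .sorted(byNumber)
                .limit(n)
                .collect(Collectors.toList());
    }

    // Reads a stored field; returns null for anything that was not stored.
    public String stored(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        DocValuesColumnSketch index = new DocValuesColumnSketch();
        index.add("A1", 1000L);
        index.add("A2", 1001L);
        index.add("A3", 990L);
        for (int id : index.topDocIds(2, true)) {
            System.out.println(index.stored(id, "name") + " number=" + index.stored(id, "number"));
        }
    }
}
```

In this model the sort works, yet reading "number" back from a hit gives null, which mirrors why the unit test above adds the value a second time as a stored field.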
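Unfortunately, I am a C# developer rather than a Java developer, so it is hard for me to write a Java example closer to what you asked for, as I have no easy way to test Java Lucene code. I have, however, provided a C# example below using Lucene.NET, which I think you will find easy to translate to Java.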

public void NumericDocValueSort() {

    Analyzer standardAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
    Directory indexDir = new RAMDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(LuceneVersion.LUCENE_48, standardAnalyzer);

    IndexWriter indexWriter = new IndexWriter(indexDir, iwc);

    Document doc = new Document();

    doc.Add(new TextField("name", "A1", Field.Store.YES));
    // Uncomment the next line to be able to retrieve the value from the doc later; can be done for every doc.
    //doc.Add(new StoredField("number", 1000L));
    doc.Add(new NumericDocValuesField("number", 1000L));
    indexWriter.AddDocument(doc);

    doc.Fields.Clear();
    doc.Add(new TextField("name", "A2", Field.Store.YES));
    doc.Add(new NumericDocValuesField("number", 1001L));
    indexWriter.AddDocument(doc);

    doc.Fields.Clear();
    doc.Add(new TextField("name", "A3", Field.Store.YES));
    doc.Add(new NumericDocValuesField("number", 990L));
    indexWriter.AddDocument(doc);

    doc.Fields.Clear();
    doc.Add(new TextField("name", "A4", Field.Store.YES));
    doc.Add(new NumericDocValuesField("number", 300L));
    indexWriter.AddDocument(doc);

    indexWriter.Commit();

    IndexReader reader = indexWriter.GetReader(applyAllDeletes: true);
    IndexSearcher searcher = new IndexSearcher(reader);

    Sort sort;
    TopDocs docs;
    SortField sortField = new SortField("number", SortFieldType.INT64);
    sort = new Sort(sortField);

    docs = searcher.Search(new MatchAllDocsQuery(), 1000, sort);

    foreach (ScoreDoc scoreDoc in docs.ScoreDocs) {
        Document curDoc = searcher.Doc(scoreDoc.Doc);
        string name = curDoc.Get("name");
    }

    reader.Dispose();               // reader.close() in Java
}

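I ran this code on my machine, and the foreach loop returns the documents in the correct numeric order. Note that I use NumericDocValuesField rather than SortedNumericDocValuesField because the latter is only needed when a document contains multiple values for the field. Your example does not, so NumericDocValuesField is what you need here.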
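For what it is worth, a Java version of the C# example might look roughly like the sketch below. It assumes a Lucene 8.x or 9.x release on the classpath (newer releases may need a different call to load stored fields), uses ByteBuffersDirectory in place of the deprecated RAMDirectory, and indexes the name as a StringField so that the prefix "A" matches as written (with a TextField and StandardAnalyzer the terms would be lowercased). The class name and the helpers addDoc and topNames are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class NumericDocValuesSortExample {

    // Adds one document: an exact-match name plus a numeric value for sorting.
    static void addDoc(IndexWriter writer, String name, long number) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("name", name, Field.Store.YES)); // not analyzed
        doc.add(new NumericDocValuesField("number", number));    // column used by the sort
        doc.add(new StoredField("number", number));              // optional: read the value back from a hit
        writer.addDocument(doc);
    }

    // Indexes the sample data from the question, then returns the names of the
    // top n documents whose name starts with the prefix, sorted by "number".
    static List<String> topNames(String prefix, int n, boolean descending) throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                addDoc(writer, "A1", 1000L);
                addDoc(writer, "A2", 1001L);
                addDoc(writer, "A3", 990L);
                addDoc(writer, "B1", 300L);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // The third argument reverses the order: true means descending.
                Sort sort = new Sort(new SortField("number", SortField.Type.LONG, descending));
                TopDocs hits = searcher.search(new PrefixQuery(new Term("name", prefix)), n, sort);
                List<String> names = new ArrayList<>();
                for (ScoreDoc sd : hits.scoreDocs) {
                    names.add(searcher.doc(sd.doc).get("name"));
                }
                return names;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topNames("A", 2, true));
        System.out.println(topNames("A", 2, false));
    }
}
```

With the sample data, the descending sort should give A2, A1, while the ascending sort gives A3, A1, which is the result shown in the question. Lucene does the sorting and the top-2 cut itself, so only two stored documents are ever loaded.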
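People are often confused by the word "Sorted" in the name SortedNumericDocValuesField. In this context it means that if a field holds multiple values in a document, those values are kept in sorted order within that document's field. It has nothing to do with listing documents in sorted order. Yes, I know, it is not the best naming and is a bit confusing. Anyway, I hope this helps you solve your problem.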
