How can I get the list of unique terms from a specific field in Lucene?

Posted by lawou6xi on 2022-11-07 in Lucene

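I have an index built from a large corpus, with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index, based on this field. Does anyone know how I can do that with Lucene in Java?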

lucene

Source: https://stackoverflow.com/questions/8910008/how-can-i-get-the-list-of-unique-terms-from-a-specific-field-in-lucene

6 Answers

cgyqldqp1#

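If you are using the Lucene 4.0 API, you need to get the fields out of the index reader. Fields then offers a way to get the terms for each field in the index. Here is an example of how to do that: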

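Fields fields = MultiFields.getFields(indexReader);
// terms() returns null when the field does not exist in the index
Terms terms = fields.terms("field");
if (terms != null) {
    TermsEnum iterator = terms.iterator(null);
    BytesRef byteRef = null;
    while ((byteRef = iterator.next()) != null) {
        // decode only the valid slice of the underlying byte array
        String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
    }
}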

Finally, with newer versions of Lucene, you can get the string directly from the BytesRef by calling:

byteRef.utf8ToString();

instead of

new String(byteRef.bytes, byteRef.offset, byteRef.length);
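One reason to prefer utf8ToString() is the character set: Lucene stores text terms as UTF-8 bytes, while the three-argument String constructor decodes with the platform default charset. The sketch below uses only the JDK to show the same slice decoding with the charset made explicit. The class name Utf8SliceDemo, the method decodeSlice, and the padded buffer are hypothetical, chosen for illustration; they are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SliceDemo {
    // Decodes the slice [offset, offset + length) of a byte array as UTF-8,
    // independent of the platform default charset.
    static String decodeSlice(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] term = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Simulate a larger shared buffer: two padding bytes on each side of the term
        byte[] buffer = new byte[term.length + 4];
        System.arraycopy(term, 0, buffer, 2, term.length);
        String decoded = decodeSlice(buffer, 2, term.length);
        System.out.println(decoded.equals("caf\u00e9")); // prints "true"
    }
}
```

Passing the offset and length matters because the byte array behind a term reference can be larger than the term itself.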

If you want to get the document frequency, you can do:

int docFreq = iterator.docFreq();

neskvpey2#

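You are looking for term vectors (the set of all the words in the field and the number of times each word is used, excluding stop words). You would call IndexReader's getTermFreqVector(docid, field) for each document in the index and populate a HashSet with the results.
An alternative is to use terms() and pick only the terms for the field you are interested in: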

IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
// terms() enumerates the terms of every field, so filter by field name
while (terms.next()) {
    final Term term = terms.term();
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
terms.close();

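This is not the optimal solution, because you read and then discard the terms of all the other fields. Lucene 4 has a class, Fields, whose terms(field) returns the terms of a single field only.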

xj3cbfub3#

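The same result, just a bit cleaner, comes from using LuceneDictionary from the lucene-suggest package. It handles a field that contains no terms by returning BytesRefIterator.EMPTY. That will save you an NPE :)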

LuceneDictionary ld = new LuceneDictionary(indexReader, "field");
BytesRefIterator iterator = ld.getWordsIterator();
BytesRef byteRef = null;
while ((byteRef = iterator.next()) != null) {
    String term = byteRef.utf8ToString();
}

osh3o9ms4#

As of Lucene 7+, the above and some related links are obsolete.
Here is an up-to-date version:

// IndexReader has leaves (one per segment), you'll iterate through those
int leavesCount = reader.leaves().size();
final String fieldName = "content";

for (int l = 0; l < leavesCount; l++) {
  System.out.println("l: " + l);
  // specify the field here; terms() returns null if this leaf has no such field
  Terms leafTerms = reader.leaves().get(l).reader().terms(fieldName);
  if (leafTerms == null) {
    continue;
  }
  TermsEnum terms = leafTerms.iterator();
  // this stops at 20 just to sample the head, or earlier if the terms run out
  BytesRef bytes;
  for (int i = 0; i < 20 && (bytes = terms.next()) != null; i++) {
    // copy the bytes before wrapping them in a Term
    final Term content = new Term(fieldName, BytesRef.deepCopyOf(bytes));
    System.out.println("i: " + i + ", term: " + content);
  }
}
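Note that each leaf (segment) has its own term dictionary, so the same term can show up in more than one leaf. To get a list that is unique across the whole index, the per-leaf results still need to be merged. Here is a minimal sketch of that merge step using only the JDK; the class SegmentTermMerge, the method uniqueTerms, and the hard-coded lists are hypothetical stand-ins for what each leaf's TermsEnum would yield.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class SegmentTermMerge {
    // Merges per-segment term lists into one sorted set of unique terms.
    static SortedSet<String> uniqueTerms(List<List<String>> perSegmentTerms) {
        SortedSet<String> unique = new TreeSet<>();
        for (List<String> segment : perSegmentTerms) {
            unique.addAll(segment); // duplicates across segments collapse here
        }
        return unique;
    }

    public static void main(String[] args) {
        // Hypothetical terms from two segments; "index" appears in both
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("apple", "index", "lucene"),
            Arrays.asList("index", "term"));
        System.out.println(uniqueTerms(segments)); // prints [apple, index, lucene, term]
    }
}
```

A TreeSet keeps the merged terms sorted; a HashSet would also work if order does not matter.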

c9x0cxw05#

The answers that use TermEnum with terms.next() have a subtle bug: the TermEnum already points at the first term, so while(terms.next()) causes the first term to be skipped.
Use a for loop instead:

TermEnum terms = reader.terms();
for(Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    // do something with the term
}

To modify the code from the accepted answer:

IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}

dvtswwa36#

A slight variation on @pokeRex110's solution (tested with Lucene 9.3.0):

Terms terms = MultiTerms.getTerms(indexReader, "title");
if (terms != null) {
    TermsEnum iter = terms.iterator();
    BytesRef byteRef = null;
    while ((byteRef = iter.next()) != null) {
        System.out.printf("%s (freq=%s)%n", 
            byteRef.utf8ToString(), 
            iter.docFreq()
        );
    }
}
