How do I get a list of all tokens from a Lucene 8.6.1 index using PyLucene?

eyh26e7m  posted on 2022-11-07  in  Lucene

I got some direction from this question. I first built the index as follows.

import os

import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator

index_path = "./index"

lucene.initVM()

analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
# Append only if the index directory already exists and is non-empty
if os.path.isdir(index_path) and os.listdir(index_path):
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)

store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)

doc = Document()
doc.add(Field("docid", "1",  TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)

writer.close()
store.close()

Then I tried to get all the terms in the index for one field, as follows.

store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)

Attempt 1: call next(), as is done in this question. It appears to be a method of BytesRefIterator, which TermsEnum implements.

for lrc in reader.leaves():
    terms = lrc.reader().terms('title')
    terms_enum = terms.iterator()
    while terms_enum.next():
        term = terms_enum.term()
        print(term.utf8ToString())

However, I cannot seem to access the next() method.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while terms_enum.next():
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

AttributeError: 'TermsEnum' object has no attribute 'next'

Attempt 2: change the while loop as suggested in the comments on this question.

while next(terms_enum):
    term = terms_enum.term()
    print(term.utf8ToString())

However, Python does not seem to recognize TermsEnum as an iterator.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while next(terms_enum):
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

TypeError: 'TermsEnum' object is not an iterator

I know my question could be answered as suggested in this question. So I suppose what I am really asking is: how do I get all the terms in a TermsEnum?

pgx2nnw8  #1

I found that the following works. It is taken from here: test_FieldEnumeration() in the test_Pylucene.py file, which is under pylucene-8.6.1/test3/

# View the TermsEnum as a BytesRefIterator so that Python can loop over it
for term in BytesRefIterator.cast_(terms_enum):
    print(term.utf8ToString())

I would be happy to accept an answer that explains more than this one does.
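Some background that may help: the Lucene Javadoc describes BytesRefIterator.next() as returning the next BytesRef, or null once the end is reached. Exhaustion is therefore signaled by a null return value rather than by raising StopIteration, which is what Python's built-in next() expects. The pure-Python sketch below only illustrates that convention and one way to adapt it to a for loop. It does not use PyLucene, and FakeTermsEnum and iterate_until_none are hypothetical names made up for this illustration. It makes no claim about how cast_ works internally.

```python
class FakeTermsEnum:
    """Hypothetical stand-in for a Java-style iterator.

    Like BytesRefIterator.next() in Java, next() returns the next item,
    or None (standing in for null) once the items are exhausted.
    """

    def __init__(self, terms):
        self._terms = list(terms)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._terms):
            return None  # end of iteration, no exception raised
        term = self._terms[self._pos]
        self._pos += 1
        return term


def iterate_until_none(java_style_iter):
    """Adapt a null-terminated next() method to a Python iterator.

    iter(callable, sentinel) calls the callable repeatedly and stops
    as soon as it returns the sentinel (here: None).
    """
    return iter(java_style_iter.next, None)


# The two terms that the "title" field ("qwe rty") would be split into
terms = list(iterate_until_none(FakeTermsEnum(["qwe", "rty"])))
print(terms)  # ['qwe', 'rty']
```

Seen this way, a loop of the form `while next(x):` mixes the two conventions: Python's next() needs an object that follows the iterator protocol, whereas the Java method reports the end by returning null.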
