Java: How to print all the terms in a Lucene document?

muk1a3rh, posted 2023-04-10 in Java

I am trying to print all of the terms in the documents between two docIDs, but some of the terms I indexed are not being printed.
Apologies in advance: some things are written in Spanish, because this is a project for my Spanish university.
I have this code:

package simpledemo;

import java.nio.file.Paths;
import java.util.List;
import java.util.ArrayList;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
//import org.apache.lucene.search.similarities.DefaultSimilarity; // FIXME Not sure why this is not detected, even though it appears in the Lucene API

public class TopTermsInDocs {

    public static void main(String[] args) {
        // TODO Set the package properly once we tidy up the code
        String usage = "simpledemo.TopTermsInDocs"
        + " [-index INDEX_PATH] [-docID INT1-INT2] [-top N] [-outfile PATH]\n"
        + "The result will be displayed on screen and in a file to be indicated in "
        + "the -outfile path argument, and will show for each document its "
        + "docId and the top n terms with their tf, df and tf x idflog10";

        String indexPath = null; // TODO decide whether to provide a default or let it fail
        String[] range = null;
        Integer docIdStart = null;
        Integer docIdEnd = null;
        Integer topN = null;
        String outPath = null;
        System.out.println(usage);

        for (int i = 0; i < args.length; i++) {
            switch(args[i]) {
                case "-index": 
                    indexPath = args[++i];
                    break;
                case "-docID":
                    range = args[++i].split("-");
                    docIdStart = Integer.parseInt(range[0]);
                    docIdEnd = Integer.parseInt(range[1]);
                    break;
                case "-top":
                    topN = Integer.parseInt(args[++i]);
                    break;
                case "-outfile":
                    outPath = args[++i];
            } 
        }
        IndexReader reader = null; // Initialized to null here; otherwise it cannot be used outside the try block
        // FIXME uncomment once it is recognized:       DefaultSimilarity similarity = new DefaultSimilarity();
        try {
            reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));

            int numDocs = reader.numDocs(); // Total number of documents

            for (int id = docIdStart; id < docIdEnd; id++) {
                System.out.println("Printing docID: " + id);
                Fields fields = reader.getTermVectors(id); // Get all of the document's terms

                for (String fieldName : fields) {
                    Terms terms = fields.terms(fieldName);
                    TermsEnum termsEnum = terms.iterator();
                    BytesRef term = null;

                    while((term = termsEnum.next()) != null) {
                        String termText = term.utf8ToString();
                        long termFreq = termsEnum.totalTermFreq(); // Total frequency of the term
                        int docFreq = termsEnum.docFreq(); // Document frequency
                        int tf = (int) Math.round((double) termFreq / docFreq); // Frequency of the term in the document
                        // FIXME uncomment once it is recognized:    double idf = similarity.idf(docFreq, numDocs);
                        int idf = (int) Math.log((double) numDocs / (docFreq + 1)) + 1;
                        System.out.println("Field: " + fieldName + " - Term: " + termText + " - tf: " + tf + " - idf: " + idf);
                        // TODO first test whether this works; if it does, extract a method that returns a structure with everything
                    }

                }

                System.out.println("\n\n");
            }

        } catch (Exception e) {
            // TODO: handle the exception properly
            e.printStackTrace();
        }




    }
}

I know the -top and -outfile options are not implemented yet, but that is not important for this question.
When I run it for a single document it prints:

Field: LastModifiedTimeLucene - Term: 20230408150702014 - tf: 1 - idf: 2
Field: contents - Term: david - tf: 1 - idf: 2
Field: contents - Term: hola2 - tf: 1 - idf: 2
Field: contents - Term: txt - tf: 1 - idf: 2
Field: creationTime - Term: 2023-04-08 17:07:02 - tf: 1 - idf: 2
Field: creationTimeLucene - Term: 20230408150702014 - tf: 1 - idf: 2
Field: lastAccessTime - Term: 2023-04-09 01:10:26 - tf: 1 - idf: 2
Field: lastAccessTimeLucene - Term: 20230408231026954 - tf: 1 - idf: 2
Field: lastModifiedTime - Term: 2023-04-08 17:07:02 - tf: 1 - idf: 2

Regarding how the documents are created, I have this function:

void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        System.out.println(file.getFileName().toString());
        // TODO Add the onlyLines functionality
        if(config.validateFile(file.getFileName().toString()))
        {
            try (InputStream stream = Files.newInputStream(file)) {
                // make a new, empty document
                Document doc = new Document();
                
                // Add the path of the file as a field named "path". Use a
                // field that is indexed (i.e. searchable), but don't tokenize
                // the field into separate words and don't index term frequency
                // or positional information:
                Field pathField = new StringField("path", file.toString(), Field.Store.YES);
                doc.add(pathField);
                
                String contents = obtainContents(stream);
                
                FieldType tmp_field_type = new FieldType();
                
                tmp_field_type.setTokenized(true);
                tmp_field_type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
                tmp_field_type.setStoreTermVectors(contentsTermVectors);
                tmp_field_type.setStoreTermVectorPositions(contentsTermVectors);
                tmp_field_type.setStored(contentsStored);
                tmp_field_type.freeze();
                
                // Add the contents of the file to a field named "contents". Specify a Reader,
                // so that the text of the file is tokenized and indexed, but not stored.
                // Note that FileReader expects the file to be in UTF-8 encoding.
                // If that's not the case searching for special characters will fail.
                
                Field contentsField = new Field("contents", contents, tmp_field_type);
                doc.add(contentsField);

                // TODO Extend documentation
                Field hostnameField = new StringField("hostname", InetAddress.getLocalHost().getHostName(), Field.Store.YES);
                doc.add(hostnameField);
    
                // TODO Extend documentation
                Field threadField = new StringField("thread", Thread.currentThread().getName(), Field.Store.YES);
                doc.add(threadField);
                
                // TODO Extend documentation
                
                BasicFileAttributes at = Files.readAttributes(file, BasicFileAttributes.class);
                String type;
                if (at.isDirectory()) type = "isDirectory";
                else if (at.isRegularFile()) type = "isRegularFile";
                else if (at.isSymbolicLink()) type = "isSymbolicLink";
                else if (at.isOther()) type = "isOther";
                else type = "error";

                doc.add(new StringField("type", type, Field.Store.YES));
                
                // TODO Extend documentation
                
                doc.add(new LongPoint("sizeKB", at.size())); // ! CUIDAO
                doc.add(new StoredField("sizeKB", at.size()));

                // Add the last modified date of the file a field named "modified".
                // Use a LongPoint that is indexed (i.e. efficiently filterable with
                // PointRangeQuery). This indexes to milli-second resolution, which
                // is often too fine. You could instead create a number based on
                // year/month/day/hour/minutes/seconds, down the resolution you require.
                // For example the long value 2011021714 would mean
                // February 17, 2011, 2-3 PM.
                doc.add(new LongPoint("modified", lastModified));
                doc.add(new StoredField("modified", lastModified));
    
                
                String dateFormat = "yyyy-MM-dd HH:mm:ss";
                SimpleDateFormat simpleDateFormat = new SimpleDateFormat(dateFormat);

                FileTime creationTime = at.creationTime();
                String creationTimeFormateado = simpleDateFormat.format(new Date(creationTime.toMillis()));
                doc.add(new Field("creationTime", creationTimeFormateado, TYPE_STORED));

                FileTime lastAccessTime = at.lastAccessTime();
                String lastAccessTimeFormateado = simpleDateFormat.format(new Date(lastAccessTime.toMillis()));
                doc.add(new Field("lastAccessTime", lastAccessTimeFormateado, TYPE_STORED));

                FileTime lastModifiedTime = at.lastModifiedTime();
                String lastTimeModifiedTimeFormateado = simpleDateFormat.format(new Date(lastModifiedTime.toMillis()));
                doc.add(new Field("lastModifiedTime", lastTimeModifiedTimeFormateado, TYPE_STORED));

                Date creationTimelucene = new Date(creationTime.toMillis());
                String s1 = DateTools.dateToString(creationTimelucene, DateTools.Resolution.MILLISECOND);
                doc.add(new Field("creationTimeLucene", s1, TYPE_STORED));

                Date lastAccessTimelucene = new Date(lastAccessTime.toMillis());
                String s2 = DateTools.dateToString(lastAccessTimelucene, DateTools.Resolution.MILLISECOND);
                doc.add(new Field("lastAccessTimeLucene", s2, TYPE_STORED));

                Date lastModifiedTimelucene = new Date(lastModifiedTime.toMillis());
                String s3 = DateTools.dateToString(lastModifiedTimelucene, DateTools.Resolution.MILLISECOND);
                doc.add(new Field("LastModifiedTimeLucene", s3, TYPE_STORED));
                
                if (demoEmbeddings != null) {
                    try (InputStream in = Files.newInputStream(file)) {
                        float[] vector = demoEmbeddings.computeEmbedding(
                                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)));
                        doc.add(
                                new KnnVectorField("contents-vector", vector, VectorSimilarityFunction.DOT_PRODUCT));
                    }
                }
    
                if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                    // New index, so we just add the document (no old document can be there):
                    System.out.println("adding " + file);
                    writer.addDocument(doc);
                } else {
                    // Existing index (an old copy of this document may have been indexed) so
                    // we use updateDocument instead to replace the old one matching the exact
                    // path, if present:
                    System.out.println("updating " + file);
                    writer.updateDocument(new Term("path", file.toString()), doc);
                }
            }
        }
        else
            System.out.println("Este archivo va a ser ignorado");
    }

But I have indexed more fields in the document, such as the file type. Why are they not shown?

wljmcqd8's answer:

To print specific field data from documents between two docIDs: the specific missing example you mention is the "file type", for example:

doc.add(new StringField("type", "isDirectory", Field.Store.YES));

Your code expects to access this field with:

Fields fields = reader.getTermVectors(id);

But the type field is not a term vector. It is a StringField, that is:
a field that is indexed but not tokenized: the entire string value is indexed as a single token.
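
(If you did want the type value to show up via getTermVectors, one option would be to index it with a custom FieldType that stores term vectors, instead of using StringField. The following is only a minimal sketch under that assumption, reusing the FieldType and IndexOptions classes your indexer already imports:)

// Hypothetical alternative: index "type" so that it also appears in reader.getTermVectors(id)
FieldType typeWithVectors = new FieldType();
typeWithVectors.setIndexOptions(IndexOptions.DOCS);
typeWithVectors.setTokenized(false);        // keep the whole value as a single token, like StringField
typeWithVectors.setStoreTermVectors(true);  // this is what getTermVectors(id) relies on
typeWithVectors.setStored(true);
typeWithVectors.freeze();

doc.add(new Field("type", "isDirectory", typeWithVectors));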
Instead, you can retrieve this field with:

reader.document(id).getFields()

For example (I chose to use forEach here, but you could also use your loop):

for (int id = 0; id < 1; id++) {
    reader.document(id).getFields()
        .forEach(field -> System.out.println(field.name() 
        + " - " + field.stringValue()));
}

For my type example, the code above prints:

type - isDirectory
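
Note that reader.document(id) only returns stored field values, while reader.getTermVectors(id) only returns fields that were indexed with term vectors enabled. If you want both views in one report, a sketch along the following lines could combine them (it reuses the loop bounds from your own code and assumes the extra import org.apache.lucene.index.IndexableField):

for (int id = docIdStart; id < docIdEnd; id++) {
    System.out.println("docID: " + id);

    // Stored fields (e.g. "type", "path", "hostname") come from document(id).
    // stringValue() can be null for numeric stored fields such as "sizeKB".
    for (IndexableField field : reader.document(id).getFields()) {
        System.out.println("  stored  " + field.name() + " = " + field.stringValue());
    }

    // Term vectors: only fields indexed with setStoreTermVectors(true) appear here.
    Fields vectors = reader.getTermVectors(id);
    if (vectors != null) {
        for (String fieldName : vectors) {
            TermsEnum termsEnum = vectors.terms(fieldName).iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                System.out.println("  vector  " + fieldName + " : " + term.utf8ToString());
            }
        }
    }
}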

Thank you for the edits to the question. It is difficult to work with them because they do not provide an MRE (minimal reproducible example): they reference other methods, whereas some hard-coded data would have been more helpful, for example like my sample:

doc.add(new StringField("type", "isDirectory", Field.Store.YES));

This is self-contained and does not depend on any files or other methods that are not shown in your question.
If there are other fields missing from the output, you can look at the types of those fields and provide a hard-coded data example for each.
It may even be worth asking a new, more focused question about them.
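
For reference, a fully self-contained MRE along those lines might look like the sketch below. Everything in it is illustrative rather than taken from your project: it uses an in-memory ByteBuffersDirectory, a StandardAnalyzer, and one hard-coded document, just to contrast what document(id) and getTermVectors(id) each return.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class TermVectorMre {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();

            // Stored + indexed, but no term vectors: visible via document(id), not via getTermVectors(id)
            doc.add(new StringField("type", "isDirectory", Field.Store.YES));

            // Tokenized field with term vectors: visible via getTermVectors(id)
            FieldType contentsType = new FieldType();
            contentsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            contentsType.setTokenized(true);
            contentsType.setStoreTermVectors(true);
            contentsType.freeze();
            doc.add(new Field("contents", "hola2 david txt", contentsType));

            writer.addDocument(doc);
        }

        try (IndexReader reader = DirectoryReader.open(dir)) {
            int id = 0;

            System.out.println("-- stored fields --");
            reader.document(id).getFields()
                    .forEach(f -> System.out.println(f.name() + " - " + f.stringValue()));

            System.out.println("-- term vectors --");
            Fields vectors = reader.getTermVectors(id);
            for (String fieldName : vectors) {
                TermsEnum termsEnum = vectors.terms(fieldName).iterator();
                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    System.out.println(fieldName + " : " + term.utf8ToString());
                }
            }
        }
    }
}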
