I am trying to print all the terms of the documents between two docIDs, but some of the terms I indexed are not being printed.

Apologies that some things are written in Spanish - this is a project for my Spanish university.

I have this code:
```
package simpledemo;

import java.nio.file.Paths;
import java.util.List;
import java.util.ArrayList;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
//import org.apache.lucene.search.similarities.DefaultSimilarity; // FIXME I don't know why this is not detected even though it appears in the Lucene API

public class TopTermsInDocs {

    public static void main(String[] args) {
        // TODO Set the correct package once we organize the code
        String usage = "simpledemo.TopTermsInDocs"
                + " [-index INDEX_PATH] [-docID INT1-INT2] [-top N] [-outfile PATH]\n"
                + "The result will be displayed on screen and in a file to be indicated in "
                + "the -outfile path argument, will show for each document its "
                + "docId and the top n terms with its tf, df and tf x idflog10";
        String indexPath = null; // TODO decide whether to set a default or let it fail
        String[] range = null;
        Integer docIdStart = null;
        Integer docIdEnd = null;
        Integer topN = null;
        String outPath = null;
        System.out.println(usage);
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-index":
                    indexPath = args[++i];
                    break;
                case "-docID":
                    range = args[++i].split("-");
                    docIdStart = Integer.parseInt(range[0]);
                    docIdEnd = Integer.parseInt(range[1]);
                    break;
                case "-top":
                    topN = Integer.parseInt(args[++i]);
                    break;
                case "-outfile":
                    outPath = args[++i];
            }
        }
        IndexReader reader = null; // Initialized to null here, otherwise it cannot be used outside the try block
        // FIXME uncomment when it is recognized: DefaultSimilarity similarity = new DefaultSimilarity();
        try {
            reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
            int numDocs = reader.numDocs(); // Total number of documents
            for (int id = docIdStart; id < docIdEnd; id++) {
                System.out.println("Going to print docID: " + id);
                Fields fields = reader.getTermVectors(id); // Get all the terms of the document
                for (String fieldName : fields) {
                    Terms terms = fields.terms(fieldName);
                    TermsEnum termsEnum = terms.iterator();
                    BytesRef term = null;
                    while ((term = termsEnum.next()) != null) {
                        String termText = term.utf8ToString();
                        long termFreq = termsEnum.totalTermFreq(); // Total frequency of the term
                        int docFreq = termsEnum.docFreq(); // Document frequency
                        int tf = (int) Math.round(termFreq / docFreq); // Frequency of the term in the document
                        // FIXME uncomment when it is recognized: double idf = similarity.idf(docFreq, numDocs);
                        int idf = (int) Math.log(numDocs / (docFreq + 1)) + 1;
                        System.out.println("Field: " + fieldName + " - Term: " + termText + " - tf: " + tf + " - idf: " + idf);
                        // TODO first test whether this works; if it does, I can write a function that returns a structure with everything
                    }
                }
                System.out.println("\n\n");
            }
        } catch (Exception e) {
            // TODO: handle exception
        }
    }
}
```
I know the -top and -outfile options are not implemented yet, but that does not matter for this question.
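For reference, an invocation looks something like this (the index path here is hypothetical):

```
java simpledemo.TopTermsInDocs -index ./index -docID 0-1
```

With `-docID 0-1`, the loop condition `id < docIdEnd` means only document 0 is printed.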
When I run it for a single document, it shows:

```
Field: LastModifiedTimeLucene - Term: 20230408150702014 - tf: 1 - idf: 2
Field: contents - Term: david - tf: 1 - idf: 2
Field: contents - Term: hola2 - tf: 1 - idf: 2
Field: contents - Term: txt - tf: 1 - idf: 2
Field: creationTime - Term: 2023-04-08 17:07:02 - tf: 1 - idf: 2
Field: creationTimeLucene - Term: 20230408150702014 - tf: 1 - idf: 2
Field: lastAccessTime - Term: 2023-04-09 01:10:26 - tf: 1 - idf: 2
Field: lastAccessTimeLucene - Term: 20230408231026954 - tf: 1 - idf: 2
Field: lastModifiedTime - Term: 2023-04-08 17:07:02 - tf: 1 - idf: 2
```
Regarding how the files are created, I have this function:
```
void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
    System.out.println(file.getFileName().toString());
    // TODO Add the onlyLines functionality
    if (config.validateFile(file.getFileName().toString())) {
        try (InputStream stream = Files.newInputStream(file)) {
            // make a new, empty document
            Document doc = new Document();

            // Add the path of the file as a field named "path". Use a
            // field that is indexed (i.e. searchable), but don't tokenize
            // the field into separate words and don't index term frequency
            // or positional information:
            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);

            String contents = obtainContents(stream);
            FieldType tmp_field_type = new FieldType();
            tmp_field_type.setTokenized(true);
            tmp_field_type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            tmp_field_type.setStoreTermVectors(contentsTermVectors);
            tmp_field_type.setStoreTermVectorPositions(contentsTermVectors);
            tmp_field_type.setStored(contentsStored);
            tmp_field_type.freeze();

            // Add the contents of the file to a field named "contents". Specify a Reader,
            // so that the text of the file is tokenized and indexed, but not stored.
            // Note that FileReader expects the file to be in UTF-8 encoding.
            // If that's not the case searching for special characters will fail.
            Field contentsField = new Field("contents", contents, tmp_field_type);
            doc.add(contentsField);

            // TODO Extend documentation
            Field hostnameField = new StringField("hostname", InetAddress.getLocalHost().getHostName(), Field.Store.YES);
            doc.add(hostnameField);

            // TODO Extend documentation
            Field threadField = new StringField("thread", Thread.currentThread().getName(), Field.Store.YES);
            doc.add(threadField);

            // TODO Extend documentation
            BasicFileAttributes at = Files.readAttributes(file, BasicFileAttributes.class);
            String type;
            if (at.isDirectory()) type = "isDirectory";
            else if (at.isRegularFile()) type = "isRegularFile";
            else if (at.isSymbolicLink()) type = "isSymbolicLink";
            else if (at.isOther()) type = "isOther";
            else type = "error";
            doc.add(new StringField("type", type, Field.Store.YES));

            // TODO Extend documentation
            doc.add(new LongPoint("sizeKB", at.size())); // ! CAREFUL
            doc.add(new StoredField("sizeKB", at.size()));

            // Add the last modified date of the file a field named "modified".
            // Use a LongPoint that is indexed (i.e. efficiently filterable with
            // PointRangeQuery). This indexes to milli-second resolution, which
            // is often too fine. You could instead create a number based on
            // year/month/day/hour/minutes/seconds, down the resolution you require.
            // For example the long value 2011021714 would mean
            // February 17, 2011, 2-3 PM.
            doc.add(new LongPoint("modified", lastModified));
            doc.add(new StoredField("modified", lastModified));

            String dateFormat = "yyyy-MM-dd HH:mm:ss";
            SimpleDateFormat simpleDateFormat = new SimpleDateFormat(dateFormat);
            FileTime creationTime = at.creationTime();
            String creationTimeFormateado = simpleDateFormat.format(new Date(creationTime.toMillis()));
            doc.add(new Field("creationTime", creationTimeFormateado, TYPE_STORED));
            FileTime lastAccessTime = at.lastAccessTime();
            String lastAccessTimeFormateado = simpleDateFormat.format(new Date(lastAccessTime.toMillis()));
            doc.add(new Field("lastAccessTime", lastAccessTimeFormateado, TYPE_STORED));
            FileTime lastModifiedTime = at.lastModifiedTime();
            String lastTimeModifiedTimeFormateado = simpleDateFormat.format(new Date(lastModifiedTime.toMillis()));
            doc.add(new Field("lastModifiedTime", lastTimeModifiedTimeFormateado, TYPE_STORED));

            Date creationTimelucene = new Date(creationTime.toMillis());
            String s1 = DateTools.dateToString(creationTimelucene, DateTools.Resolution.MILLISECOND);
            doc.add(new Field("creationTimeLucene", s1, TYPE_STORED));
            Date lastAccessTimelucene = new Date(lastAccessTime.toMillis());
            String s2 = DateTools.dateToString(lastAccessTimelucene, DateTools.Resolution.MILLISECOND);
            doc.add(new Field("lastAccessTimeLucene", s2, TYPE_STORED));
            Date lastModifiedTimelucene = new Date(lastModifiedTime.toMillis());
            String s3 = DateTools.dateToString(lastModifiedTimelucene, DateTools.Resolution.MILLISECOND);
            doc.add(new Field("LastModifiedTimeLucene", s3, TYPE_STORED));

            if (demoEmbeddings != null) {
                try (InputStream in = Files.newInputStream(file)) {
                    float[] vector = demoEmbeddings.computeEmbedding(
                            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)));
                    doc.add(
                            new KnnVectorField("contents-vector", vector, VectorSimilarityFunction.DOT_PRODUCT));
                }
            }

            if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                // New index, so we just add the document (no old document can be there):
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                // Existing index (an old copy of this document may have been indexed) so
                // we use updateDocument instead to replace the old one matching the exact
                // path, if present:
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    } else {
        System.out.println("This file will be ignored");
    }
}
```
But I have indexed more fields for the document, like file type. Why are they not shown?
1 Answer
To print specific field data *"from documents between two docIDs"*:

The missing example you specifically mention is *"the file type"* - that is, the field your `indexDoc` method adds with `doc.add(new StringField("type", type, Field.Store.YES));`.
Your code tries to access this field through the term vectors it reads with `reader.getTermVectors(id)`. But the `type` field is not a term vector. It is a `StringField` - that is: a field which is indexed but not tokenized (the entire string value is indexed as a single token) - and `StringField` does not store term vectors, so it never appears in the `Fields` returned by `getTermVectors`.
Instead, you can get this field from the document's stored fields:
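A minimal sketch of that lookup, reusing `reader` and `id` from your `TopTermsInDocs` loop (assuming a Lucene 9.x `IndexReader`):

```
// "type" was added with Field.Store.YES, so it can be read back from the
// document's stored fields - term vectors are not involved at all.
Document doc = reader.document(id);
String type = doc.get("type"); // e.g. "isRegularFile"
System.out.println("type: " + type);
```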
For example (I chose to use `forEach` here - but you could also use your loop):
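A sketch of that approach, again assuming `reader` and `id` from your code - it prints every stored field of the document:

```
// Iterate over all stored fields. Fields that were not stored
// (e.g. "contents" when contentsStored is false) will not appear here.
Document doc = reader.document(id);
doc.getFields().forEach(field ->
        System.out.println(field.name() + ": " + field.stringValue()));
```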
For my `type` example, the above code prints something along the lines of `type: isRegularFile`.

Thank you for the edits to the question. It is difficult to work with these edits because they do not provide an MRE - there are references to other methods where some hard-coded data would probably be more helpful - for example, something like my example below:
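A self-contained sketch in that spirit - hard-coded field data and an in-memory index, nothing read from disk (the class name and field values here are illustrative):

```
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoredFieldsDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index - no files involved
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Tokenized but not stored - it will NOT show up in the stored fields below:
            doc.add(new TextField("contents", "hola2 david txt", Field.Store.NO));
            // Indexed as a single token AND stored:
            doc.add(new StringField("type", "isRegularFile", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            Document doc = reader.document(0);
            // Prints "type: isRegularFile" - the only stored field:
            doc.getFields().forEach(f -> System.out.println(f.name() + ": " + f.stringValue()));
        }
    }
}
```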
This is self-contained and does not depend on any files or other methods not shown in your question.

If there are other missing fields in the output, you can look at the types of those fields and maybe provide a hard-coded data example. It may even be worth asking a new, more focused question for those.