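I use Elastic to search PDF files. One of the fields in the PDF content is doridat, a date stored as an integer. The newest documents should get a higher score (a higher ranking). That means the higher the value of the doridat field, the higher the score should be. Only the results of searching in attachment.content and doridat should affect the score.
How can I force the scoring to take the value of the doridat field into account?
My query: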
GET /attachments/_search
{
  "size": 2,
  "from": 0,
  "query": {
    "wildcard": {
      "attachment.content": {
        "value": "*berg*",
        "rewrite": "scoring_boolean"
      }
    }
  },
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  },
  "_source": {
    "excludes": "attachment.content"
  }
}
My mapping:
{
  "attachments" : {
    "mappings" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "author" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "content" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "creator_tool" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "date" : {
              "type" : "date"
            },
            "description" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "detect_language" : {
              "type" : "boolean"
            },
            "format" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "indexed_chars" : {
              "type" : "long"
            },
            "keywords" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "language" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "metadata_date" : {
              "type" : "date"
            },
            "modified" : {
              "type" : "date"
            },
            "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "title" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        },
        "content" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "daname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "do__nr" : {
          "type" : "integer"
        },
        "do_typ" : {
          "type" : "integer"
        },
        "doext" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "doname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "donr" : {
          "type" : "integer"
        },
        "doridat" : {
          "type" : "integer"
        },
        "dowww" : {
          "type" : "integer"
        },
        "id" : {
          "type" : "integer"
        },
        "path" : {
          "type" : "text",
          "analyzer" : "windows_path_hierarchy_analyzer"
        }
      }
    }
  }
}
1 answer
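I think a wildcard query always returns 1.0 as the score for a match (even if the term matches more than once), at least with the default rewrite. Rank feature looks like a good fit for your use case. You need to copy the doridat field and index the copy with the rank_feature field type. You will then be able to use that field in a rank_feature query. Which Elasticsearch version are you using?
Another option is a script_score query. You can basically just return doridat from the script, since wildcard returns 1.0 as the score anyway.
You could also use an N-gram tokenizer on attachment.content to get matching similar to a wildcard query. When you use match instead of wildcard, the matches will be scored better.
The documentation states that rank feature has better performance (documents can be skipped at search time).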
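The rank feature and script score options can be sketched in Console syntax as follows. This is a minimal, untested sketch under several assumptions: Elasticsearch 7.0 or later, doridat_rank as a hypothetical name for the copied field, a backfill through _update_by_query as just one possible way to populate it, and every doridat value being strictly positive (rank_feature rejects zero and negative values). Highlighting and _source filtering from the original query are left out for brevity.

```
# Option 1, step 1: add a rank_feature field.
# "doridat_rank" is a hypothetical field name, not part of the original mapping.
PUT /attachments/_mapping
{
  "properties": {
    "doridat_rank": { "type": "rank_feature" }
  }
}

# Option 1, step 2: copy doridat into the new field for existing documents.
# Assumes doridat is present in _source and strictly positive.
# Newly indexed documents would need doridat_rank set at index time as well.
POST /attachments/_update_by_query
{
  "query": { "exists": { "field": "doridat" } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.doridat_rank = ctx._source.doridat"
  }
}

# Option 1, step 3: the wildcard decides which documents match,
# the rank_feature clause adds a boost that grows with doridat_rank.
GET /attachments/_search
{
  "size": 2,
  "query": {
    "bool": {
      "must": {
        "wildcard": {
          "attachment.content": {
            "value": "*berg*",
            "rewrite": "scoring_boolean"
          }
        }
      },
      "should": {
        "rank_feature": { "field": "doridat_rank" }
      }
    }
  }
}

# Option 2: script_score replaces the wildcard score with the doridat value.
# Documents without a doridat value get a score of 0 in this sketch.
GET /attachments/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "wildcard": {
          "attachment.content": { "value": "*berg*" }
        }
      },
      "script": {
        "source": "doc['doridat'].size() == 0 ? 0 : doc['doridat'].value"
      }
    }
  }
}
```

In the first search the rank_feature score is added to the wildcard score, while in the second the script result replaces it entirely. One thing to check before relying on rank_feature: the field type stores values with reduced precision (the documentation mentions about 9 significant bits), so if doridat holds large integers that differ only in their last digits, nearby dates may receive the same boost. Whether that matters depends on how doridat is encoded.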