Using the "unique" filter, Elasticsearch analyzes tokens incorrectly

Posted by aamkag61 on 2023-04-05 in ElasticSearch


I have been trying to use the unique token filter in my analyzer, but duplicate tokens still seem to be used when the score is calculated.

Analyzer:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "tnved_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",    
                        "stemmer",
                        "unique"
                    ]
                }
            }           
        }
    },
    "mappings": {
        "properties": {
            "NAME": {
                "type": "text",
                "analyzer": "tnved_analyzer"
            },
            "CODE": {
                "type": "keyword"
            }
        }
    }
}

Request:

{
  "query": {
    "match_phrase": {
      "NAME": "Pork fresh or chilled"
    }
  }
}

Response:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 14.432465,
        "hits": [
            {
                "_index": "tnved14_code",
                "_type": "_doc",
                "_id": "1oajS4cBEkrkvkWeGRXx",
                "_score": 14.432465,
                "_source": {
                    "CODE": "0203",
                    "NAME": "Pork fresh or chilled"
                }
            }
        ]
    }
}

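The score for an exact match is 14.432465. I expect the same score for a request in which a word is repeated, because after the unique filter its tokens should be the same as for the request above, "Pork fresh or chilled":

pork
fresh
or
chill

But the score I get is twice as high: 28.864931.
I want to get 14.432465. What is wrong?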

elasticsearch

Source: https://stackoverflow.com/questions/75928692/using-unique-filter-elasticsearch-analyzes-tokens-incorrectly

2 Answers


0tdrvxhp1#

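Run the text through the analyze API to see exactly which tokens the analyzer produces. Knowing the tokens makes it easier to understand how the query is scored.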

GET index_name/_analyze
{
  "text": "Pork fresh or chilled",
  "analyzer": "tnved_analyzer"
}
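
As a rough illustration of what to look for in the _analyze output, the filter chain can be mimicked in a few lines of Python. This is only a sketch, not Elasticsearch's implementation: `standard_tokenize` is a regex approximation of the standard tokenizer, `toy_stem` is a hypothetical and deliberately naive stand-in for the real stemmer filter, and the second input string is an assumed example of a request with a repeated word.

```python
import re


def standard_tokenize(text):
    # Approximation of the standard tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # Mirrors the "lowercase" filter.
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    # Hypothetical stemmer: only strips a trailing "ed" from longer words.
    # The real "stemmer" filter uses a proper stemming algorithm.
    return [t[:-2] if t.endswith("ed") and len(t) > 4 else t for t in tokens]


def unique(tokens):
    # Mirrors the idea of the "unique" filter: keep the first occurrence only.
    seen = set()
    result = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


def analyze(text):
    # Same order as in tnved_analyzer: tokenizer, lowercase, stemmer, unique.
    return unique(toy_stem(lowercase(standard_tokenize(text))))


print(analyze("Pork fresh or chilled"))
print(analyze("Pork pork fresh or chilled"))  # assumed input with a repeated word
```

If the real analyzer also returns the same token list for both inputs, the duplicate is being removed at analysis time, and the difference in score has to come from somewhere else.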

In addition, the explain API will show you how the score was calculated.

2023-04-05

lpwwtiir2#

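Several factors determine a document's score:

TF (term frequency): the more often a term appears in a field, the more relevant the document.
IDF (inverse document frequency): the more documents contain the search term, the less weight that term carries.
Field length: a match in a shorter field is considered more relevant than a match in a longer field.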
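
To see how these three factors combine, here is a minimal sketch of the textbook BM25 formula with k1 = 1.2 and b = 0.75, the defaults of Elasticsearch's BM25 similarity. It is not Lucene's exact code, and the document frequencies, collection size, and average field length are made-up numbers chosen only for illustration. The last lines show that when the same term reaches the scorer twice, its contribution is simply added twice.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # IDF: terms found in fewer documents get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Field length: fields longer than average are penalized.
    norm = 1 - b + b * doc_len / avg_doc_len
    # TF: grows with the term frequency but saturates.
    tf = freq * (k1 + 1) / (freq + k1 * norm)
    return idf * tf


def score(query_terms, doc_terms, avg_doc_len, n_docs, doc_freqs):
    # Sum the per-term scores; a repeated query term is counted once per occurrence.
    total = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq:
            total += bm25_term_score(
                freq, len(doc_terms), avg_doc_len, n_docs, doc_freqs[term]
            )
    return total


doc = ["pork", "fresh", "or", "chill"]
doc_freqs = {"pork": 3, "fresh": 40, "or": 500, "chill": 25}  # made-up values
N_DOCS = 1000   # made-up collection size
AVG_LEN = 8.0   # made-up average field length

once = score(["pork"], doc, AVG_LEN, N_DOCS, doc_freqs)
twice = score(["pork", "pork"], doc, AVG_LEN, N_DOCS, doc_freqs)
print(once, twice)
```

Whether a duplicated query term is really what doubles the score in the question is something only the explain output for that query can confirm.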

I recommend reading this post.

2023-04-05
