Boosting documents by term match after cosine similarity in Elasticsearch

bnlyeluc  posted on 2023-04-20  in  ElasticSearch

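I use text embeddings stored in Elasticsearch to retrieve documents similar to a query. However, I noticed that in some cases documents that contain none of the words from the query get a higher score. I would therefore like to boost the score of documents that do contain words from the query. How can I do this in Elasticsearch?
This is my index mapping: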

{
    "mappings": {
        "properties": {
            "question_text": {
                "type": "text"
            },
            "question_vector": {
                "type": "dense_vector",
                "dims": 768
            }
        }
    }
}

This is what I tried:

{
    "query":{
        "script_score": {
            "query": {
                "bool": {
                    "must": [
                        {
                            "more_like_this": {
                                "fields": [
                                    "question_text"
                                ],
                                "like": query_text,
                                "min_term_freq": 1,
                                "max_query_terms": 12,
                                "minimum_should_match": "3<60%"
                            }
                        }
                    ]
                }
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'question_vector') + 1.0",
                "params": {"query_vector": query_vector}
            }
        }
    },
    "fields": [
        "question_text"
    ],
    "_source": false
}

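But now I only get documents that contain the words. Is there a way to do this while still getting matches that do not contain the words, just with a lower score?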
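
For reference, the script in the query above only scores documents that the inner query has already matched, which is why placing more_like_this in "must" restricts the results. What the script itself computes can be sketched in plain Python as follows. This is only an illustration: the function names cosine_similarity and script_score are hypothetical and are not part of Elasticsearch.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    # Mirrors "cosineSimilarity(params.query_vector, 'question_vector') + 1.0".
    # Cosine similarity lies in [-1, 1]; adding 1.0 shifts it into [0, 2],
    # which keeps the resulting score non-negative.
    return cosine_similarity(query_vector, doc_vector) + 1.0
```

Identical directions score 2.0, orthogonal vectors 1.0, and opposite directions 0.0, so term matching plays no part in this score at all.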

dy2hfwbg  #1

Use a function_score query.

{
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [
                            {
                                "more_like_this": {
                                    "fields": [
                                        "question_text"
                                    ],
                                    "like": "Once upon a time",
                                    "min_doc_freq": 1,
                                    "min_term_freq": 1,
                                    "max_query_terms": 12,
                                    "minimum_should_match": "1<60%"
                                }
                            }
                        ]
                    }
                },
                "boost": "1",
                "functions": [
                    {
                        "script_score": {
                            "script": {
                                "source": "cosineSimilarity(params.query_vector, 'question_vector') + 1.0",
                                "params": {
                                    "query_vector": [
                                        -0.5,
                                        10,
                                        20
                                    ]
                                }
                            }
                        },
                        "weight": 1000
                    }
                    
                ],
                "boost_mode": "sum"
            }
        }
    }

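Explanation:
boost -> boost applied to the whole function_score query
weight -> multiplier applied to the cosine similarity function
Because boost_mode is "sum", final score = query score + weighted function score.
The three-element query_vector is only a placeholder; it must have the same number of dimensions as the mapped field (768 in the question).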
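
Note that the query above still keeps more_like_this inside "must", so it continues to return only documents that share terms with the query text. One possible variant, sketched below in Python under stated assumptions, keeps every document via match_all and moves the term matching into an optional "should" clause, so that matching documents simply receive a higher query score. The helper names build_hybrid_query and combined_score, the use of a plain match query in place of more_like_this, and the default weight are all assumptions made for illustration, not part of the original answer.

```python
import json


def build_hybrid_query(query_text, query_vector, cosine_weight=10.0):
    """Build a function_score body in which term matches are optional."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # match_all keeps every document in the result set.
                        "must": [{"match_all": {}}],
                        # Optional clause: documents sharing terms with the
                        # query text get a higher query score.
                        "should": [{"match": {"question_text": query_text}}],
                    }
                },
                "functions": [
                    {
                        "script_score": {
                            "script": {
                                "source": "cosineSimilarity(params.query_vector, 'question_vector') + 1.0",
                                "params": {"query_vector": query_vector},
                            }
                        },
                        # Multiplies the script result before it is combined.
                        "weight": cosine_weight,
                    }
                ],
                # Add the query score and the weighted function score.
                "boost_mode": "sum",
            }
        }
    }


def combined_score(query_score, cosine, weight):
    # Arithmetic behind boost_mode "sum" with a single weighted function.
    return query_score + weight * (cosine + 1.0)


if __name__ == "__main__":
    # Placeholder vector; a real one needs as many dimensions as the mapping.
    print(json.dumps(build_hybrid_query("Once upon a time", [0.1, 0.2, 0.3]), indent=2))
```

The weight controls how much the vector similarity dominates the text score; a suitable value depends on the data and would need to be tuned.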

gpnt7bae  #2

{
    "query": {
        "boosting": {
            "positive": {
                "function_score": {
                    "query": {
                        "match_all": {}
                    },
                    "script_score": {
                        "script": {
                            "source": "cosineSimilarity(params.query_vector, 'question_vector') + 1.0",
                            "params": {"query_vector": embedding}
                        }
                    }
                }
            },
            "negative": {
                "bool": {
                    "must_not": [
                        {
                            "more_like_this": {
                                "fields": [
                                    "question_text"
                                ],
                                "like": text,
                                "min_doc_freq": 0,
                                "min_term_freq": 0,
                                "max_query_terms": 12,
                                "minimum_should_match": "3<60%"
                            }
                        }
                    ]
                }
            },
            "negative_boost": 0.8
        }
    },
    "_source": "question_text"
}

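This query selects all documents and computes their cosine similarity, then lowers the score of documents that have no matching terms by multiplying it by negative_boost (0.8 here).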
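
The effect on ranking can be sketched with a few lines of Python. This is only an illustration of the arithmetic; the function name final_score and the sample similarity values are hypothetical.

```python
def final_score(cosine, has_term_match, negative_boost=0.8):
    """Approximate the score a boosting query like the one above would give."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1.0")
    # The "positive" part: cosine similarity shifted to be non-negative.
    score = cosine + 1.0
    # The "negative" part matches documents WITHOUT term matches (must_not),
    # and those documents have their score multiplied by negative_boost.
    if not has_term_match:
        score *= negative_boost
    return score


if __name__ == "__main__":
    # A very similar document with no shared terms...
    print(final_score(0.9, has_term_match=False))
    # ...versus a somewhat less similar document that shares terms.
    print(final_score(0.7, has_term_match=True))
```

With these sample values the document that shares terms ranks higher even though its cosine similarity is lower, while the other document is still returned.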
