Elasticsearch: dense vector array and cosine similarity

Posted by ds97pgxw on 2023-03-22 in ElasticSearch


I want to store an array of dense_vector values in my document, but this does not work the way it does for other data types.

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "dense_vector",
        "dims": 3  
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [[0.5, 10, 6], [-0.5, 10, 10]]
}

This returns:

'1 document(s) failed to index.',
    {'_index': 'my_index', '_type': '_doc', '_id': 'some_id', 'status': 400, 'error': 
      {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': 
        {'type': 'parsing_exception', 
         'reason': 'Failed to parse object: expecting token of type [VALUE_NUMBER] but found [START_ARRAY]'
        }
      }
    }

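How can I achieve this? Different documents will have a variable number of vectors, but never more than a few.
I would also like to query it by computing cosineSimilarity against every vector in the array. The code below is what I normally use when a document holds only a single vector.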

"script_score": {
    "query": {
        "match_all": {}
    },
    "script": {
        "source": "(1.0+cosineSimilarity(params.query_vector, doc['my_vectors']))",
        "params": {"query_vector": query_vector}
    }
}

Ideally, I would like either the closest similarity or the average.

elasticsearch

Source: https://stackoverflow.com/questions/61376317/dense-vector-array-and-cosine-similarity

3 Answers


yks3o0rb1#

The dense_vector data type expects a single array of numeric values per document, like this:

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [0.5, 10, 6]
}

To store an arbitrary number of vectors, you can make the my_vectors field a "nested" type that holds an array of objects, each containing one vector:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 3  
          }
        }
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [
    {"vector": [0.5, 10, 6]}, 
    {"vector": [-0.5, 10, 10]}
  ]
}
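
Since each document may carry a different number of vectors, the nested body can be built programmatically. Below is a minimal Python sketch, assuming the my_vectors and vector field names from the mapping above. The helper name to_nested_doc and its dims check are illustrative only and are not part of any Elasticsearch client.

```python
def to_nested_doc(text, vectors, dims=3):
    """Wrap each vector in its own object so it fits the nested mapping."""
    for vec in vectors:
        # Every vector must have exactly the number of dimensions in the mapping.
        if len(vec) != dims:
            raise ValueError(f"expected {dims} dimensions, got {len(vec)}")
    return {
        "my_text": text,
        "my_vectors": [{"vector": [float(x) for x in vec]} for vec in vectors],
    }


# Example: the two vectors from the question become two nested objects.
doc = to_nested_doc("text1", [[0.5, 10, 6], [-0.5, 10, 10]])
```

The resulting dict can be sent as the document body with whichever client or HTTP library you already use.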

Edit

Then, to query the documents, you can use the following (as of ES v7.6.1):

{
  "query": {
    "nested": {
      "path": "my_vectors",
      "score_mode": "max", 
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
              "params": {"query_vector": query_vector}
            }
          }
        }
      }
    }
  }
}
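
Note that query_vector in the request above is a placeholder and must be replaced with an actual list of numbers before the request is sent. Here is a hedged Python sketch that builds the same request body as a dict. The function name build_nested_cosine_query and its parameters are hypothetical; with_inner_hits simply adds the nested query's inner_hits option.

```python
def build_nested_cosine_query(query_vector, score_mode="max", with_inner_hits=False):
    """Build the nested script-score request body shown above as a Python dict."""
    nested = {
        "path": "my_vectors",
        "score_mode": score_mode,  # e.g. "max" for the closest vector, "avg" for the mean
        "query": {
            "function_score": {
                "script_score": {
                    "script": {
                        "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
                        "params": {"query_vector": list(query_vector)},
                    }
                }
            }
        },
    }
    if with_inner_hits:
        # Ask for the matching nested objects and their individual scores.
        nested["inner_hits"] = {}
    return {"query": {"nested": nested}}


body = build_nested_cosine_query([0.5, 10, 6], score_mode="avg")
```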

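A few things to note:

The query needs to be wrapped in a nested clause (because the vectors are stored as nested objects).
Because nested objects are separate Lucene documents, they are scored individually, and by default the parent document is assigned the average score of the matching nested documents. You can set the nested option score_mode to change this scoring behavior. In your case, "max" scores the parent document by the highest cosine similarity among its nested vectors, i.e. by the most similar one.
If you want to see the score of each nested vector, you can use the nested option inner_hits.
In case anyone wonders why 1.0 is added to the cosine similarity score: cosine similarity takes values in [-1, 1], but Elasticsearch cannot have negative scores. The score is therefore shifted to [0, 2].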
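
To make the effect of score_mode concrete, the per-vector scores can be reproduced outside Elasticsearch. The small Python sketch below assumes a made-up query vector [0.5, 10, 6] and uses the two vectors from the example document. It mirrors the 1.0 + cosine formula in plain Python and is not how Elasticsearch computes scores internally.

```python
import math


def shifted_cosine(a, b):
    """Return 1.0 + cosine similarity, which lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 + dot / (norm_a * norm_b)


query_vector = [0.5, 10, 6]                    # assumed query, for illustration only
doc_vectors = [[0.5, 10, 6], [-0.5, 10, 10]]   # nested vectors of document 1

scores = [shifted_cosine(query_vector, v) for v in doc_vectors]
max_score = max(scores)                 # what score_mode "max" would keep
avg_score = sum(scores) / len(scores)   # what score_mode "avg" would keep
```

With "max", a document ranks by its single best-matching vector; with "avg", one poorly matching vector pulls the whole document's score down.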


33qvvth12#

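The dense_vector data type is meant to
store dense vectors of float values (from the documentation) ... A dense_vector field is a single-valued field.
In your example, you want to index multiple vectors in the same property. But as the documentation says, the field must be single-valued. If your document has multiple vectors, they need to be spread across different properties.
There is no workaround :(
So you need to put your vectors in separate fields, then use a loop in your script and keep the most suitable value.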
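
One way to read this suggestion is to give each vector its own dense_vector field and combine the per-field similarities in a single script. The Python sketch below only generates the script source and has not been verified against a particular Elasticsearch version. The field names vec_0 and vec_1 and the helper build_max_similarity_script are hypothetical. It unrolls the comparison into nested Math.max calls instead of a Painless loop, and guards each field with a doc[...].size() == 0 check so that a missing field contributes 0.0.

```python
def build_max_similarity_script(fields):
    """Generate a script source that keeps the best 1.0 + cosine score across fields."""
    if not fields:
        raise ValueError("at least one field is required")
    # One guarded term per field: 0.0 when the document has no value for it.
    terms = [
        f"(doc['{name}'].size() == 0 ? 0.0 : "
        f"1.0 + cosineSimilarity(params.query_vector, '{name}'))"
        for name in fields
    ]
    source = terms[0]
    for term in terms[1:]:
        source = f"Math.max({source}, {term})"
    return source


# Hypothetical field names; each would need its own dense_vector mapping.
script_source = build_max_similarity_script(["vec_0", "vec_1"])
```

The generated string would go in the "source" of a script_score query, with query_vector supplied through params as in the question.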


kyvafyod3#

I landed on this post while trying to store an array of vectors in my document.
When I do this:

"mappings": {
    "properties": {
        "vectors": {
            "type": "nested",
            "properties": {
                "vector": {
                    "type": "dense_vector",
                    "dims": 768,
                    "index": true,
                    "similarity": "cosine"
                }
            }   
        },
        "my_text" : {
            "type" : "keyword"
        }
    }
}

I get:
BadRequestError: BadRequestError(400, 'illegal_argument_exception', "[dense_vector] fields cannot be indexed if they're within [nested] mappings")
If I remove index: true and "similarity": "cosine", the problem goes away (but then I cannot use kNN, which is my main goal).
Hope this helps someone.

