llama_index [Bug]: Scores of nodes retrieved via the Weaviate integration are in reverse order

qhhrdooz · posted 4 months ago · in Other

Bug description

Hello,
I am using a retriever from a vector store index initialized from a Weaviate collection. I noticed that the scores of the retrieved nodes are in reverse order: the first (most relevant) node has a score equal to zero, and the scores increase as we move toward the least relevant nodes.
We found in the code that LlamaIndex performs the subtraction 1 - score, where score is the value returned by Weaviate. But Weaviate now returns a similarity score rather than a distance. I believe a distance instead of a similarity can only be returned for pure vector search, not for hybrid search (see here). You can use the code below (from a Jupyter notebook) to compare the scores reported by LlamaIndex with the scores returned by Weaviate.
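The inversion can be demonstrated in isolation. A minimal sketch (the score values are made up) showing that applying 1 - score to values that are already similarities flips the ranking:

```python
# Hypothetical hybrid-search similarity scores returned by Weaviate,
# sorted most-relevant first (higher = more similar).
weaviate_scores = [1.0, 0.75, 0.5, 0.25]

# LlamaIndex treats the value as a distance and converts it with
# 1 - score; when the value is actually a similarity, the ranking flips.
llamaindex_scores = [1 - s for s in weaviate_scores]

print(llamaindex_scores)  # [0.0, 0.25, 0.5, 0.75]
# The most relevant node now carries the LOWEST score, exactly as
# observed in this issue.
```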

Versions

llama-index==0.10.53
llama-index-vector-stores-weaviate==1.0.0
weaviate-client==4.6.5

Steps to reproduce

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Document
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.core.schema import TextNode

from llama_index.embeddings.text_embeddings_inference import TextEmbeddingsInference
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SimpleNodeParser

import weaviate
import os

from transformers import AutoTokenizer, AutoModel
import tiktoken
import requests
from IPython.display import Markdown, display

# In[ ]:
# Embeddings initialization: OpenAI
embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.environ.get("OPEN_AI_API_KEY"))
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small").encode

# In[ ]:
tokenizer_obj = tokenizer
# The chunk_size must be compatible with the sequence length of the embed_model_obj that is used.
chunk_size = 450
chunk_overlap = 50
# Initialize a node parser that we will use in the documents parsing.
# First initialize the TokenCountingHandler with our tokenizer and the CallbackManager with our token counter.
# And then the node parser.
token_counter_handler = TokenCountingHandler(tokenizer=tokenizer_obj)
callback_manager = CallbackManager([token_counter_handler])
node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size,
                                                  chunk_overlap=chunk_overlap,
                                                  callback_manager=callback_manager)

# In[66]:
client = weaviate.connect_to_local()

# In[127]:
# Now that the collection is already created we just connected to it.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)

# In[128]:
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                                        embed_model=embed_model,
                                                        transformations=[node_parser],
                                                        show_progress=True)

# In[100]:
def get_wikipedia_article_text(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json", "prop": "extracts", "explaintext": True, "titles": title}
    response = requests.get(url, params=params).json()
    page = next(iter(response["query"]["pages"].values()))
    return page.get("extract", "Article not found.")

python_doc_text = get_wikipedia_article_text("Python (programming language)")
lion_doc_text = get_wikipedia_article_text("Lion")
lion_paragraph = lion_doc_text[:1000]

# In[25]:
python_doc = Document(doc_id='1',
                      text=python_doc_text,
                      metadata={
                           "title_of_parental_document": "Python_(programming_language)",
                           "source": "https://en.wikipedia.org/wiki/Python_(programming_language)"
                       })

# In[101]:
lion_doc = Document(doc_id='2',
                    text=lion_paragraph,
                    metadata={
                       "title_of_parental_document": "Lion",
                       "source": "https://en.wikipedia.org/wiki/Lion"
                   })

# In[104]:
vector_store_index.insert(document=python_doc)
vector_store_index.insert(document=lion_doc)

# In[129]:
retriever = vector_store_index.as_retriever(similarity_top_k=10, 
                                            vector_store_query_mode="hybrid",
                                            alpha=0.5)
nodes = retriever.retrieve("What is lion?")

# In[131]:
# The retriever always returns the list of nodes in descending order by score (most relevant chunks first).
# But why does the most relevant chunk here have a score of zero?
for node in nodes:
    print(node.text)
    print()
    print(node.score)
    print("__________________________________________________________________________________________________________")
    print("__________________________________________________________________________________________________________")

print([node.score for node in nodes])
# The scores are: [0.0, 0.9217832833528519, 0.9288488179445267, 0.9365298748016357, 0.937725093215704, 0.9396311119198799, 0.9409564286470413, 0.9446112886071205, 0.9455222226679325, 0.9476451091468334]


# In[108]:

# Code to query Weaviate without LlamaIndex.
query = "what is lion?"
query_vector = embed_model.get_query_embedding(query=query)

# In[121]:
from weaviate.classes.query import MetadataQuery

# In[123]:
# Assumption: `collection` is the same collection that backs the vector store above.
collection = client.collections.get("Test")
response = collection.query.hybrid(
    query=query,
    vector=query_vector,
    return_metadata=MetadataQuery(distance=True,
                                  certainty=True,
                                  score=True,
                                  explain_score=True),
    alpha=0.50,
    limit=10,
)
x=[]
for o in response.objects:
    print(o.properties)
    print()
    print(o.metadata)
    print("_______________")
    x.append(o.metadata.score)

print(x)

# Scores from Weaviate:
# [1.0, 0.08082851022481918, 0.07240726053714752, 0.07034952938556671, 0.0660715326666832, 0.06566160172224045, 0.06270790100097656, 0.05426621064543724, 0.05242578685283661, 0.05184878036379814]
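Comparing the two score lists above (the queries differ slightly in capitalization between the two runs, so the numbers do not match exactly), the score LlamaIndex reports is approximately 1 minus the similarity Weaviate returns. A quick check on the truncated values:

```python
# First few scores from the two runs above (truncated).
weaviate_scores = [1.0, 0.0808, 0.0724]
llamaindex_scores = [0.0, 0.9218, 0.9288]

for w, li in zip(weaviate_scores, llamaindex_scores):
    # 1 - (Weaviate similarity) is roughly the score LlamaIndex reports.
    print(f"1 - {w:.4f} = {1 - w:.4f}, LlamaIndex reported {li:.4f}")
```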

2ledvvac1#

This happens because LlamaIndex performs the subtraction (1 - score) on the score returned by Weaviate, which now returns a similarity score rather than a distance. As a result, the scores come out inverted.
To resolve this, you can modify the code to handle similarity scores correctly. Here is how to adjust the scores:

  1. Modify the score calculation:
    Update the part of the code that computes the score so it uses the similarity score directly instead of subtracting it from 1.
  2. Reverse the order of the retrieved results:
    Make sure the results are sorted in descending order of relevance.
    Here is an example of the adjusted code:
# Modify the score calculation
def adjust_scores(scores):
    return scores  # Use the similarity scores directly

# Reverse the order of the retrieved results
response = retriever.retrieve("What is lion?")
response = response[::-1]  # Reverse the order of the results
print(response[0])
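A user-side workaround is to undo the conversion on the retrieved nodes and re-sort by the restored similarity, rather than blindly reversing the list. A sketch using a stand-in class for llama_index's NodeWithScore (the stub and its sample values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class NodeStub:  # stand-in for llama_index's NodeWithScore
    text: str
    score: float

# Scores as LlamaIndex currently reports them (inverted).
nodes = [
    NodeStub("lion paragraph", 0.0),   # actually the best match
    NodeStub("python chunk A", 0.92),
    NodeStub("python chunk B", 0.95),
]

# Undo the 1 - score conversion, then sort by similarity, descending.
for node in nodes:
    node.score = 1 - node.score
nodes.sort(key=lambda n: n.score, reverse=True)

print([n.text for n in nodes])
# ['lion paragraph', 'python chunk A', 'python chunk B']
```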

Additionally, if you are using the VectorContextRetriever from the LlamaIndex framework, you can modify the retrieve_from_graph method to handle the scores correctly:

class VectorContextRetriever(BasePGRetriever):
    # ... (other parts of the class)

    def retrieve_from_graph(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_store_query = self._get_vector_store_query(query_bundle)

        triplets = []
        kg_ids = []
        new_scores = []
        if self._graph_store.supports_vector_queries:
            result = self._graph_store.vector_query(vector_store_query)
            if len(result) != 2:
                raise ValueError("No nodes returned by vector_query")
            kg_nodes, scores = result

            kg_ids = [node.id for node in kg_nodes]
            triplets = self._graph_store.get_rel_map(
                kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
            )
        elif self._vector_store is not None:
            query_result = self._vector_store.query(vector_store_query)
            if query_result.nodes is not None and query_result.similarities is not None:
                kg_ids = self._get_kg_ids(query_result.nodes)
                scores = query_result.similarities
                kg_nodes = self._graph_store.get(ids=kg_ids)
                triplets = self._graph_store.get_rel_map(
                    kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
                )

            elif query_result.ids is not None and query_result.similarities is not None:
                kg_ids = query_result.ids
                scores = query_result.similarities
                kg_nodes = self._graph_store.get(ids=kg_ids)
                triplets = self._graph_store.get_rel_map(
                    kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
                )

        for triplet in triplets:
            score1 = (
                scores[kg_ids.index(triplet[0].id)] if triplet[0].id in kg_ids else 0.0
            )
            score2 = (
                scores[kg_ids.index(triplet[2].id)] if triplet[2].id in kg_ids else 0.0
            )
            new_scores.append(max(score1, score2))

        assert len(triplets) == len(new_scores)

        # sort by score
        top_k = sorted(zip(triplets, new_scores), key=lambda x: x[1], reverse=True)

        return self._get_nodes_with_score([x[0] for x in top_k], [x[1] for x in top_k])[::-1]  # Reverse the order of the results

This modification will ensure that the retrievers 1(https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/WeaviateIndex_auto_retriever.ipynb)2(https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-multidoc-autoretrieval/llama_index/packs/multidoc_autoretrieval/base.py)3(https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/property_graph/sub_retrievers/vector.py) return results directly in the correct relevance order.

  • To continue talking to Dosu, mention @dosu.

1rhkuytd2#

It sounds like you have a good grasp of the issue. Feel free to open a PR 👍


332nm8kg3#

Hi @logan-markewich, thanks! I'd like to give it a try. I've also found a few other issues in the Weaviate integration; I plan to open separate threads for them and try to fix them. Is that okay?
