Bug description
Hello,
I am using a retriever from a vector store index initialized from a Weaviate collection. I noticed that the scores of the retrieved nodes are in reversed order: the first (most relevant) node has a score equal to zero, and the scores increase as we move toward the least relevant nodes.
We found in the code that LlamaIndex performs the subtraction 1 - score, where score is the score returned by Weaviate. But Weaviate now returns a similarity score rather than a distance. I believe a distance (rather than a similarity) can only be returned for vector search, not for hybrid search (see here). You can use the code below (from a Jupyter notebook) to see the scores reported by LlamaIndex as well as those returned by Weaviate.
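To make the effect concrete, here is a minimal sketch of the suspected conversion (illustrative only, not the actual library code):
# Illustrative sketch only -- not the actual LlamaIndex code.
# Weaviate hybrid search returns similarity-like scores, best hit first:
weaviate_scores = [1.0, 0.0808, 0.0724, 0.0542]
# Applying 1 - score inverts the ranking: the best hit ends up with 0.0.
llama_scores = [1.0 - score for score in weaviate_scores]
print(llama_scores)  # [0.0, 0.9192, 0.9276, 0.9458]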
Versions
llama-index==0.10.53
llama-index-vector-stores-weaviate==1.0.0
weaviate-client==4.6.5
Steps to reproduce
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Document
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.core.schema import TextNode
from llama_index.embeddings.text_embeddings_inference import TextEmbeddingsInference
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SimpleNodeParser
import weaviate
import os
from transformers import AutoTokenizer, AutoModel
import tiktoken
import requests
from IPython.display import Markdown, display
# Embeddings initialization: OpenAI
embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.environ.get("OPEN_AI_API_KEY"))
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small").encode
tokenizer_obj = tokenizer
# The chunk_size must be compatible with the sequence length of the embed_model that is used.
chunk_size = 450
chunk_overlap = 50
# Initialize the node parser that we will use for parsing the documents.
# First initialize the TokenCountingHandler with our tokenizer and the CallbackManager with our token counter,
# and then the node parser itself.
token_counter_handler = TokenCountingHandler(tokenizer=tokenizer_obj)
callback_manager = CallbackManager([token_counter_handler])
node_parser = SimpleNodeParser.from_defaults(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    callback_manager=callback_manager,
)
client = weaviate.connect_to_local()
# The collection has already been created, so we just connect to it.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)
vector_store_index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,
    transformations=[node_parser],
    show_progress=True,
)
def get_wikipedia_article_text(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json", "prop": "extracts", "explaintext": True, "titles": title}
    response = requests.get(url, params=params).json()
    page = next(iter(response["query"]["pages"].values()))
    return page.get("extract", "Article not found.")
python_doc_text = get_wikipedia_article_text("Python (programming language)")
lion_doc_text = get_wikipedia_article_text("Lion")
lion_paragraph = lion_doc_text[:1000]
python_doc = Document(
    doc_id='1',
    text=python_doc_text,
    metadata={
        "title_of_parental_document": "Python_(programming_language)",
        "source": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    },
)
lion_doc = Document(
    doc_id='2',
    text=lion_paragraph,
    metadata={
        "title_of_parental_document": "Lion",
        "source": "https://en.wikipedia.org/wiki/Lion",
    },
)
vector_store_index.insert(document=python_doc)
vector_store_index.insert(document=lion_doc)
retriever = vector_store_index.as_retriever(
    similarity_top_k=10,
    vector_store_query_mode="hybrid",
    alpha=0.5,
)
nodes = retriever.retrieve("What is lion?")
# The retriever always returns the list of nodes in descending order of score (the most relevant chunks come first).
# But why does the most relevant chunk here have a score of zero?
for node in nodes:
    print(node.text)
    print()
    print(node.score)
    print("__________________________________________________________________________________________________________")
print("__________________________________________________________________________________________________________")
print([node.score for node in nodes])
# The scores are: [0.0, 0.9217832833528519, 0.9288488179445267, 0.9365298748016357, 0.937725093215704,
#                  0.9396311119198799, 0.9409564286470413, 0.9446112886071205, 0.9455222226679325, 0.9476451091468334]
# Code to query Weaviate without LlamaIndex.
query = "what is lion?"
query_vector = embed_model.get_query_embedding(query=query)
from weaviate.classes.query import MetadataQuery

# `collection` is the Weaviate collection backing the index (created as "Test" above).
collection = client.collections.get("Test")
response = collection.query.hybrid(
    query=query,
    vector=query_vector,
    return_metadata=MetadataQuery(
        distance=True,
        certainty=True,
        score=True,
        explain_score=True,
    ),
    alpha=0.50,
    limit=10,
)
x = []
for o in response.objects:
    print(o.properties)
    print()
    print(o.metadata)
    print("_______________")
    x.append(o.metadata.score)
print(x)
# Scores from Weaviate:
# [1.0, 0.08082851022481918, 0.07240726053714752, 0.07034952938556671, 0.0660715326666832,
#  0.06566160172224045, 0.06270790100097656, 0.05426621064543724, 0.05242578685283661, 0.05184878036379814]
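# Comparing the two lists shows the inversion directly: Weaviate's top hit scores 1.0,
# which LlamaIndex reports as 1 - 1.0 = 0.0, while the weakest hit's ~0.0518 becomes
# 1 - 0.0518 ≈ 0.948, so the reported ordering is exactly reversed.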
3 answers
2ledvvac 1#
This issue arises because LlamaIndex performs the subtraction 1 - score on the score returned by Weaviate, which now returns a similarity score rather than a distance. This causes the scores to be inverted. To resolve it, you can modify the code to handle similarity scores correctly. Here is how to adjust the scores:
Update the part of the code that computes the score so that it uses the similarity score directly instead of subtracting it from 1.
Make sure the results are sorted in descending order of relevance.
Here is an example of the adjusted code:
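A rough sketch of the idea (hypothetical; the exact function and its location inside llama-index-vector-stores-weaviate may differ):
# Hypothetical sketch -- not the actual llama-index-vector-stores-weaviate code.
def to_node_score(metadata, used_hybrid: bool) -> float:
    if used_hybrid:
        # Hybrid search returns a similarity-like score (higher is better),
        # so use it directly instead of computing 1 - score.
        return metadata.score
    # Pure vector search can return a distance (lower is better),
    # so convert it to a similarity before ranking.
    return 1.0 - metadata.distance
With a conversion like this, the node that Weaviate scores highest (1.0) also gets the highest LlamaIndex score, so the usual descending sort reflects actual relevance.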
Additionally, if you are using the VectorContextRetriever from the LlamaIndex framework, you can modify the retrieve_from_graph method to handle the scores correctly. This change will ensure that the retriever returns results directly in the correct order of relevance.
1. https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/WeaviateIndex_auto_retriever.ipynb
2. https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-multidoc-autoretrieval/llama_index/packs/multidoc_autoretrieval/base.py
3. https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/indices/property_graph/sub_retrievers/vector.py
1rhkuytd 2#
It sounds like you have a solid understanding of the issue. Feel free to open a PR 👍
332nm8kg 3#
Hi @logan-markewich, thank you! I'd like to give it a try. I've also found a few other issues in the Weaviate integration; I plan to open separate threads for them and try to fix them. Is that okay?